Before normal clock speeds hit two digits in MHz, cache design wasn't a big issue. But DRAM's memory-cycle times just aren't fast enough to keep up with today's processors. Thus, your machine's memory controller caches memory references in faster static RAM (SRAM), reading from main memory in chunks that the board designer hopes will be large enough to keep the CPU continuously fed under a typical job load. If the cache system fails to do its job, the processor is slowed down to the memory's real cycle speed or worse --- which, given typical 70ns DRAM parts (a full read/write cycle of roughly 140ns), is about 7MHz.
You'll sometimes hear the terms L1, L2, and L3 cache. These refer to Level 1, Level 2, and Level 3 cache. L1, today, is always on the CPU (well, unless you're HP). L2 is off-chip cache. L3 won't be found on PC-class machines. Anything that will fit in L1 can be run at full CPU speed, as there is no need to go off chip. Anything (or things) too large to fit in L1 will try to run in L2. If it fits in L2, you still won't have to deal with the bus. If you're familiar with virtual memory, think of it this way: when you run out of L1, you swap to L2; when you run out of L2, you swap to main memory; when you run out of main memory, you swap to disk. Each stage is slower, and more prone to contention with other parts of the system, than the one before it.
Cache is like memory, but is faster, more expensive, and smaller. L1 is generally faster and smaller than L2, which is generally faster and smaller than L3, which is faster and smaller than memory. Some PCs will not have L2 or L3. Most workstation-class machines have L1 and L2. L3 is rare, even on big expensive Unix servers, but will become more common when CPUs start coming with L1 and L2 on-chip.
Because most L1 caches are on the CPU chip, there isn't very much room for them, so they tend to be small. It looks like the Pentium has 2 L1 caches, one for instructions (I-cache) and one for data (D-cache); each is 8 KB. If this is the only cache configuration available for the Pentium, all Pentium laptops you look at will have it.
The size of the L2 cache you get will depend on what brand and model of laptop you buy, since Compaq and Fujitsu and NEC can decide independently how much L2 cache to put on their motherboards (within a range defined by the CPU chip). It's usually decided by the marketing people, not the technical people, based on what chips are available at what prices and what price they intend to sell the computer for. It looks like most benchmark results you'll see are with 256 or 512 KB of L2 cache; AT&T makes one Pentium-based server with 4 MB of L2 cache.
There are other cache-related buzzwords you may encounter.
"Write-back" means that when you update something in "memory" the cache doesn't actually push the new value out to the memory chips (or to L2, if it's an L1 write-back) until the "line" gets replaced in the cache. ("line" is the chunk-size caches act on, usually a small number of bytes like 8, 16, 32, or 64 for L1, or 32, 64, 128, or 256 for L2)
"Write-through" means that when you update "memory" the cache updates its value as well as sending an immediate update to physical memory (or to L2 if it's an L1 write-through). Write-back is generally faster if your application fits in the cache.
"Non-blocking, out of order" means that the CPU looks at the next N instructions it's about to execute. It executes the first and finds that the data isn't in cache. Since it's boring to just wait around for the data to come back from memory, it looks at the next instruction. If that 2nd instruction doesn't need the data the 1st instruction is waiting on, the CPU goes ahead and executes that instruction. If the 3rd instruction does need the data, it remembers it needs to execute that one after the data comes in and goes on to the 4th instruction. Depending on how many outstanding requests are allowed, if the 4th one causes a cache miss on a different line it may put that one on hold as well and go on to the 5th instruction. The Pentium Pro can do this, but I don't think the Pentium can.
"Set-associative" means the cache is split into 2 or more mini-caches. Because of the way things are accessed in a cache, this can help a program that has some "badly behaving" code mixed with some "good" code. Other terms that go with it are "LRU" (the mini-cache picked for replacement is the one Least Recently Used) or "random" (the line picked is selected randomly).
These features can make a big difference in how happy you are with your system's performance. There are enough variables that you probably aren't going to be able to predict how happy you'll be with a configuration unless you sit down in front of the machine and run whatever it is you plan to run on it. Make up your own benchmark floppy with your primary application to take with you to showrooms. (Throw it away after all your test drives, since it will probably have collected a virus or three.)
Bigger or faster isn't always better. Speed is usually a tradeoff with size, and you have to match L2 cache size/speed to CPU speed. A system with a faster MHz CPU could perform worse than a system with a slower chip because the CPU<-->L2 speed match might be such that the faster CPU requires a different, slower mode on the L2 connection.
If all you want to do is run MyLittleSpreadsheet, and the code and data all fit in 400 KB, a system with 512 KB of L2 cache will likely run more than twice as fast as a system with 256 KB of L2. If MLS fits in 600 KB and has a very sequential access pattern (a "cache-buster"), the 128 KB and 256 KB systems will perform about the same -- like a dog; if the pattern is random rather than sequential, the 512KB system will probably do some fractional amount better than the 128 KB system. This is why it's so important to try out your application and ignore impressive numbers for programs you're never going to run.
Also, you may find http://www.complang.tuwien.ac.at/misc/doombench.html useful; it's the Doom benchmark page :-)
One side-effect of what's today considered "good programming practice", with high-level languages using a lot of subroutine calls, is that the program counter of a typical process hops around like crazy. You might think that this, together with the periodic clock interrupts for multitasking, would make spatial locality very poor.
However, the clock interrupt only fires about 60 times per second. This is a very low overhead, if you consider how many instructions can be executed at 60 MHz in 1/60th of a second (for a rough estimate, something like 30 MIPS * 1/60 = half a million instructions; at 16 bits each, roughly a megabyte of memory has been walked through!). This is lots of opportunity to take advantage of temporal locality -- and most programs are not so large that their time-critical parts won't fit inside a megabyte. (Thanks to Michael T. Pins and Joan Eslinger for much of this section.)
Modern system designs have two levels of caching; a primary or internal cache right on the chip, and a secondary or external cache in high-speed memory (typically static RAM) off-chip. The internal cache feeds the processor directly; the external cache feeds the internal cache.
A cache is said to hit when the processor, or a higher-level cache, calls for a particular memory location and gets it. Otherwise, it misses and has to go to main memory (or at least the next lower level of cache) to fetch the contents of the location. A cache's hit rate is the percentage of time, considered as a moving average, that it hits.
The external cache is added to reduce the cost of an internal cache miss. To speed the whole process up, it must serve the internal cache faster than main memory would be able to do (to hide the slowness of main memory). Thus, we desire a very high hit rate in the secondary cache as well as very high bandwidth to the processor.
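A back-of-envelope calculation shows why: the average access time is just the hit and miss costs weighted by the hit rate. The sketch below uses made-up round numbers for the latencies, not measured figures:

    /* Average access time for a cache with the given hit rate. */
    double avg_access_ns(double hit_rate, double hit_ns, double miss_ns)
    {
        return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns;
    }

    /* With 20ns SRAM and a 140ns DRAM cycle:
         avg_access_ns(0.95, 20.0, 140.0) = 26ns
         avg_access_ns(0.80, 20.0, 140.0) = 44ns
       Fifteen points of hit rate cost almost 70% in average
       access time. */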
Obviously, secondary cache hit rate can be improved by making it bigger. It can also be increased by increasing the associativity factor (more on this later, but for now note that too much associativity can cost a big penalty).
A cache is divided up into lines. Typically, in an i486 system, each line is 4 to 16 bytes long (the i486 internal cache uses 16-byte lines; external line size varies). When the processor reads from an external-cache address that is not in the internal cache, that address and the surrounding 16 bytes are read into a line.
Each cache line has a tag associated with it. The tag stores the address in memory that the data in that cache line came from. (Plus a bit to indicate that this line contains valid data).
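Putting lines and tags together, here is how an address gets sliced up. The sketch assumes a direct-mapped organization and round-number sizes (8K of cache, 16-byte lines) purely for illustration; the real i486 internal cache is 4-way set-associative:

    #include <stdio.h>

    #define LINE_BYTES  16
    #define CACHE_BYTES (8 * 1024)
    #define NLINES      (CACHE_BYTES / LINE_BYTES)  /* 512 lines */

    int main(void)
    {
        unsigned long addr = 0x123456;  /* an arbitrary address      */
        unsigned long offset = addr % LINE_BYTES;  /* byte in line   */
        unsigned long index  = (addr / LINE_BYTES) % NLINES;
        unsigned long tag    = addr / (LINE_BYTES * NLINES);

        /* index picks the line; the stored tag (plus its valid bit)
           is compared against tag to decide hit or miss. */
        printf("addr 0x%lx -> tag 0x%lx, line %lu, offset %lu\n",
               addr, tag, index, offset);
        return 0;
    }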
Recall the "write-back" and "write-through" policies defined earlier; here is how they affect the interaction between cache and memory:
Even when the secondary cache line being replaced is not dirty (that is, has not been modified since it was read in), the service time goes up, because the dirty bit must first be examined before accessing main memory. Write-through caches have the advantage of being able to look up data in the secondary cache and in main memory in parallel (in the case where the secondary cache misses, some of the delay of looking in main memory has already been taken away). (Write-back caches cannot do this, because they might have to write back the cache line before doing the main-memory read.)
For these reasons, write-back caches are generally regarded as inferior to write-through caches backed by write-posting buffers. They cost too much silicon and more often than not perform worse.
Now some terms that describe cache organization. To understand these, think of main RAM as being divided into consecutive, non-overlapping segments we'll call "regions". A typical region size is 64K. Each region is mapped to a cache line, 4 to 128 bytes in size (a typical size is 16). When the processor reads from an address in a given region, and that address is not already in the cache, the location and others near it are read into a line.
Because set-associative caches make better use of SRAM, they typically require less SRAM than a direct-mapped cache for equivalent performance. They're also less vulnerable to UNIX's heavy memory usage. Andy Glew of USENET's comp.arch group says "the usual rule of thumb is that a 4-way set-associative cache is equivalent to a direct-mapped cache of twice the size". On the other hand, some claim that as cache size gets larger, two-way associativity becomes less useful. According to this school of thought it actually becomes a net loss over a direct-mapped cache at cache sizes over 256K.
So, typically, you see multi-set cache designs on internal caches, but direct-mapped designs for external caches.
The 486's 8K internal primary cache is typically supplemented with an external caching system (L2) using SRAM to reduce the cost of an internal cache miss; in November 1994, 20ns SRAM is typical.
What varies between motherboards is the design of this secondary cache.
Because the 486 has write-posting buffers, a write-through external cache is OK. Remember that the i486 already has a 4-deep write buffer. Writes are done when there is little bus activity (the most common case) or when the buffer is full. The write buffer will write out its contents while the i486 CPU is still crunching away from the internal cache. Thus, a write-through cache has little negative impact on write performance. If the buffer were usually full, the i486 would be stalling while the writes were done (and you would then want a write-back cache, or, a better choice, another external write buffer).
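Here is a toy C sketch of such a write-posting buffer, 4 deep like the i486's; bus_idle() and dram_write() are stand-in stubs, not real interfaces:

    #define DEPTH 4   /* the i486 write buffer is 4 deep */

    struct post { unsigned long addr, val; };
    static struct post fifo[DEPTH];
    static int head, count;

    static int  bus_idle(void) { return 1; }           /* stub */
    static void dram_write(unsigned long a, unsigned long v)
    { (void)a; (void)v; }                              /* stub */

    /* CPU-side store: only stalls when all 4 slots are full. */
    void post_write(unsigned long addr, unsigned long val)
    {
        while (count == DEPTH) {       /* the rare, bad case: stall */
            dram_write(fifo[head].addr, fifo[head].val);
            head = (head + 1) % DEPTH;
            count--;
        }
        fifo[(head + count) % DEPTH].addr = addr;
        fifo[(head + count) % DEPTH].val  = val;
        count++;
    }

    /* Called when the bus goes quiet: drain one posted write while
       the CPU keeps crunching out of its internal cache. */
    void drain_if_idle(void)
    {
        if (count > 0 && bus_idle()) {
            dram_write(fifo[head].addr, fifo[head].val);
            head = (head + 1) % DEPTH;
            count--;
        }
    }

The point to notice is that post_write stalls only in the rare full-buffer case; in the common case the store costs the CPU nothing, which is why a write-through external cache is acceptable here.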
Also, recall that one of the goals of the secondary cache is to have very high effective bandwidth to the i486 processor. Any associativity greater than direct-mapped measurably increases the lookup time in the secondary cache and increases the time to service an internal cache miss, and thus reduces the effective bandwidth to the i486. Thus, direct-mapped secondary caches provide a lower $$ cost AND increase the performance in the common case of a secondary cache hit, even though the hit rate has been slightly reduced by not adding the associativity.
In the presence of interleaved DRAM memory, a cache line should not be larger than a whole DRAM line: with a 4-byte-wide bus, double interleaving gives a 2*4 = 8-byte DRAM line, quadruple interleaving a 4*4 = 16-byte line. Otherwise, memory fetches to the external cache get slow.
Bela Lubkin writes: "Excess RAM [over what your cache can support] is a very bad idea: most designs prevent memory outside the external cache's cachable range from being cached by the 486 internal cache either. Code running from this memory runs up to 11 times slower than code running out of fully cached memory."
A more sophisticated way of determining cache size is to estimate the number of processes you expect to be running simultaneously (i.e. 1 + expected load average; call this value N). Your external cache should be about N * 32K in size. The justification for this is as follows: upon a context switch, it is a good idea to be able to hold the entire i486 internal cache in the secondary cache. For each process you need something less than 8K * 4 = 32K (since the internal cache is 4-way set-associative, 32K of direct-mapped secondary cache is needed to cover every line the internal cache might hold); the 24K per process left over beyond the 8K actually in use should be plenty to help improve the hit rate of the secondary cache when the internal cache misses. The number of main-memory accesses caused by context switching should be reduced.
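The rule of thumb reduces to one line of arithmetic. A minimal sketch, assuming "load average" is the figure uptime(1) reports:

    /* Suggested external cache size from the N * 32K rule above. */
    long suggested_l2_bytes(double load_avg)
    {
        long n = 1 + (long)load_avg;   /* expected processes, N */
        return n * 32L * 1024L;        /* 32K of L2 per process */
    }
    /* Example: load average 3 -> N = 4 -> 128K of L2. */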
Of course, if you are going to be running programs with large memory requirements (especially data), then a huge secondary cache would probably be a big win. But most programs in the run queue will be small (ls, cat, more, etc.).
Gerard Lemieux qualifies this by observing that if adding SRAM increases the external cache line size, rather than increasing the number of cache lines, it's a lose. If this is the case, then an external cache miss could cost you dearly; imagine how long the processor would have to wait if the line size grew to 1024 bytes. If the cache has a poor hit rate (likely true, since the number of lines has not changed), performance would deteriorate.
Also (he observes), spending an additional $250 for cache chips might buy you 2-3% in performance (even in UNIX). You must ask yourself if it is really worth it.