(This document is part of the PC-Clone UNIX Hardware Buyer's Guide. The Guide is maintained by Eric S. Raymond; please email comments and corrections to him.)



Cache Flow

Overview

The most obscure of the important factors in the performance of a clone system is the motherboard's memory subsystem design, consisting of the primary and secondary caches and the DRAM behind them. The two questions performance-minded buyers have to deal with are: (1) does the cache design of a given motherboard work with UNIX, and (2) how much cache SRAM should my system have, and how should it be organized?

Before normal clock speeds hit two digits in MHz, cache design wasn't a big issue. But DRAM's memory-cycle times just aren't fast enough to keep up with today's processors. Thus, your machine's memory controller caches memory references in faster static RAM (SRAM), reading from main memory in chunks that the board designer hopes will be large enough to keep the CPU continuously fed under a typical job load. If the cache system fails to work, the processor will be slowed down to the memory's real cycle speed or worse; with typical 70ns DRAM parts, whose full read cycle is roughly twice the access time, that works out to about 7MHz.

You'll sometimes hear the terms L1, L2, and L3 cache. These refer to Level 1, Level 2, and Level 3 cache. L1, today, is always on the CPU (well, unless you're HP). L2 is off-chip cache. L3 won't be found on PC-class machines. Anything that will fit in L1 can be run at full CPU speed, as there is no need to go off chip. Anything too large to fit in L1 will try to run in L2. If it fits in L2, you still won't have to deal with the bus. If you're familiar with virtual memory, think of it this way: when you run out of L1, you swap to L2; when you run out of L2, you swap to main memory; when you run out of main memory, you swap to disk. Each stage is slower, and more prone to conflicts with other parts of the system, than the one above it.

Cache is like memory, but is faster, more expensive, and smaller. L1 is generally faster and smaller than L2, which is generally faster and smaller than L3, which is faster and smaller than memory. Some PCs will not have L2 or L3. Most workstation-class machines have L1 and L2. L3 is rare, even on big expensive Unix servers, but will become more common when CPUs start coming with L1 and L2 on-chip.

Because most L1 caches are on the CPU chip, there isn't very much room for them, so they tend to be small. It looks like the Pentium has two L1 caches, one for instructions (I-cache) and one for data (D-cache); each is 8 KB. If this is the only cache size available for the Pentium, all laptops you look at will have this.

The size of the L2 cache you get will depend on what brand and model of laptop you buy, since Compaq and Fujitsu and NEC can decide independently how much L2 cache to put on their motherboards (within a range defined by the CPU chip). It's usually decided by the marketing people, not the technical people, based on what chips are available at what prices and what price they intend to sell the computer for. It looks like most benchmark results you'll see are with 256 or 512 KB of L2 cache; AT&T makes one Pentium-based server with 4 MB of L2 cache.

There are other cache-related buzzwords you may encounter.

"Write-back" means that when you update something in "memory" the cache doesn't actually push the new value out to the memory chips (or to L2, if it's an L1 write-back) until the "line" gets replaced in the cache. ("line" is the chunk-size caches act on, usually a small number of bytes like 8, 16, 32, or 64 for L1, or 32, 64, 128, or 256 for L2)

"Write-through" means that when you update "memory" the cache updates its value as well as sending an immediate update to physical memory (or to L2 if it's an L1 write-through). Write-back is generally faster if your application fits in the cache.

"Non-blocking, out of order" means that the CPU looks at the next N instructions it's about to execute. It executes the first and finds that the data isn't in cache. Since it's boring to just wait around for the data to come back from memory, it looks at the next instruction. If that 2nd instruction doesn't need the data the 1st instruction is waiting on, the CPU goes ahead and executes that instruction. If the 3rd instruction does need the data, it remembers it needs to execute that one after the data comes in and goes on to the 4th instruction. Depending on how many outstanding requests are allowed, if the 4th one causes a cache miss on a different line it may put that one on hold as well and go on to the 5th instruction. The Pentium Pro can do this, but I don't think the Pentium can.

"Set-associative" means the cache is split into 2 or more mini-caches. Because of the way things are accessed in a cache, this can help a program that has some "badly behaving" code mixed with some "good" code. Other terms that go with it are "LRU" (the mini-cache picked for replacement is the one Least Recently Used) or "random" (the line picked is selected randomly).

These details can make a big difference in how happy you are with your system's performance. There are enough variables that you probably aren't going to be able to predict how happy you'll be with a configuration unless you sit down in front of the machine and run whatever it is you plan to run on it. Make up your own benchmark floppy with your primary application to take with you to showrooms. (Throw it away after all your test drives, since it will probably have collected a virus or three.)

Bigger or faster isn't always better. Speed is usually a tradeoff with size, and you have to match L2 cache size/speed to CPU speed. A system with a faster MHz CPU could perform worse than a system with a slower chip because the CPU<-->L2 speed match might be such that the faster CPU requires a different, slower mode on the L2 connection.

If all you want to do is run MyLittleSpreadsheet, and the code and data all fit in 400 KB, a system with 512 KB of L2 cache will likely run more than twice as fast as a system with 256 KB of L2. If MLS fits in 600 KB and has a very sequential access pattern (a "cache-buster"), the 128 KB and 256 KB systems will perform about the same -- like a dog; if the pattern is random rather than sequential, the 512KB system will probably do some fractional amount better than the 128 KB system. This is why it's so important to try out your application and ignore impressive numbers for programs you're never going to run.

Also, you may find http://www.complang.tuwien.ac.at/misc/doombench.html useful; it's the Doom benchmark page :-)

How Caching Works

Caches get their leverage from exploiting two kinds of locality in memory references. Temporal locality means "if a location is accessed now, it will probably be accessed again soon", while spatial locality means "if byte N was asked for, bytes N+1, N+2, and N+3 will probably be wanted too". Because UNIX multitasks, every context switch violates both types of locality. Spatial locality suffers because different contexts may not be located close together. Temporal locality suffers because you have to wait until all other ready contexts get their chance to run before you can run again.
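
To make spatial locality concrete, here is a small C sketch; the array size and stride are arbitrary values picked for the example. Both loops touch exactly the same bytes, but the first walks memory sequentially and so uses every byte of each cache line it pulls in, while the second hops by a large stride and uses only one byte per line fetch. On a typical cached machine the first loop will normally run noticeably faster.

  /* Illustrative only: sequential vs. strided access over the same array. */
  #include <stdio.h>

  #define N (1024 * 1024)          /* 1 MB of data; arbitrary for the sketch */
  #define STRIDE 4096              /* hop far enough to defeat line fetches  */

  static unsigned char buf[N];

  static long sum_sequential(void)          /* good spatial locality */
  {
      long sum = 0;
      long i;
      for (i = 0; i < N; i++)
          sum += buf[i];
      return sum;
  }

  static long sum_strided(void)             /* poor spatial locality */
  {
      long sum = 0;
      long start, i;
      for (start = 0; start < STRIDE; start++)
          for (i = start; i < N; i += STRIDE)
              sum += buf[i];
      return sum;
  }

  int main(void)
  {
      printf("%ld %ld\n", sum_sequential(), sum_strided());
      return 0;
  }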

One side-effect of what's today considered "good programming practice", with high-level languages using a lot of subroutine calls, is that the program counter of a typical process hops around like crazy. You might think that this, together with the periodic clock interrupts for multitasking, would make spatial locality very poor.

However, the clock interrupt only fires about 60 times per second. This is very low overhead, if you consider how many instructions can be executed at 60 MHz in 1/60th of a second (for a rough estimate, something like 30 MIPS * 1/60 = half a million instructions; at 16 bits each, roughly a megabyte of memory has been walked through!). That is plenty of opportunity to take advantage of temporal locality, and most programs are not so large that their time-critical parts won't fit inside a megabyte. (Thanks to Michael T. Pins and Joan Eslinger for much of this section.)

A Short Primer on Cache Design

Before we go further in discussing specifics of the Intel processors we'll need some basic cache-design theory. (Some of this repeats and extends the Overview.)

Modern system designs have two levels of caching: a primary or internal cache right on the chip, and a secondary or external cache in high-speed memory (typically static RAM) off-chip. The internal cache feeds the processor directly; the external cache feeds the internal cache.

A cache is said to hit when the processor, or a higher-level cache, calls for a particular memory location and gets it. Otherwise, it misses and has to go to main memory (or at least the next lower level of cache) to fetch the contents of the location. A cache's hit rate is the percentage of time, considered as a moving average, that it hits.

The external cache is added to reduce the cost of an internal cache miss. To speed the whole process up, it must serve the internal cache faster than main memory would be able to do (to hide the slowness of main memory). Thus, we desire a very high hit rate in the secondary cache as well as very high bandwidth to the processor.
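
One way to see why both hit rate and per-level speed matter is to compute an effective access time. The little C calculation below is a back-of-the-envelope sketch; the cycle counts and hit rates in it are invented for illustration, not measurements of any particular board.

  /* Back-of-the-envelope effective memory access time.
     All numbers here are made-up examples, not measurements. */
  #include <stdio.h>

  int main(void)
  {
      double l1_time   = 1.0;     /* cycles for an internal (L1) hit        */
      double l2_time   = 4.0;     /* additional cycles for an external hit  */
      double dram_time = 25.0;    /* additional cycles to go to main memory */

      double l1_hit = 0.95;       /* assumed internal hit rate */
      double l2_hit = 0.90;       /* assumed external hit rate */

      /* Every access pays the L1 time; L1 misses also pay the L2 time;
         accesses that miss both also pay the DRAM time. */
      double effective = l1_time
                       + (1.0 - l1_hit) * l2_time
                       + (1.0 - l1_hit) * (1.0 - l2_hit) * dram_time;

      printf("effective access time: %.2f cycles\n", effective);
      return 0;
  }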

Obviously, secondary cache hit rate can be improved by making it bigger. It can also be increased by increasing the associativity factor (more on this later, but for now note that too much associativity can cost a big penalty).

A cache is divided up into lines. Typically, in an i486 system, each line is 4 to 16 bytes long (the i486 internal cache uses 16-byte lines; external line size varies). When the processor reads from an external-cache address that is not in the internal cache, that address and the rest of its 16-byte block are read into a line.

Each cache line has a tag associated with it. The tag stores the address in memory that the data in that cache line came from, plus a bit to indicate that the line contains valid data.
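
To make the tag and line bookkeeping concrete, here is how a direct-mapped cache might carve up an address, sketched in C. The sizes (a 128K cache with 16-byte lines) and the sample address are example values chosen for the sketch, not a description of any particular chipset.

  /* How a direct-mapped cache might carve up a 32-bit address.
     Sizes are example values, not those of any particular chipset. */
  #include <stdio.h>

  #define LINE_SIZE   16                       /* bytes per line      */
  #define CACHE_SIZE  (128 * 1024)             /* 128K external cache */
  #define NUM_LINES   (CACHE_SIZE / LINE_SIZE) /* 8192 lines          */

  int main(void)
  {
      unsigned long addr   = 0x0012A6F4UL;      /* arbitrary example address */
      unsigned long offset = addr % LINE_SIZE;                /* byte in line */
      unsigned long index  = (addr / LINE_SIZE) % NUM_LINES;  /* which line   */
      unsigned long tag    = addr / (LINE_SIZE * NUM_LINES);  /* stored tag   */

      printf("addr %#lx -> tag %#lx, line %lu, offset %lu\n",
             addr, tag, index, offset);
      return 0;
  }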

Some more important terms describing how caches interact with memory:

write-through
It wouldn't do to let your cache get out of sync with main memory. The simplest way to handle this is to arrange that every write to cache generates the corresponding write to main store. In the simplest kind of "write-through" cache, then, you only get cache speedup on reads. But if the cache system includes write posting, writes will usually be fast too.
write posting
Most write-through cache designs have a few `write posting' buffers that allow the cache to accept write data as fast as a write-back cache would, for later writing to memory. So, as far as the processor is concerned, most writes will happen at zero wait states (the full cache speed), unless the processor issues enough writes in a short interval to cause the write posting buffers to fill up faster than they can be emptied.
write-back
For each cache address range in DRAM, writes are done to cache only until a new block has to be fetched due to out-of-range access (at which point the old one is flushed to DRAM). This is much faster, because you get cache speedup on writes as well as reads. It's also more expensive and trickier to get right.
Write-back secondary caches are generally not a good idea. Beyond what was said in the write-through paragraph above, recall that the goal of the secondary cache is to have a high hit rate and high bandwidth to the processor's internal cache. When a cache-miss occurs in the secondary cache, often the line being replaced is dirty and must be written to main memory first. The total time to service the secondary cache miss nearly doubles.

Even when the secondary cache line being replaced is not dirty, the service time goes up, because the dirty bit must first be examined before main memory is accessed. Write-through caches have the advantage of being able to look up data in the secondary cache and in main memory in parallel (so in the case where the secondary cache misses, some of the delay of going to main memory has already been absorbed). (Write-back caches cannot do this, because they might have to write back the old cache line before doing the main-memory read.)

For these reasons, write-back caches are generally regarded as being inferior to write-posting buffers. They cost too much silicon and more often than not perform worse.

Now some terms that describe cache organization. To understand these, you need to think of main RAM as being divided into consecutive, non-overlapping segments we'll call "regions". A typical region size is 64K. Each region is mapped to a cache line, 4 to 128 bytes in size (a typical size is 16). When the processor reads from an address in a given region, and that address is not already in the cache, the location and others near it are read into a line.

direct-mapped
Describes a cache system in which each region has exactly one corresponding line (also called "one-way cache").
two-way set-associative
Each region has *two* possible slots; thus your odds of not having to fetch from DRAM (and, hence, your effective speed) go up.
There are also "four-way" caches. In general, an n-way cache has n possible slots per region and improves your effective speed by some factor proportional to n. However, multiset caches become very costly in terms of silicon real estate, so one does not commonly see five-way or higher caches.
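
Here is a small C sketch of a two-way set-associative lookup with LRU replacement, just to make the mechanism concrete. The set count, line size, and data structures are invented for the example and don't correspond to any real controller.

  /* Toy two-way set-associative cache with LRU replacement.
     Purely illustrative; sizes and layout are invented for this sketch. */
  #include <stdio.h>

  #define LINE_SIZE 16
  #define NUM_SETS  64                 /* tiny, to keep the example small */

  struct way {
      int           valid;
      unsigned long tag;
  };

  struct set {
      struct way way[2];
      int        lru;                  /* index of the least recently used way */
  };

  static struct set cache[NUM_SETS];

  /* Returns 1 on a hit, 0 on a miss (after loading the line). */
  static int lookup(unsigned long addr)
  {
      unsigned long set_no = (addr / LINE_SIZE) % NUM_SETS;
      unsigned long tag    = addr / (LINE_SIZE * NUM_SETS);
      struct set   *s      = &cache[set_no];
      int w;

      for (w = 0; w < 2; w++) {
          if (s->way[w].valid && s->way[w].tag == tag) {
              s->lru = 1 - w;          /* the other way is now least recent */
              return 1;                /* hit: no trip to DRAM              */
          }
      }

      /* Miss: replace the least recently used way in this set. */
      w = s->lru;
      s->way[w].valid = 1;
      s->way[w].tag   = tag;
      s->lru = 1 - w;
      return 0;
  }

  int main(void)
  {
      /* Two addresses that would collide in a direct-mapped cache can
         coexist here, because each set has two slots. */
      printf("%d\n", lookup(0x00000000UL));   /* miss, loads way 0            */
      printf("%d\n", lookup(0x00010000UL));   /* miss, loads way 1 (same set) */
      printf("%d\n", lookup(0x00000000UL));   /* hit                          */
      printf("%d\n", lookup(0x00010000UL));   /* hit                          */
      return 0;
  }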

Because set-associative caches make better use of SRAM, they typically require less SRAM than a direct-mapped cache for equivalent performance. They're also less vulnerable to UNIX's heavy memory usage. Andy Glew of USENET's comp.arch group says "the usual rule of thumb is that a 4-way set-associative cache is equivalent to a direct-mapped cache of twice the size". On the other hand, some claim that as cache size gets larger, two-way associativity becomes less useful. According to this school of thought it actually becomes a net loss over a direct-mapped cache at cache sizes over 256K.

So, typically, you see multi-set cache designs on internal caches, but direct-mapped designs for external caches.

Cache Engineering on 486 Machines

The 8K primary cache (L1) on the 486 is write-through, four-way set-associative with a four-deep write posting buffer. It's been argued that the 486 is more limited by that shallow write-posting buffer than by cache miss overhead.

The 486's 8K internal primary cache is typically supplemented with an external caching system (L2) using SRAM to reduce the cost of an internal cache miss; in November 1994, 20ns SRAM is typical.

What varies between motherboards is the design of this secondary cache.

486 External Caches are Usually Direct-Mapped

A direct-mapped external cache is cheaper and simpler to build than a set-associative one. Given the realities of the clone market, where buyers have been trained to assume that a bigger number is always better, it is also easier to simply advertise the 128K of direct-mapped cache.

Because the 486 has write-posting buffers, a write-through external cache is OK. Remember that the i486 already has a 4-deep write buffer. Writes are done when there is little bus activity (the most common case) or when the buffer is full. The write buffer will write out its contents while the i486 CPU is still crunching away from the internal cache. Thus, a write-through cache has little negative impact on write performance. If the buffer were usually full, the i486 would stall while the write is done (and you would then want a write-back cache, or, a better choice, another external write buffer).

Also, recall that one of the goals of the secondary cache is to have very high effective bandwidth to the i486 processor. Any associativity greater than direct-mapped measurably increases the lookup time in the secondary cache and increases the time to service an internal cache miss, and thus reduces the effective bandwidth to the i486. Thus, direct-mapped secondary caches provide a lower $$ cost AND increase the performance in the common case of a secondary cache hit, even though the hit rate has been slightly reduced by not adding the associativity.

External Cache Line Size

The larger you make a cache line, the cheaper the design will be, as you save on expensive tag RAM, but the worse the performance will be, as you pay for each cache miss manyfold. It is not reasonable to have to reload 32 bytes on every cache miss.

In the presence of interleaved DRAM memory, a cache line should not be larger than a whole DRAM line (with double interleaving, 2*4 = 8 bytes; with quadruple interleaving, 4*4 = 16 bytes). Otherwise, memory fetches into the external cache get slow.

Other Considerations

An external cache which can support the i486 burst mode can increase bandwidth to a much higher level than one which doesn't, and can significantly reduce the cost of an internal cache miss.

Suggestions for Buying

The best advice your humble editor can give is a collection of rules of thumb. Your mileage may vary...

Rule 1: Buy only motherboards that have been tested with UNIX

One of DOS's many sins is that it licenses poor hardware design; it's too brain-dead to stretch the cache system much. Thus, bad cache designs that will run DOS can completely hose UNIX, slowing the machine to a crawl or even (in extreme cases) causing frequent random panics. Make sure your motherboard or system has been tested with some UNIX variant.

Rule 2: Be sure you get enough cache.

If your motherboard offers multiple cache sizes, make sure you know how much is required to service the DRAM you plan to install.

Bela Lubkin writes: "Excess RAM [over what your cache can support] is a very bad idea: most designs prevent memory outside the external cache's cachable range from being cached by the 486 internal cache either. Code running from this memory runs up to 11 times slower than code running out of fully cached memory."

Rule 3: "Enough cache" is at least 64K per 16MB of DRAM

Hardware caches are usually designed to achieve effective zero-wait-state status, rather than perform any significant buffering of data. As a general rule, a 64KB cache handles up to 16MB of memory; more is redundant.

A more sophisticated way of determining cache size is to estimate the number of processes you expect to be running simultaneously (i.e. 1 + expected load average; call this value N). Your external cache should be about N * 32k in size. The justification for this is as follows: upon a context switch, it is a good idea to be able to hold the entire i486 internal cache in the secondary cache. For each process you would need something less than 8k * 4 = 32k (since the internal cache is 4-way set-associative, you need up to 32k of direct-mapped secondary cache to hold its contents without conflicts), and the extra cache left over (24k) should be plenty to help improve the hit rate of the secondary cache when the internal cache misses. The number of main memory accesses caused by context switching should be reduced.
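
To put numbers on it: if you expect a load average of about 3, then N is 4 and the rule suggests 4 * 32k = 128k of external cache.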

Of course, if you are going to be running programs with large memory requirements (especially for data), then a huge secondary cache would probably be a big win. But most programs in the run queue will be small (ls, cat, more, etc.).

Rule 4: If possible, max out the board's cache when you first buy

Bela continues: "Get the largest cache size your motherboard supports, even if you're not fully populating it with RAM. The motherboard manufacturer buys cache chips in quantity, knows how to install them correctly, and you won't end up throwing out the small chips later when you upgrade your main RAM."

Gerard Lemieux qualifies this by observing that if adding SRAM increases the external cache line size, rather than increasing the number of cache lines, it's a lose. If this is the case, then an external cache miss could cost you dearly; imagine how long the processor would have to wait if the line size grew to 1024 bytes. If the cache has a poor hit rate (likely true, since the number of lines has not changed), performance would deteriorate.

Also (he observes), spending an additional $250 for cache chips might buy you 2-3% in performance (even in UNIX). You must ask yourself if it is really worth it.

Caveat

A lot of fast chips are held back by poor cache systems and slow memory. The Pentium 60 has a particular problem this way, because its cycle speed is as fast as that of a 20ns cache SRAM. To avoid trouble, cloners often insert wait states at the cache, slowing down the chip to the effective speed of a 486DX2/66 or slower. (Thanks to Guy Gerard Lemieux for insightful comments on an earlier version of this section.)
Eric S. Raymond