What's the theory and measurements behind cache line sizes?

Cache lines are often 64 bytes, though other sizes exist as well.

My very simple question is: is there any theory behind this number, or is it just the result of the vast amount of testing and measurement that the engineers behind these designs undoubtedly perform?

Either way, I was wondering what those are: the theory, if there is one, and the kinds of tests behind the decision.

Unbearable answered 30/3, 2016 at 15:11 Comment(2)
The smaller the cache lines are, the more space the tags take up. I suspect that larger last-level caches were one of the big motivations for increasing the cache line size from 32 bytes (Pentium III) to 64 bytes (current designs). I assume that pretty much cuts the tag die area in half.Hatteras
64 bytes is also the max size of a (DDR) SDRAM burst read or write, but that was probably chosen to match current (half-length) and future (full-length) CPUs, more so than the other way around. (en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory). Also related in general: How much of ‘What Every Programmer Should Know About Memory’ is still valid?Hatteras

In general, microarchitectural parameters tend to be tuned via performance modeling rather than derived from some theoretical model. That is to say, there isn't anything like the "big O" notation used to characterize the performance of algorithms. Instead, benchmarks are run in performance simulators, and the results guide the choice of the best parameters.

That having been said, there are a few reasons why the cache line size tends to be fairly stable in an established architecture:

  • Size is a power of 2: The line size should be a power of 2 in order to simplify addressing, which limits the number of possible choices for the cache line size (the sketch after this list shows how the split into tag, index, and offset then reduces to shifts and masks).

  • Software is optimized based on cache parameters: Many microarchitectural parameters are completely hidden from the programmer, but the cache line size is one that is visible and can have a significant impact on performance for some applications. Once programmers have optimized their code for a 64-byte cache line size, processor architects have an incentive to keep that same line size in future processors, even if the underlying technology changed in a way that made a different line size easier to implement in hardware.

  • Cache coherence interacts with the cache line: The verification of cache coherence protocols is extremely difficult, and cache coherence is a source of many bugs in processors. Coherence is tracked at the cache line level, so changing the cache line size would require redoing all of the validation of the coherence protocol. There would need to be a strong motivation for changing this parameter.

  • Changing cache line size could introduce false sharing: This is a special case of software being optimized based on cache parameters, but I think it is worth mentioning. Parallel programs are difficult to write in a way that actually provides performance benefits, and since data is tracked at cache line granularity it is important to avoid false sharing. If the cache line size changed from one processor generation to the next, it could cause false sharing in the new processor that did not exist in the old one (the padding sketch at the end of this answer shows one common way to guard against it).
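
To illustrate the first point above, here is a minimal sketch of how a power-of-two line size (and set count) lets an address be split into tag, set index, and byte offset with nothing but shifts and masks. The geometry (64-byte lines, 8 ways, 32 KiB) is a hypothetical example, not taken from any particular CPU:

    #include <cstdint>
    #include <cstdio>

    // Hypothetical cache geometry, chosen purely for illustration.
    constexpr std::uint64_t kLineSize   = 64;                     // bytes per line (power of 2)
    constexpr std::uint64_t kWays       = 8;
    constexpr std::uint64_t kCacheSize  = 32 * 1024;              // 32 KiB
    constexpr std::uint64_t kSets       = kCacheSize / (kLineSize * kWays);  // 64 sets
    constexpr std::uint64_t kOffsetBits = 6;                      // log2(kLineSize)
    constexpr std::uint64_t kIndexBits  = 6;                      // log2(kSets)

    int main() {
        std::uint64_t addr = 0x7ffc1234abcdULL;

        // Because line size and set count are powers of two, the split is
        // nothing more than shifts and masks; no division is needed.
        std::uint64_t offset = addr & (kLineSize - 1);
        std::uint64_t index  = (addr >> kOffsetBits) & (kSets - 1);
        std::uint64_t tag    = addr >> (kOffsetBits + kIndexBits);

        std::printf("addr=%#llx -> tag=%#llx set=%llu offset=%llu\n",
                    (unsigned long long)addr, (unsigned long long)tag,
                    (unsigned long long)index, (unsigned long long)offset);
        return 0;
    }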

Although 64 bytes is the line size used by x86 and most ARM processors, other line sizes are also in use. For instance, MIPS has many processors with a 32-byte line size, and some with a 16-byte line size.

The line size is tuned to some degree to give the best performance for the workloads that the architecture is expected to run. However, once a line size is selected, and significant amounts of software have been written for the architecture, then the line size is unlikely to change in the future, for the reasons that I listed above.
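
To make the false-sharing point concrete, here is a minimal sketch, assuming a 64-byte line: each counter is aligned (and padded) to one assumed line so the two writing threads do not invalidate each other's line. If a future part used 128-byte lines, the two objects could end up sharing a line again unless the constant were updated:

    #include <atomic>
    #include <cstddef>
    #include <thread>

    // Assumed line size; nothing guarantees this matches the CPU you run on.
    constexpr std::size_t kAssumedLineSize = 64;

    // Each counter is aligned (and therefore padded) to one assumed cache line,
    // so the two writers below do not ping-pong a shared line between cores.
    struct alignas(kAssumedLineSize) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter a, b;   // each starts on its own (assumed) cache line

    int main() {
        std::thread t1([] { for (int i = 0; i < 1000000; ++i) a.value.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([] { for (int i = 0; i < 1000000; ++i) b.value.fetch_add(1, std::memory_order_relaxed); });
        t1.join();
        t2.join();
        return 0;
    }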

Helium answered 30/3, 2016 at 20:35 Comment(8)
+1 for benchmarking and tuning. It saddens my heart that programmers optimize for a specific cache line size when it is visible and programs would be able to adapt with little effort. But I guess them bastards are lazy...Calcimine
@Andreas: Making loops with compile-time-constant bounds can help compilers optimize better, much better in some cases (e.g. loop unrolling). It would be a lot of work to write code in a way that let the compiler know that a variable could only be a power of two greater than 32, for example, without including run-time checks in the code.Hatteras
@PeterCordes Interesting point. I thought compilers did not require compile-time information about variable ranges. And besides, aren't variable ranges decided by functional requirements rather than platform properties? Maybe I'm putting too much faith in compiler optimization capabilities.Calcimine
@Andreas: They don't require it, but if they have it, they can optimize even more. Compile-time constants can even end up folded right into instructions instead of tying up a register. e.g. add rsi, 64 to increment a pointer by 64. The 64 is a byte in the instruction stream that's part of the instruction. The more stuff a compiler can prove about variable values, the better it can optimize.Hatteras
@PeterCordes When I say "compile time information" I mean the information provided in the code. The compiler derives many "constants" on its own based on my code. For example, if I loop through a C string to retrieve its length: the function itself is built for any length, but my strings are of constant length. Hence my calls to that function could be sort of "inlined", and thus the function is executed as if it knew its constants (at the cost of code length, of course). And the aggressiveness of the compiler is usually configurable. I simply cannot imagine a case where those constants cannot be derived.Calcimine
@Andreas: I thought you were suggesting that programs with tuning for cache should get the current system's cache parameters at run time, e.g. by running CPUID on x86, or more generally sysconf(_SC_LEVEL1_DCACHE_LINESIZE). Not all x86 CPUs have the same cache-line size; it's not architecturally defined, so implementations are free to have whatever line size they want. This even affects the operation of the clflush instruction (cache-line flush).Hatteras
I think what you actually meant was that programs should make it easy to recompile them for different cache line sizes (with a #define or static constexpr int clsize = 64) instead of hard-coding it all over the place. I agree with that. There might be cases that justify only optimizing for one size in the source, e.g. where some trick only makes sense with 64B or larger cache lines, and nobody has actually written the code for the case where the trick isn't a good idea.Hatteras
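
For reference, a small sketch of the two approaches being contrasted in these comments: querying the line size at run time with POSIX/glibc sysconf versus baking a compile-time constant into the binary. The fallback value below is an assumption for when sysconf cannot report a size:

    #include <unistd.h>   // sysconf, _SC_LEVEL1_DCACHE_LINESIZE (POSIX/glibc)
    #include <cstdio>

    // Compile-time choice: the compiler can fold this into instructions,
    // but it bakes one line size into the binary.
    static constexpr long kCompileTimeLineSize = 64;

    int main() {
        // Run-time choice: adapts to whatever machine the binary runs on.
        long detected = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (detected <= 0)
            detected = kCompileTimeLineSize;  // assumed fallback when the value isn't reported

        std::printf("compile-time assumption: %ld bytes, detected at run time: %ld bytes\n",
                    kCompileTimeLineSize, detected);
        return 0;
    }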
It's not so much that programmers "optimize" for a cache line size as that they "avoid doing bad things given a cache line size". A perfect example is the false sharing this answer refers to: don't store your mutex in the same cache line as other data being accessed, or you'll get a lot of false sharing.Eyecatching

The history of cache line sizes is a bit convoluted (as with many microarchitectural parameters). Originally, the cache line size was made to match the bus size of the processor. The thinking was that if a read or write was done on the bus, it might as well fill the data bus.

As caches got bigger, the sizes of cache lines increased for a few reasons:

  1. To take advantage of spatial locality in certain cases.
  2. To keep indexing overhead low <--- this one is actually pretty important.

The larger the cache line, the fewer lines you need to keep track of for a cache of a given capacity. For larger caches (multi-MB) this can reduce the lookup/compare times.
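
As a rough worked example of that trade-off (the 8 MiB capacity, 48-bit physical address, and the simplified tag estimate are assumptions chosen only to put numbers on the point):

    #include <cstdio>
    #include <initializer_list>

    int main() {
        // Hypothetical last-level cache: 8 MiB, 48-bit physical addresses.
        const long long cache_bytes    = 8LL * 1024 * 1024;
        const int       phys_addr_bits = 48;

        for (long long line_size : {32LL, 64LL, 128LL}) {
            long long lines = cache_bytes / line_size;   // entries the cache must track

            // Very rough per-line tag estimate: address bits not covered by the
            // byte offset (ignores set-index bits, state bits, ECC, etc.).
            int offset_bits = 0;
            for (long long l = line_size; l > 1; l >>= 1) ++offset_bits;
            long long total_tag_bits = lines * (phys_addr_bits - offset_bits);

            std::printf("%3lld-byte lines: %7lld lines to track, ~%lld total tag bits\n",
                        line_size, lines, total_tag_bits);
        }
        return 0;
    }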

There are also some performance advantages (depending on the workload) to a larger cache line size, but it's not entirely clear (take Spec2k17, for example) that it's always a win. Sometimes a larger cache line size introduces more waste because the program has low spatial locality.

Note that you don't need to have a single cache line size for all levels of cache. You could have 32B cache lines in the L1, 64B in the L2, and 128B in the L3/LLC if you wanted to. It's more work to keep track of partial lines, but it lets you utilize each level of cache effectively.

Eyecatching answered 30/1, 2023 at 20:41 Comment(2)
Do you have a source for early CPUs using tiny cache lines like 32 bits to match their bus widths? That would be a lot of tag overhead relative to the data. MIPS R2000's controller for external L1i/d caches could support line sizes as small as 4B since it fetches a word and tag from cache, but I assume that configuration wasn't used in real systems. Instead you'd have maybe 4 different data word addresses index the same tag in the SRAM chips you were using for tag storage.Hatteras
Different line sizes in different levels would seriously complicate write-back, wouldn't they? Can you just merge the new half line with the unchanged other half? Only if the outer cache was inclusive; otherwise it would have to fill the other half before you could evict a dirty L1d line into half of an L2 line.Hatteras
