How do cache lines work?

I understand that the processor brings data into the cache in whole cache lines, which, for instance on my Atom processor, bring in about 64 bytes at a time, whatever the size of the actual data being read.

My question is:

Imagine that you need to read one byte from memory: which 64 bytes will be brought into the cache?

The two possibilities I can see are that either the 64 bytes start at the closest 64-byte boundary below the byte of interest, or the 64 bytes are spread around the byte in some predetermined way (for instance, half below, half above, or all above).

Which is it?

Atrip answered 13/10, 2010 at 23:56 Comment(1)
Read this: What every programmer should know about memory. Then read it again. Better (pdf) source here.Eury

If the cache line containing the byte or word you're loading is not already present in the cache, your CPU will request the 64 bytes that begin at the cache-line boundary (the largest address at or below the one you need that is a multiple of 64).
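
For illustration, a minimal C sketch of that address arithmetic (the 64-byte line size and the example address are assumptions; real code should query the line size at run time, e.g. via sysconf(_SC_LEVEL1_DCACHE_LINESIZE) on Linux):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uintptr_t byte_addr  = 0x1007;                      /* some byte we want to read */
    uintptr_t line_start = byte_addr & ~(uintptr_t)63;  /* clear the low 6 bits */

    /* Accessing byte 0x1007 pulls in the whole line 0x1000 .. 0x103f. */
    printf("line %#jx .. %#jx is fetched\n",
           (uintmax_t)line_start, (uintmax_t)(line_start + 63));
    return 0;
}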

Modern PC memory modules transfer 64 bits (8 bytes) at a time, in a burst of eight transfers, so one command triggers a read or write of a full cache line from memory. (DDR1/2/3/4 SDRAM burst transfer size is configurable up to 64B; CPUs will select the burst transfer size to match their cache line size, but 64B is common)

As a rule of thumb, if the processor can't forecast a memory access (and prefetch it), the retrieval process can take ~90 nanoseconds, or ~250 clock cycles (from the CPU knowing the address to the CPU receiving data).

By contrast, a hit in L1 cache has a load-use latency of 3 to 5 cycles, and a store-reload has a store-forwarding latency of 4 or 5 cycles on modern x86 CPUs. Things are similar on other architectures.
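
A rough way to see these numbers yourself is a pointer-chasing micro-benchmark, where each load depends on the previous one so the latency can't be hidden. This is only a sketch, not a rigorous benchmark: the buffer sizes and iteration counts are assumptions, and it assumes a POSIX system for clock_gettime.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Chase pointers through a randomly permuted cycle of n_ptrs slots, so the
   hardware prefetcher can't guess the next address and every load depends
   on the previous one. Returns average nanoseconds per dependent load. */
static double chase_ns_per_load(size_t n_ptrs, size_t iters)
{
    void **buf = malloc(n_ptrs * sizeof *buf);
    size_t *order = malloc(n_ptrs * sizeof *order);
    for (size_t i = 0; i < n_ptrs; i++) order[i] = i;
    for (size_t i = n_ptrs - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n_ptrs; i++)                /* link the slots into one cycle */
        buf[order[i]] = &buf[order[(i + 1) % n_ptrs]];

    void **p = &buf[order[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;                               /* serialized: one load at a time */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    if (p == NULL) puts("never happens");              /* keep the loop from being optimized out */
    free(order);
    free(buf);
    return ns / (double)iters;
}

int main(void)
{
    /* ~16 KiB of pointers fits in L1; ~128 MiB is far larger than typical caches. */
    printf("L1-sized buffer  : %.1f ns per load\n", chase_ns_per_load(2048, 50000000));
    printf("DRAM-sized buffer: %.1f ns per load\n", chase_ns_per_load(16 * 1024 * 1024, 5000000));
    return 0;
}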


Further reading: Ulrich Drepper's What Every Programmer Should Know About Memory. The DRAM and cache details are still relevant. See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - The software-prefetch advice is a bit outdated: modern HW prefetchers are smarter, and hyperthreading is way better than in P4 days (so a prefetch thread is typically a waste). Also, the tag wiki has lots of performance links for that architecture.

Consentient answered 16/10, 2010 at 2:32 Comment(10)
Why isn't there a 64-byte-wide data bus that allows bringing in the whole cache line at once? It seems really inefficient to do this in 8-byte steps.Hinduism
This answer makes absolutely no sense. What does the 64-bit memory bus width (which is also wrong in that regard) have to do with the 64 bytes(!), not bits? Also, the 10 to 30 ns are totally wrong if you hit the RAM. That might be true for the L3 or L2 cache, but not for the RAM, where it is more like 90 ns. What you mean is the burst time - the time to access the next quad-word in burst mode (which is actually the correct answer)Illiteracy
@MartinKersten: One channel of DDR1/2/3/4 SDRAM does use a 64-bit data bus width. A burst transfer of a whole cache line does take eight transfers of 8B each, and is what actually happens. It might still be correct that the process is optimized by transferring the 8B-aligned chunk containing the desired byte first, i.e. starting the burst there (and wrapping around if it wasn't the first 8B of the burst transfer size). Modern CPUs with multi-level caches probably don't do that anymore, though, because it would mean relaying the first block(s) of the burst up to L1 cache early.Intercept
Haswell has a 64B path between L2 and L1D cache (i.e. a full cache line width), so transferring the 8B containing the requested byte would make for inefficient use of that bus. @Martin is also correct about the access time for a load that has to go to main memory.Intercept
@Peter From the answer it was hard to draw the connection between the 64-bit bus and the 64-byte cache line. The burst will not start with the desired 8 bytes (or something has changed); there is no way of telling which part of a cache line is already valid or not, but I am not into this anymore. I even have no clue whether the data travels upwards through all caching levels (which would be reasonable). I mean, if I request a cache line, does it first enter L3, then L2 and then L1? Does anyone know this?Illiteracy
That is nice information. It makes sense to be able to transfer an entire cache line each cycle. About the 30 ns I am still unsure, but then my information may be outdated, and I apologize for that. My understanding was that DDR4 is just 20% faster but has twice the bandwidth, so waiting for the cache line should not have become several times faster... but again I might have missed the point (and the mark), so kudos to you.Illiteracy
@Martin: I think CPUs that use cut-through for the needed bytes from the whole cache-line would have the memory controller supply the bytes to the cache and the load unit in parallel. So there's never a situation where you're taking only the valid bytes from an actual cache line. Keeping multiple parts of the CPU busy doing the same thing sounds more like an in-order design; probably few out-of-order designs do this. (And I agree, real x86 CPUs don't do it.)Intercept
Good question about whether data goes all the way up the memory hierarchy at once, or whether L3 waits for a full line from memory before starting to send it up to L2. There are transfer buffers between different levels of cache, and each outstanding miss claims one. So (total guesswork) probably L3 puts bytes from the memory controller into its own receive buffer at the same time as putting them into the appropriate load buffer for the L2 cache that wanted it. When the line is fully transferred from memory, L3 notifies the L2 that the line is ready, and copies it into its own array.Intercept
@Martin: I decided to go ahead and edit this answer. I think it's more accurate now, and still simple. Future readers: see also Mike76's question and my answer: #39182560Intercept
@PeterCordes Hi Peter, would really appreciate it if you could take a look at this related question https://mcmap.net/q/56018/-should-i-align-data-to-their-data-type-or-cpu-cache-line-size/14940626Humble

First of all, a main-memory access is very expensive. Currently a 2 GHz CPU (among the slowest nowadays) has 2G ticks (cycles) per second. A CPU (a virtual core nowadays) can fetch a value from its registers once per tick. Since a virtual core consists of multiple processing units (ALU - arithmetic logic unit, FPU, etc.), it can actually process certain instructions in parallel where possible.

An access to main memory costs about 70 ns to 100 ns (DDR4 is slightly faster). This time basically covers looking up the L1, L2 and L3 caches, then hitting the memory (sending the command to the memory controller, which sends it to the memory banks), waiting for the response, and done.

At 2 GHz, 100 ns means about 200 ticks. So basically, if a program always missed the caches with each memory access, the CPU would spend about 99.5% of its time (if it only reads memory) idle, waiting for the memory - roughly one useful cycle out of every 200.

In order to speed things up, there are the L1, L2 and L3 caches. They use memory placed directly on the chip and a different kind of transistor circuit (SRAM) to store the bits. This takes more room and more energy and is more costly than main memory, since a CPU is usually produced with a more advanced process, and a production defect in the L1, L2 or L3 memory can render the whole CPU worthless. Large L1, L2 and L3 caches therefore increase the error rate, which decreases the yield, which directly decreases ROI. So there is a huge trade-off when it comes to the available cache size.

(Currently, manufacturers build more L1, L2 and L3 cache than is exposed, so that portions can be deactivated; this decreases the chance that a production defect in the cache memory areas renders the CPU defective as a whole.)

To give a timing idea (source: costs to access caches and memory)

  • L1 cache: 1ns to 2ns (2-4 cycles)
  • L2 cache: 3ns to 5ns (6-10 cycles)
  • L3 cache: 12ns to 20ns (24-40 cycles)
  • RAM: 60ns (120 cycles)

Since we mix different CPU types, these are just estimates, but they give a good idea of what is really going on when a memory value is fetched and we might have a hit or a miss in a certain cache layer.

So a cache basically speeds up memory access greatly (60ns vs. 1ns).

Fetching a value and storing it in the cache on the chance of rereading it is good for variables that are accessed often, but for memory-copy operations it would still be too slow, since one just reads a value, writes it somewhere else and never reads it again... no cache hits, dead slow (besides, this can happen in parallel since we have out-of-order execution).

Memory copying is so important that there are different means to speed it up. In the early days, memory could often be copied outside of the CPU: it was handled by the memory controller directly (DMA), so a memory-copy operation did not pollute the caches.

But besides plain memory copies, other serial accesses of memory are quite common. An example is analyzing a series of data: having an array of integers and calculating the sum or the mean, or, even simpler, finding a certain value (filter/search), is another very important class of algorithms run all the time on any general-purpose CPU.

So by analyzing memory access patterns, it became apparent that data is very often read sequentially. There is a high probability that if a program reads the value at index i, it will also read the value at i+1. This probability is slightly higher than the probability that the same program will also read the value at i+2, and so on.

So given a memory address, it was (and still is) a good idea to read ahead and fetch additional values. This is the reason why there is a burst mode.

Memory access in burst mode means that an address is sent once and multiple values are then sent sequentially. Each additional value sent takes only about 10 additional ns (or even less).
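
A small way to observe this effect (the array size and the 64-byte-line assumption are illustrative; this is a sketch, not a calibrated benchmark): both loops below touch every cache line of a large array, but the strided one uses only one int out of each line, so despite doing 1/16 of the additions it often takes a comparable amount of time, because fetching whole lines from memory dominates.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 64M ints = 256 MiB, far larger than the caches */
#define LINE_INTS 16           /* 64-byte line / 4-byte int */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    for (long i = 0; i < N; i++) a[i] = 1;

    double t0 = seconds();
    long sum_seq = 0;
    for (long i = 0; i < N; i++)             /* uses all 16 ints of every line */
        sum_seq += a[i];
    double t1 = seconds();

    long sum_strided = 0;
    for (long i = 0; i < N; i += LINE_INTS)  /* uses 1 int of every line */
        sum_strided += a[i];
    double t2 = seconds();

    printf("sequential (%ld adds): %.3f s\n", (long)N, t1 - t0);
    printf("strided    (%ld adds): %.3f s\n", (long)N / LINE_INTS, t2 - t1);
    printf("checksums: %ld %ld\n", sum_seq, sum_strided);  /* keep the sums live */
    free(a);
    return 0;
}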

Another problem was the address itself. Sending an address takes time, and in order to address a large portion of memory, large addresses have to be sent. In the early days this meant that the address bus was not big enough to send the address in a single cycle (tick), and more than one cycle was needed to send the address, adding more delay.

A cache line of 64 bytes, for instance, means that memory is divided into distinct (non-overlapping) blocks that are 64 bytes in size. 64 bytes means the start address of each block always has its lowest six address bits set to zero. So these six zero bits do not need to be sent each time, which multiplies the addressable range by 64 for any given address-bus width (a welcome side effect).

Another problem the cache line solves (besides reading ahead and saving/freeing six bits on the address bus) lies in the way the cache is organized. For example, if a cache were divided into 8-byte (64-bit) blocks (cells), one would need to store, alongside each cache cell, the address of the memory cell it holds the value for. If the address were also 64 bits, this would mean that half the cache size is consumed by addresses, resulting in an overhead of 100%.

Since a cache line is 64 bytes, a CPU might use 64 bits - 6 bits = 58 bits (there is no need to store the zero bits), which means we can cache 64 bytes, or 512 bits, with an overhead of 58 bits (about 11% overhead). In reality the stored addresses are even smaller than this, but there is also status information (such as whether the cache line is valid, or dirty and needs to be written back to RAM, etc.).

Another aspect is that we have set-associative caches. Not every cache cell is able to store a given address, only a subset of them. This makes the address bits that need to be stored even smaller and allows parallel access to the cache (each subset can be accessed once, but independently of the other subsets).
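
To make the tag/index/offset split concrete, here is a tiny sketch that assumes a 32 KiB, 8-way set-associative cache with 64-byte lines (64 sets) - parameters typical of an x86 L1d, but purely illustrative here. Only the tag portion has to be stored next to each line:

#include <stdio.h>
#include <stdint.h>

#define LINE_BITS 6     /* 64-byte line -> 6 offset bits */
#define SET_BITS  6     /* 32 KiB / 64 B per line / 8 ways = 64 sets -> 6 index bits */

int main(void)
{
    uint64_t addr = 0x00007ffd1234abcdULL;   /* arbitrary example address */

    uint64_t offset = addr & ((1ULL << LINE_BITS) - 1);               /* byte within the line */
    uint64_t set    = (addr >> LINE_BITS) & ((1ULL << SET_BITS) - 1); /* which set to look in */
    uint64_t tag    = addr >> (LINE_BITS + SET_BITS);                 /* what is stored per line */

    printf("address %#llx -> tag %#llx, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}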

There is more to it, especially when it comes to synchronizing cache/memory access between the different virtual cores, the multiple independent processing units per core, and finally multiple processors on one mainboard (there are boards housing as many as 48 processors and more).

This is basically the current idea of why we have cache lines. The benefit from reading ahead is very high, and the worst case, reading a single byte out of a cache line and never touching the rest again, is rare since its probability is very slim.

The size of the cache line (64 bytes) is a wisely chosen trade-off between several factors: larger cache lines make it less likely that the last byte of the line will also be read in the near future, they take longer to fetch from memory (and to write back), and they also affect the overhead in cache organization and the parallelization of cache and memory access.

Illiteracy answered 30/8, 2016 at 10:28 Comment(2)
A set-associative cache uses some address bits to select a set, so the tags can be even shorter than your example. Of course, the cache also needs to keep track of which tag goes with which data array in the set, but there are usually more sets than ways within a set. (e.g. 32kB 8-way associative L1D cache, with 64B lines, in Intel x86 CPUs: offset 6 bit, index 6 bits. Tags only need to be 48-12 bits wide, because x86-64 (for now) only has 48-bit physical addresses. As I'm sure you know, it's not a coincidence that the low 12 bits is the page offset, so L1 can be VIPT without aliasing.)Intercept
amazing answer bud ... is there a "like" button anywhere ?Vestry

If cache lines are 64 bytes wide, then they correspond to blocks of memory which start on addresses that are divisible by 64. The least significant 6 bits of any address are an offset into the cache line.

So for any given byte, the cache line which has to be fetched can be found by clearing the least significant six bits of the address, which corresponds to rounding down to the nearest address that is divisible by 64.

Though this is done by hardware, we can show the calculations using some reference C macro definitions:

#include <stdint.h>  /* for uintptr_t */

#define CACHE_BLOCK_BITS 6
#define CACHE_BLOCK_SIZE ((uintptr_t)1 << CACHE_BLOCK_BITS)  /* 64 */
#define CACHE_BLOCK_MASK (CACHE_BLOCK_SIZE - 1)              /* 63, 0x3F */

/* Which byte offset in its cache block does this address reference? */
#define CACHE_BLOCK_OFFSET(ADDR) ((ADDR) & CACHE_BLOCK_MASK)

/* Address of the 64-byte block brought into the cache when ADDR is accessed.
   Using a uintptr_t-wide constant keeps ~CACHE_BLOCK_MASK correct when ADDR
   is a full 64-bit address (a plain 1U mask would be truncated to 32 bits). */
#define CACHE_BLOCK_ALIGNED_ADDR(ADDR) ((ADDR) & ~CACHE_BLOCK_MASK)
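
For the commenters asking for example code, here is a small usage sketch that can be appended below the macro definitions above; the buffer and the chosen byte are arbitrary illustrations.

#include <stdio.h>

int main(void)
{
    char buf[256];
    uintptr_t addr = (uintptr_t)&buf[100];   /* address of an arbitrary byte */

    printf("byte address       : %#jx\n", (uintmax_t)addr);
    printf("offset in its line : %ju\n",  (uintmax_t)CACHE_BLOCK_OFFSET(addr));
    printf("line start address : %#jx\n", (uintmax_t)CACHE_BLOCK_ALIGNED_ADDR(addr));
    return 0;
}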
Thinker answered 11/6, 2013 at 2:46 Comment(2)
I have a hard time understanding this. I know it is 2 years later, but can you give me example code for this? One or two lines.Spraggins
@Spraggins The reason this method works lies in the binary number system. Any power of 2 has just one bit set and all remaining bits cleared, so for 64 you have 0b1000000. Notice that the last 6 digits are zeros, so even when you have some number with any of those 6 bits set (which represent number % 64), clearing them will give you the nearest 64-byte-aligned memory address.Guru

Processors may have multi-level caches (L1, L2, L3), and these differ in size and speed.

Yet, to understand what exactly goes into each cache you'll have to study the branch predictor used by that specific processor, and how the instructions/data of your program behave against it.

Read about branch predictor, CPU cache and replacement policies.

This is not an easy task. If at the end of the day all you want is a performance test, you can use a tool like Cachegrind. However, as this is a simulation, its results may differ to some degree.

Mantelletta answered 16/10, 2010 at 1:21 Comment(0)

I can't say for certain, as hardware varies, but it is typically "the 64 bytes starting at the closest 64-byte boundary below", as that is a very fast and simple operation for the CPU.

Neall answered 16/10, 2010 at 1:34 Comment(1)
I can say for certain. Any reasonable cache design will have lines with sizes that are a power of 2, and that are naturally aligned. (e.g. 64B-aligned). It's not just fast and simple, it's literally free: you just ignore the low 6 bits of the address, for example. Caches often do different things with different ranges of addresses. (e.g. cache cares about tag and index for detecting hit vs. miss, then only using the offset within a cache line for inserting / extracting data)Intercept
