How are the modern Intel CPU L3 caches organized?
Given that CPUs are now multi-core and have their own L1/L2 caches, I was curious how the L3 cache is organized, given that it's shared by multiple cores. I would imagine that if we had, say, 4 cores, then the L3 cache would contain 4 pages' worth of data, each page corresponding to the region of memory that a particular core is referencing. Assuming I'm somewhat correct, is that as far as it goes? It could, for example, divide each of these pages into sub-pages, so that when multiple threads run on the same core, each thread could find its data in one of the sub-pages. I'm just coming up with this off the top of my head, so I'm very interested in learning what is really going on under the hood. Can anyone share their insights or point me to a link that will cure me of my ignorance?

Many thanks in advance.

Hepplewhite answered 6/3, 2015 at 2:18 Comment(2)
Not a programming question; check which site fits better: stackexchange.com/sites (Jaclynjaco)
If you split the shared cache between the cores based on some memory-range scheme, you'd lose the capacity benefit of sharing: being able to use more than your share when possible. You'd also lose the bandwidth benefits of banking. (Foreclosure)
There is a single (sliced) L3 cache in a single-socket chip, and several L2 caches (one per physical core). The L3 caches data in 64-byte segments (cache lines), and a cache-coherence protocol runs between the L3 and the various L2/L1 caches (and between chips in NUMA/ccNUMA multi-socket systems); it tracks which copy of a cache line is current, which is shared between several caches, and which has just been modified (and must be invalidated in the other caches). Some of the protocols (possible cache-line states and state transitions): https://en.wikipedia.org/wiki/MESI_protocol, https://en.wikipedia.org/wiki/MESIF_protocol, https://en.wikipedia.org/wiki/MOESI_protocol

In older chips (the Core 2 era) cache coherence was snooped on a shared bus; now it is checked with the help of a directory.

In real life the L3 is not monolithic but sliced into several slices, each with its own high-speed access port. The slice is selected from the physical address by some undocumented hash function, which lets a multicore system perform many accesses at the same time (each access is directed to some slice; when two cores use the same physical address, their accesses are served by the same slice, or by slices that perform the coherence-protocol checks). The L3 slice-selection hash has been reverse-engineered in several papers.

With recent chips, the programmer has the ability to partition the L3 cache between applications using "Cache Allocation Technology" (v4 family):
https://software.intel.com/en-us/articles/introduction-to-cache-allocation-technology
https://software.intel.com/en-us/articles/introduction-to-code-and-data-prioritization-with-usage-models
https://danluu.com/intel-cat/
https://lwn.net/Articles/659161/

Antepast answered 20/1, 2018 at 19:14 Comment(0)

Modern Intel L3 caches (since Nehalem) use a 64B line size, the same as L1/L2. They're shared, and inclusive.
The exception is Xeon Scalable (Skylake-server) and later, where the L3 is NINE (non-inclusive non-exclusive), which makes more sense with the larger per-core L2 caches whose total size is a significant fraction of the L3.

See also http://www.realworldtech.com/nehalem/2/

Since Sandy Bridge at least, each core has a slice of the L3, and the cores and slices sit on a ring bus. So even in big Xeons, L3 size scales linearly with the number of cores.

See also Which cache mapping technique is used in intel core i7 processor? where I wrote a much larger and more complete answer.


PS: AMD Zen organizes the L3 cache into groups of 4 or 8 cores (a core complex, or CCX) sharing an L3, with separate L3s for separate CCXs in systems with more cores. So it's like a multi-socket system in terms of having multiple separate L3 caches.

According to https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Core_Complex, Zen 2's L3 cache is a victim cache for L2 evictions, and is exclusive of the L2 caches most of the time. But not strictly: it can hang onto data accessed by multiple cores, or by instruction fetch. (Multiple threads are likely to run the same instructions, so populating L3 on I-cache misses makes sense.) I haven't checked the details for other Zen versions. Older AMD CPUs often used exclusive caches, too.

Extort answered 5/7, 2016 at 12:49 Comment(0)
