How are cache memories shared in multicore Intel CPUs?

Asked 3/6, 2009 at 14:6 Answered 5/6, 2009 at 2:47

performance x86 multiprocessing intel cpu-cache

I have a few questions regarding cache memories used in multicore CPUs or multiprocessor systems. (Although not directly related to programming, this has many repercussions when writing software for multicore processors or multiprocessor systems, hence asking here!)

In a multiprocessor system or a multicore processor (Intel Quad Core, Core 2 Duo, etc.) does each CPU core/processor have its own cache memory (data and program cache)?
Can one processor/core access another's cache memory? If they are allowed to access each other's caches, then I believe there might be fewer cache misses: if one processor's cache does not have some data but another processor's cache does, a read from memory into the first processor's cache could be avoided. Is this assumption valid and true?
Will there be any problems in allowing any processor to access another processor's cache memory?

Reactionary answered 3/6, 2009 at 14:6 Comment(1)

In a multiprocessor system or a multicore processor (Intel Quad Core, Core two Duo etc..) does each cpu core/processor have its own cache memory (data and program cache)?

Yes. It varies by the exact chip model, but the most common design is for each CPU core to have its own private L1 data and instruction caches.

On old and/or low-power CPUs, the next level of cache is typically a unified L2 cache shared between all cores. On the 65nm Core 2 Quad (which was two Core 2 Duo dies in one package), each pair of cores had its own last-level cache, and the two pairs couldn't communicate as efficiently.

Modern mainstream Intel CPUs (since the first-gen i7 CPUs, Nehalem) use 3 levels of cache.

32kiB split L1i/L1d: private per-core (same as earlier Intel)
256kiB unified L2: private per-core. (1MiB on Skylake-avx512).
large unified L3: shared among all cores

The last-level cache is a large shared L3. It's physically distributed between cores, with a slice of L3 going with each core on the ring bus that connects the cores. There is typically 1.5 to 2.25MB of L3 cache per core, so a many-core Xeon might have a 36MB L3 cache shared between all its cores. This is why a dual-core chip has 2 to 4 MB of L3, while a quad-core has 6 to 8 MB.
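As a rough check of those figures, total L3 is just the core count times the per-core slice size. A minimal Python sketch follows; the function name `l3_total_mb` and the specific core/slice combinations are illustrative assumptions chosen to fall in the quoted ranges, not values from any Intel datasheet.

```python
def l3_total_mb(cores: int, slice_mb: float) -> float:
    """Total shared L3 if every core contributes one slice of `slice_mb` MB."""
    return cores * slice_mb

# Hypothetical configurations picked to match the ranges quoted above.
for cores, slice_mb in [(2, 2.0), (4, 1.5), (4, 2.0), (16, 2.25)]:
    total = l3_total_mb(cores, slice_mb)
    print(f"{cores} cores x {slice_mb} MB per slice = {total} MB of L3")
```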

On CPUs other than Skylake-avx512, L3 is inclusive of the per-core private caches, so its tags can be used as a snoop filter to avoid broadcasting requests to all cores. That is, anything cached in a private L1d, L1i, or L2 must also be allocated in L3. See "Which cache mapping technique is used in intel core i7 processor?"
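To illustrate why inclusivity lets the L3 tags act as a snoop filter, here is a toy Python model. It is only a sketch: the class `InclusiveL3`, its method names, and the use of a Python set to track which cores may hold a line are all assumptions made for illustration, not a description of the real hardware structures.

```python
class InclusiveL3:
    """Toy inclusive last-level cache that remembers which cores may hold each line."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.sharers = {}  # line address -> set of core ids that may hold the line

    def fill(self, core, line):
        # A private-cache fill must also allocate the line in L3 (inclusion).
        self.sharers.setdefault(line, set()).add(core)

    def evict(self, line):
        # Evicting from an inclusive L3 implies invalidating all private copies.
        self.sharers.pop(line, None)

    def cores_to_snoop(self, requester, line):
        # If the line is not in L3, inclusion guarantees no private cache has it,
        # so nobody needs to be probed.
        return sorted(self.sharers.get(line, set()) - {requester})

    def broadcast_targets(self, requester):
        # What a design WITHOUT a snoop filter would have to probe.
        return [c for c in range(self.num_cores) if c != requester]


l3 = InclusiveL3(num_cores=4)
l3.fill(0, 0x1000)
l3.fill(2, 0x1000)
print(l3.cores_to_snoop(1, 0x1000))   # only the cores recorded as possible holders
print(l3.cores_to_snoop(1, 0x2000))   # line absent from L3: nobody to probe
print(l3.broadcast_targets(1))        # every other core
```

In this model a request for a line that misses in L3 probes no cores at all, whereas a broadcast scheme would probe every other core on every request.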

David Kanter's Sandybridge write-up has a nice diagram of the memory hierarchy / system architecture, showing the per-core caches and their connection to shared L3, and DDR3 / DMI (chipset) / PCIe connecting to that. (This still applies to Haswell / Skylake-client / Coffee Lake, except with DDR4 in later CPUs).

Can one processor/core access another's cache memory? If they are allowed to access each other's caches, then I believe there might be fewer cache misses: if one processor's cache does not have some data but another processor's cache does, a read from memory into the first processor's cache could be avoided. Is this assumption valid and true?

No. Each CPU core's L1 caches tightly integrate into that core. Multiple cores accessing the same data will each have their own copy of it in their own L1d caches, very close to the load/store execution units.

The whole point of multiple levels of cache is that a single cache can't be both fast enough for very hot data and big enough for less-frequently used data that's still accessed regularly. See "Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?"
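The trade-off can be made concrete with the usual average-access-time formula, where each level's miss penalty is the cost of going to the next level. This is a hedged sketch: the function name `average_access_cycles` is made up for the example, and every latency and hit rate below is a hypothetical round number, not a measurement of any real CPU.

```python
def average_access_cycles(levels, memory_cycles):
    """Average cycles per access.

    levels: list of (hit_latency_cycles, local_hit_rate), nearest cache first.
    memory_cycles: cost of going all the way to DRAM.
    """
    cost = memory_cycles
    # Work outward-in: a miss at one level pays the cost of everything beyond it.
    for latency, hit_rate in reversed(levels):
        cost = latency + (1.0 - hit_rate) * cost
    return cost


DRAM = 200  # hypothetical DRAM latency in cycles

# All numbers below are assumptions for illustration only.
small_fast_only = [(4, 0.90)]                         # tiny cache: fast but misses a lot
big_slow_only = [(40, 0.99)]                          # huge cache: rarely misses but slow
hierarchy = [(4, 0.90), (12, 0.70), (40, 0.80)]       # L1, L2, L3

for name, cfg in [("small only", small_fast_only),
                  ("big only", big_slow_only),
                  ("three levels", hierarchy)]:
    print(f"{name}: {average_access_cycles(cfg, DRAM):.1f} cycles on average")
```

With these made-up inputs the three-level hierarchy comes out well ahead of either single cache, which is the point the paragraph above is making.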

Going off-core to another core's caches wouldn't be faster than just going to L3 in Intel's current CPUs. Alternatively, the mesh network between cores that would be required to make it faster would be prohibitively expensive compared to just building a larger / faster L3 cache.

The small/fast caches built into other cores are there to speed up those cores. Sharing them directly would probably cost more power (and maybe even more transistors / die area) than other ways of increasing cache hit rate. (Power is a bigger limiting factor than transistor count or die area. That's why modern CPUs can afford to have large private L2 caches).

Plus you wouldn't want other cores polluting the small private cache that's probably caching stuff relevant to this core.

Will there be any problems in allowing any processor to access another processor's cache memory?

Yes -- there simply aren't wires connecting the various CPU caches to the other cores. If a core wants to access data in another core's cache, the only data path through which it can do so is the system bus.

A very important related issue is the cache coherency problem. Consider the following: suppose one CPU core has a particular memory location in its cache, and it writes to that memory location. Then, another core reads that memory location. How do you ensure that the second core sees the updated value? That is the cache coherency problem.

The normal solution is the MESI protocol, or a variation on it. Intel uses MESIF.
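The scenario above (one core writes a location, then another core reads it) can be traced with a toy MESI model. This is a simplified sketch for a single cache line: the class name `Line` and its methods are invented for the example, the F state of Intel's MESIF is left out, and real hardware does this with bus/ring messages rather than a shared Python list.

```python
class Line:
    """MESI state of ONE cache line across all cores (toy model, no F state)."""

    def __init__(self, num_cores):
        self.state = ["I"] * num_cores  # every core starts with the line Invalid

    def read(self, core):
        if self.state[core] != "I":
            return  # hit: M, E, or S can all satisfy a load locally
        others = [c for c, s in enumerate(self.state) if s != "I"]
        for c in others:
            # A Modified copy is written back and an Exclusive copy is
            # downgraded, so the reader sees the up-to-date value.
            self.state[c] = "S"
        self.state[core] = "S" if others else "E"

    def write(self, core):
        # Read-For-Ownership: invalidate every other copy before writing.
        # If this core already holds the line in M or E there are no other
        # copies, so the loop changes nothing.
        for c in range(len(self.state)):
            if c != core:
                self.state[c] = "I"
        self.state[core] = "M"


line = Line(num_cores=2)
line.read(0);  print(line.state)   # core 0 is the only holder
line.write(0); print(line.state)   # core 0 now has the only, dirty copy
line.read(1);  print(line.state)   # core 1's read forces sharing of the new value
line.write(1); print(line.state)   # core 1 takes ownership, core 0 is invalidated
```

At every step at most one core holds the line in M or E, which is the invariant that makes the second core see the updated value.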

Accordion answered 5/6, 2009 at 2:47 Comment(4)

The "variety" of solutions is really not that varied. Pretty much everything uses some minor variation on the MESI protocol. Many caches can have a copy of a Shared line, but only one cache in the coherency domain (i.e. the system) can have a line in Modified or Exclusive state. So to write a line, a CPU does a Read-For-Ownership to make sure no other cache in the system has a copy of that line. Related: how atomic read-modify-write works (lock inc [mem]): stackoverflow.com/questions/39393850/… – Ainslie 20/9, 2017 at 13:29

Update 8 years later: these days it's typical for CPUs to have private per-core L1 and L2 caches, with a shared L3. (Intel since Nehalem.) The L3 can back-stop coherency traffic so it doesn't have to go all the way to memory. – Ainslie 20/9, 2017 at 13:33

Also NUMA (en.wikipedia.org/wiki/Non-uniform_memory_access) is an interesting and relevant topic. – Baumann 5/3, 2018 at 10:36

This answer was in need of a major overhaul for various reasons: 3-level caches on everything after Nehalem, and various slight technical mis-statements. I hope I managed to improve things without making it unreadable for people without cpu-architecture experience. – Ainslie 3/1, 2019 at 9:45

Quick answers: 1) Yes. 2) No, but it may all depend on what memory instance/resource you are referring to; data may exist in several locations at the same time. 3) Yes.

For a full-length explanation of the issue, you should read the nine-part article "What every programmer should know about memory" by Ulrich Drepper ( http://lwn.net/Articles/250967/ ). It gives the full picture of the issues you seem to be inquiring about, in good and accessible detail.

Musset answered 5/6, 2009 at 2:7 Comment(0)

To answer your first question, I know the Core 2 Duo has a 2-tier caching system, in which each processor has its own first-level cache, and they share a second-level cache. This helps with both data synchronization and utilization of memory.

To answer your second question, I believe your assumption to be correct. If the processors were able to access each other's caches, there would obviously be fewer cache misses, as there would be more data for the processors to choose from. Consider, however, a shared cache. In the case of the Core 2 Duo, having a shared cache allows programmers to place commonly used variables safely in this environment so that the processors will not have to access their individual first-level caches.

To answer your third question, there could potentially be a problem with accessing other processors' cache memory, which relates to the "Single Write Multiple Read" principle. We can't allow more than one process to write to the same location in memory at the same time.

For more info on the Core 2 Duo, read this neat article.

http://software.intel.com/en-us/articles/software-techniques-for-shared-cache-multi-core-systems/

Bainbridge answered 3/6, 2009 at 14:21 Comment(0)
