Is the TLB shared between multiple cores?

The TLB caches the translations listed in the page table. Each CPU core can be running in a different context, with different page tables. Each core has its own MMU, although really it's not a separate unit at all, it's part of the core (parts of load/store ports, the TLB, and page-walker). Any shared caches are always physically-indexed / physically tagged, so they cache based on post-MMU physical address.

The TLB is an implementation detail (just a cache of PTEs, page table entries) that could vary by microarchitecture. In practice, all that really varies is the size. It's always per-core. 2-level TLBs are common now, to keep full TLB misses to a minimum while the first level stays small & fast enough to allow 3 translations per clock cycle (for data load/store, in parallel with the iTLB.)

It's much faster to just re-walk the page tables (which can be hot in local L1 data or L2 cache) to rebuild a TLB entry than to try to share TLB entries across cores. This limits how far it's worth going to avoid TLB misses, unlike with data caches, which are the last line of defence before you have to go off-core to shared L3 cache, or off-chip to DRAM on an L3 miss.
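To make "re-walk the page tables" concrete, here is a minimal sketch of how a virtual address splits into the indices a page walker uses, assuming x86-64 4-level paging with 4 KiB pages. The struct and function names are hypothetical, chosen only for illustration; a real walker is hardware, not C.

```c
#include <stdint.h>

/* Hypothetical illustration type: the four table indices plus the page offset. */
struct walk_indices {
    unsigned pml4;    /* bits 47:39, index into the top-level table */
    unsigned pdpt;    /* bits 38:30 */
    unsigned pd;      /* bits 29:21 */
    unsigned pt;      /* bits 20:12, selects the final PTE */
    unsigned offset;  /* bits 11:0, byte offset within the 4 KiB page */
};

/* Split a virtual address the way a 4-level walk consumes it:
   9 bits per level, then a 12-bit page offset. */
static struct walk_indices split_vaddr(uint64_t va)
{
    struct walk_indices w;
    w.pml4   = (unsigned)((va >> 39) & 0x1FF);
    w.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    w.pd     = (unsigned)((va >> 21) & 0x1FF);
    w.pt     = (unsigned)((va >> 12) & 0x1FF);
    w.offset = (unsigned)(va & 0xFFF);
    return w;
}
```

Each level is one memory load of a page-table entry, so a full walk is up to four dependent loads, all of which can hit in the core's own L1d or L2.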

For example, Skylake added a 2nd page-walk unit (to each core). Good page-walking is essential for workloads where logical cores can't usefully share TLB entries (threads from different processes, or not touching many shared virtual pages).

A shared TLB would mean that invlpg to invalidate cached translations when you do change a page table would always have to go off-core. (Although in practice an OS needs to make sure other cores running other threads of a multi-threaded process have their private TLB entries "shot down" during something like munmap, using software methods for inter-core communication like an IPI (inter-processor interrupt).)

But with private TLBs, a context switch to a new process can just set a new CR3 (top-level page-directory pointer) and invalidate this core's whole TLB without having to bother other cores or track anything globally.
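As a rough illustration of the last two paragraphs, here is a toy software model of private per-core TLBs. Everything in it is an assumption made for the sketch: the core count, the tiny direct-mapped TLB, and all function names. The "shootdown" loop stands in for what an OS does with IPIs, where each targeted core invalidates its own entry.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NCORES      4   /* toy values, not any real CPU */
#define TLB_ENTRIES 8

struct tlb_entry { bool valid; uint64_t vpn; uint64_t pfn; };
struct core      { struct tlb_entry tlb[TLB_ENTRIES]; };

static struct core cores[NCORES];   /* one private TLB per core */

/* Install a translation in one core's TLB (direct-mapped by page number). */
static void tlb_fill(int core, uint64_t vpn, uint64_t pfn)
{
    struct tlb_entry *e = &cores[core].tlb[vpn % TLB_ENTRIES];
    e->valid = true;
    e->vpn = vpn;
    e->pfn = pfn;
}

/* Look up a translation; returns false on a miss (a real CPU would page-walk). */
static bool tlb_lookup(int core, uint64_t vpn, uint64_t *pfn)
{
    const struct tlb_entry *e = &cores[core].tlb[vpn % TLB_ENTRIES];
    if (!e->valid || e->vpn != vpn)
        return false;
    *pfn = e->pfn;
    return true;
}

/* Model of invlpg: affects only the core that executes it. */
static void tlb_invlpg(int core, uint64_t vpn)
{
    struct tlb_entry *e = &cores[core].tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn)
        e->valid = false;
}

/* Model of a shootdown: every core in the mask runs its own local invlpg,
   as it would in an IPI handler. */
static void tlb_shootdown(unsigned core_mask, uint64_t vpn)
{
    for (int c = 0; c < NCORES; c++)
        if (core_mask & (1u << c))
            tlb_invlpg(c, vpn);
}

/* Model of a CR3 write without PCID: drop this core's entries only. */
static void tlb_flush_local(int core)
{
    memset(cores[core].tlb, 0, sizeof cores[core].tlb);
}
```

The point of the model is that `tlb_flush_local` and `tlb_invlpg` never touch another core's state; only the explicit shootdown does, and only for the cores the OS chooses.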

There is a PCID (process context ID) feature that lets TLB entries be tagged with an ID so entries from different processes' page tables can be hot in the TLB instead of needing to be flushed on context switch. (The PCID field in CR3 is 12 bits wide, although an OS may only keep a handful of IDs live on each core.) For a shared TLB you'd need to beef this up. (PCIDs are per-core, so tracking what tasks have been running recently can be done separately for each core.)
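A minimal sketch of how a CR3 value with a PCID could be composed, assuming CR4.PCIDE=1 so that bits 11:0 of CR3 hold the PCID and bit 63 of the written value requests that entries tagged with that PCID not be flushed. The helper name `make_cr3` and its parameters are hypothetical; actually loading CR3 is a privileged `mov` that only kernel code can do.

```c
#include <stdint.h>

/* With CR4.PCIDE=1: PCID occupies CR3 bits 11:0. */
#define CR3_PCID_MASK UINT64_C(0xFFF)
/* Bit 63 of the value written to CR3: keep TLB entries tagged with this PCID. */
#define CR3_NOFLUSH   (UINT64_C(1) << 63)

/* Hypothetical helper: combine a 4 KiB-aligned top-level page-table physical
   address with a PCID, optionally asking the CPU not to flush that PCID. */
static uint64_t make_cr3(uint64_t top_table_phys, uint16_t pcid, int keep_tlb)
{
    uint64_t cr3 = (top_table_phys & ~CR3_PCID_MASK) | (pcid & CR3_PCID_MASK);
    if (keep_tlb)
        cr3 |= CR3_NOFLUSH;
    return cr3;
}
```

For example, `make_cr3(0x1234000, 5, 1)` describes switching to the address space rooted at physical 0x1234000 under PCID 5 while keeping any of its translations still in this core's TLB.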

Another complication is that TLB entries need to track "dirty" and "accessed" bits in the PTE. They're typically a write-through cache of PTEs.

For an example of how the pieces fit together in a real CPU, see David Kanter's writeup of Intel's Sandybridge design. Note that the diagrams are for a single SnB core. The only shared-between-cores cache in most CPUs is the last-level data cache.

Intel's SnB-family designs all use a 2MiB-per-core modular L3 cache on a ring bus. So adding more cores adds more L3 to the total pool, as well as adding new cores (each with their own L2/L1D/L1I/uop-cache, and two-level TLB.)
