Is the TLB shared between multiple cores?

I've heard that the TLB is maintained by the MMU, not the CPU cache.
So does one TLB exist on the CPU, shared between all cores, or does each core have its own TLB?

Could anyone please explain the relationship between the MMU and the L1 and L2 caches?

Almena answered 23/12, 2015 at 14:3 Comment(2)
Both private and shared TLB designs have been explored, and they offer different tradeoffs. See my survey paper on TLB for a detailed discussion. - Rostellum
It depends on the implementation. cpuid can show TLB information for your PC. - Abecedarium

The TLB caches the translations listed in the page table. Each CPU core can be running in a different context, with different page tables. Each core has its own MMU, although it isn't really a separate unit: it's part of the core (parts of the load/store ports, the TLB, and the page walker). Any shared caches are always physically indexed / physically tagged, so they cache based on the post-MMU physical address.
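To make the "private TLB in front of a page table" idea concrete, here is a minimal Python sketch. It is a toy model, not how any real hardware is organized: the `Core` class, its dict-based page table, and the hit/miss counters are all invented for illustration.

```python
class Core:
    """Toy model of one core: a private TLB in front of a page table."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical frame number
        self.tlb = {}                 # private cache of recently used translations
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr, page_size=4096):
        vpn, offset = divmod(vaddr, page_size)
        if vpn in self.tlb:
            self.hits += 1
        else:
            self.misses += 1
            # The dict lookup stands in for a hardware page walk.
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * page_size + offset


# Two cores running different processes: same virtual address, different frames.
core0 = Core({0x10: 0x200})
core1 = Core({0x10: 0x300})
paddr0 = core0.translate(0x10123)
paddr1 = core1.translate(0x10123)
```

Because each core caches only its own context's translations, the same virtual address can map to different physical addresses on different cores without any coordination between them.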

The TLB is an implementation detail (just a cache of PTEs, page table entries) that could vary by microarchitecture. In practice, on mainstream CPUs, all that really varies is the size, and the TLB is per-core. Two-level TLBs are common now, to keep full TLB misses to a minimum while the first level stays small and fast enough to allow 3 translations per clock cycle (for data loads/stores, in parallel with the iTLB).

It's much faster to just re-walk the page tables (which can be hot in local L1 data or L2 cache) to rebuild a TLB entry than to try to share TLB entries across cores. That cheap fallback limits how far it's worth going to avoid TLB misses, unlike data caches, which are the last line of defence before you have to go off-core to shared L3 cache, or off-chip to DRAM on an L3 miss.
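As a rough illustration of what "re-walking the page tables" involves, the sketch below splits a 48-bit x86-64 virtual address into the four 9-bit table indices and the 12-bit page offset used by 4-level paging with 4 KiB pages, then follows four dependent lookups. The nested-dict `tables` layout and the table names are hypothetical stand-ins for physical memory; real hardware reads 8-byte entries from 4 KiB tables, and each of those reads can hit in the core's L1d or L2.

```python
def split_vaddr(vaddr):
    """Split a 48-bit x86-64 virtual address (4-level paging, 4 KiB pages)."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,
        "pdpt": (vaddr >> 30) & 0x1FF,
        "pd": (vaddr >> 21) & 0x1FF,
        "pt": (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }


def walk(tables, root, vaddr):
    """Toy page walk: four dependent lookups, one per level."""
    idx = split_vaddr(vaddr)
    entry = root
    for level in ("pml4", "pdpt", "pd", "pt"):
        entry = tables[entry][idx[level]]  # next table, or the frame at the last level
    return entry * 4096 + idx["offset"]


# Hypothetical table contents: each table maps an index to the next table's name.
tables = {"root": {1: "a"}, "a": {2: "b"}, "b": {3: "c"}, "c": {4: 0x42}}
vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
paddr = walk(tables, "root", vaddr)
```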

For example, Skylake added a 2nd page-walk unit (to each core). Good page-walking is essential for workloads where logical cores can't usefully share TLB entries (threads from different processes, or not touching many shared virtual pages).

A shared TLB would mean that invlpg to invalidate cached translations when you do change a page table would always have to go off-core. (Although in practice an OS needs to make sure other cores running other threads of a multi-threaded process have their private TLB entries "shot down" during something like munmap, using software methods for inter-core communication like an IPI (inter-processor interrupt).)
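The shootdown idea can be sketched as a small simulation. Everything here is a simplification made up for illustration: `ToyCore`, `unmap_page`, and the `running_on` set are hypothetical names, and the policy of notifying only cores that run the process is one possible OS choice, not a description of what any particular kernel does.

```python
class ToyCore:
    def __init__(self, core_id):
        self.core_id = core_id
        self.tlb = {}  # virtual page number -> physical frame number

    def handle_shootdown(self, vpn):
        # Stands in for the IPI handler invalidating one page on this core.
        self.tlb.pop(vpn, None)


def unmap_page(page_table, cores, running_on, vpn):
    """Remove a mapping, then invalidate stale entries on the chosen cores."""
    del page_table[vpn]
    notified = []
    for core in cores:
        if core.core_id in running_on:  # assumed policy: only cores running this process
            core.handle_shootdown(vpn)
            notified.append(core.core_id)
    return notified


cores = [ToyCore(i) for i in range(4)]
page_table = {0x10: 0x200}
for core in cores[:2]:
    core.tlb[0x10] = 0x200  # two threads of the process have touched the page
notified = unmap_page(page_table, cores, running_on={0, 1}, vpn=0x10)
```

Cores 2 and 3 are never interrupted in this model, which is the benefit of private TLBs: invalidation cost is paid only where stale entries can exist.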

But with private TLBs, a context switch to a new process can just set a new CR3 (top-level page-directory pointer) and invalidate this core's whole TLB without having to bother other cores or track anything globally.

There is a PCID (process-context identifier) feature that lets TLB entries be tagged with an ID (12 bits on x86-64, although an OS may keep only a handful live per core), so entries from different processes' page tables can stay hot in the TLB instead of needing to be flushed on context switch. For a shared TLB you'd need to beef this up. (PCIDs are per-core, so tracking what tasks have been running recently can be done separately for each core.)
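A minimal sketch of how tagging helps, assuming a TLB keyed by (ID, virtual page number). The `TaggedTLB` class and its `flush` parameter are hypothetical names for illustration; real hardware also handles global pages, limited capacity, and replacement, none of which is modelled here.

```python
class TaggedTLB:
    """Toy per-core TLB whose entries are tagged with a context ID."""

    def __init__(self):
        self.entries = {}      # (pcid, vpn) -> frame
        self.current_pcid = 0

    def switch_to(self, pcid, flush=False):
        """Model a context switch; optionally drop old entries for the new ID."""
        if flush:
            self.entries = {k: v for k, v in self.entries.items() if k[0] != pcid}
        self.current_pcid = pcid

    def fill(self, vpn, frame):
        self.entries[(self.current_pcid, vpn)] = frame

    def lookup(self, vpn):
        # Only entries tagged with the current ID can hit.
        return self.entries.get((self.current_pcid, vpn))


tlb = TaggedTLB()
tlb.switch_to(1)
tlb.fill(0x10, 0x200)
tlb.switch_to(2)
miss_in_other_context = tlb.lookup(0x10)  # process 2 cannot see process 1's entry
tlb.fill(0x10, 0x300)
tlb.switch_to(1)
frame_after_switch_back = tlb.lookup(0x10)  # still cached, no re-walk needed
```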

Another complication is that TLB entries need to track "dirty" and "accessed" bits in the PTE. They're typically a write-through cache of PTEs.
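The write-through behaviour can be sketched as follows. The bit positions are the x86 "accessed" (bit 5) and "dirty" (bit 6) PTE flags; the `access` function and the dict-based page table and TLB are a simplified, hypothetical model of when those bits get pushed back to the in-memory PTE.

```python
PTE_PRESENT = 1 << 0
PTE_ACCESSED = 1 << 5  # x86 "A" bit
PTE_DIRTY = 1 << 6     # x86 "D" bit


def access(page_table, tlb, vpn, is_write):
    """Toy model: update A/D in the in-memory PTE the first time they are needed."""
    if vpn not in tlb:
        # Set during the page walk that fills the TLB entry.
        page_table[vpn] |= PTE_ACCESSED
        tlb[vpn] = page_table[vpn]
    if is_write and not tlb[vpn] & PTE_DIRTY:
        # Written through to memory, not kept only in the TLB copy.
        page_table[vpn] |= PTE_DIRTY
        tlb[vpn] = page_table[vpn]
    return tlb[vpn]


page_table = {0x10: PTE_PRESENT}
tlb = {}
after_read = access(page_table, tlb, 0x10, is_write=False)
after_write = access(page_table, tlb, 0x10, is_write=True)
```

With a TLB shared between cores, every such update would be another piece of state that has to stay coherent across all the cores using the entry.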


For an example of how the pieces fit together in a real CPU, see David Kanter's writeup of Intel's Sandybridge design. Note that the diagrams are for a single SnB core. The only shared-between-cores cache in most CPUs is the last-level data cache.

Intel's SnB-family designs all use a 2MiB-per-core modular L3 cache on a ring bus. So adding more cores adds more L3 to the total pool, as well as adding new cores (each with their own L2/L1D/L1I/uop-cache, and two-level TLB.)

Caylacaylor answered 23/12, 2015 at 14:17 Comment(2)
With private TLBs, if the page of a given process is unmapped, are TLB shootdown IPIs sent to all cores or just the cores where threads of the process might be running? Furthermore, if a single page is unmapped / remapped, are all the TLB entries of a given process invalidated or is there metadata associated with the invalidation IPI? - Coyote
@user3882729: I'd assume an IPI TLB shootdown includes an address of the one page (or page-range?) to invalidate. As for only interrupting a limited set of cores, that would be up to the OS to decide how it wants to do it. I'm not sure what mainstream OSes currently do; TLB shootdowns are one thing I haven't looked into very much. Hmm, now I'm curious how it interacts with process-context IDs when there's a valid TLB entry for a task that isn't the currently-active page table on some other cores. I don't think x86-64 invlpg can invalidate a page from a different PCID. - Caylacaylor
