What happens after an L2 TLB miss?

I'm struggling to understand what happens when both the first and second levels of the Translation Lookaside Buffer (TLB) miss.

I am unsure whether "page walking" occurs in special hardware circuitry, or whether the page tables are stored in the L2/L3 cache, or whether they only reside in main memory.

Englebert answered 27/8, 2015 at 17:51 Comment(0)

(Some of this is x86 and Intel-specific. Most of the key points apply to any CPU that does hardware page walks. I also discuss ISAs like MIPS that handle TLB misses with software.)

Modern x86 microarchitectures have dedicated page-walk hardware. They can even speculatively do page-walks to load TLB entries before a TLB miss actually happens. And to support hardware virtualization, the page-walkers can handle guest page tables inside a host VM. (Guest physical memory = host virtual memory, more or less. VMWare published a paper with a summary of EPT, and benchmarks on Nehalem).
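A rough worst-case cost model for nested paging (a standard back-of-the-envelope figure, not a number from that paper): with n guest page-table levels and m host levels, each of the n guest-table references is a guest-physical address that needs its own m-step host walk, and the final guest-physical address of the data needs one more.

    worst-case loads = n*m + n + m = (n+1)*(m+1) - 1
    x86-64 guest on x86-64 host (n = m = 4):  4*4 + 4 + 4 = 24 loads, vs. 4 for a native walk

This is why caching intermediate entries inside the page-walk hardware matters even more under virtualization.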

Skylake can even have two page walks in flight at once, see Section 2.1.3 of Intel's optimization manual. (Intel also lowered the page-split load penalty from ~100 to ~5 or 10 extra cycles of latency, about the same as a cache-line split but worse throughput. This may be related, or maybe adding a 2nd page-walk unit was a separate response to discovering that page split accesses (and TLB misses?) were more important than they had previously estimated in real workloads).
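For reference, a "page split" is a single load or store whose bytes straddle a 4 KiB page boundary, so it needs two translations (and two cache lines). A minimal way to provoke one, assuming 4 KiB pages and that the 8-byte memcpy compiles to a single unaligned load, which it typically does at -O1 and higher:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        // Two contiguous pages; the 8-byte load starting at offset 4092
        // spans the boundary at 4096, so it needs two TLB lookups.
        char *buf = aligned_alloc(4096, 2 * 4096);
        memset(buf, 1, 2 * 4096);

        uint64_t v;
        memcpy(&v, buf + 4092, sizeof v);   // page-split (and cache-line-split) load
        return (int)(v & 0xFF);
    }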

Some microarchitectures protect you from speculative page-walks by treating it as mis-speculation when an un-cached PTE is speculatively loaded but then modified with a store to the page table before the first real use of the entry. i.e. snoop for stores to the page table entries for speculative-only TLB entries that haven't been architecturally referenced by any earlier instructions.

(Win9x depended on this, and not breaking important existing code is something CPU vendors care about. When Win9x was written, the current TLB-invalidation rules didn't exist yet so it wasn't even a bug; see Andy Glew's comments quoted below). AMD Bulldozer-family violates this assumption, giving you only what the x86 manuals say on paper.


The page-table loads generated by the page-walk hardware can hit in L1, L2, or L3 caches. Broadwell perf counters, for example, can count page-walk hits in your choice of L1, L2, L3, or memory (i.e. cache miss). The event name is PAGE_WALKER_LOADS.DTLB_L1 for Number of DTLB page walker hits in the L1+FB, and others for ITLB and other levels of cache.

Since modern page tables use a radix-tree format, with page directory entries pointing to tables of page table entries, higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. This means you need to flush the TLB in cases where you might think you didn't need to. Intel and AMD actually do this, according to this paper (section 3). So does ARM, with its Intermediate table walk cache.

That paper says that page-walk loads on AMD CPUs ignore L1, but do go through L2. (Perhaps to avoid polluting L1, or to reduce contention for read ports). Anyway, this makes caching a few high-level PDEs (that each cover many different translation entries) inside the page-walk hardware even more valuable, because a chain of pointer-chasing is more costly with higher latency.
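To make that pointer-chasing chain concrete, here is a rough software model of what the walker does for a 4 KiB page on x86-64. This is a sketch only: it ignores huge pages, permission/NX checks, and accessed/dirty-bit updates, and read_phys is a hypothetical stand-in for a physical-memory read (which, per the above, can hit in the normal data caches):

    #include <stdbool.h>
    #include <stdint.h>

    // Hypothetical stand-in for reading one 64-bit entry from physical memory.
    extern uint64_t read_phys(uint64_t paddr);

    #define PTE_PRESENT  (1ull << 0)
    #define PTE_ADDR     0x000FFFFFFFFFF000ull   // bits 51:12: next table / final frame

    // Walk the 4-level radix tree: CR3 -> PML4 -> PDPT -> PD -> PT -> page frame.
    // Returns false if any level is not-present (the access will #PF if it retires).
    bool page_walk(uint64_t cr3, uint64_t vaddr, uint64_t *paddr)
    {
        uint64_t table = cr3 & PTE_ADDR;
        for (int level = 3; level >= 0; level--) {
            unsigned idx = (vaddr >> (12 + 9 * level)) & 0x1FF;  // 9 index bits per level
            uint64_t entry = read_phys(table + 8 * idx);         // one dependent load per level
            if (!(entry & PTE_PRESENT))
                return false;
            table = entry & PTE_ADDR;
        }
        *paddr = table | (vaddr & 0xFFF);   // final page frame + page offset
        return true;
    }

Each iteration is a load whose address depends on the previous one, which is why caching an upper-level entry inside the walker pays off: a cached PDE or PDPTE (covering 2 MiB or 1 GiB of virtual address space) lets later walks skip the first dependent loads and start lower in the tree.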

But note that Intel guarantees no negative caching of TLB entries. Changing a page from Invalid to Valid doesn't require invlpg. (So if a real implementation does want to do that kind of negative caching, it has to snoop or somehow still implement the semantics guaranteed by Intel manuals.)

There are old Cyrix CPUs that did perform negative caching, though: the common subset of x86 guarantees across vendors isn't always as strong as Intel's. 64-bit kernels should still be able to safely change a PTE from not-present to present without invlpg, because those Cyrix chips were 32-bit-only. (Assuming the Intel, AMD, and Via manuals all agree that it's safe; IDK of any other x86-64 vendors.)
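As a concrete illustration of that rule, here is a kernel-style sketch (not any particular kernel's API; pte is assumed to point at the relevant page-table entry, and the inline asm wraps the privileged invlpg instruction):

    #include <stdint.h>

    static inline void invlpg(void *va)
    {
        // Ring-0 only: invalidate the TLB entry (and walker caches) for one page.
        __asm__ volatile("invlpg (%0)" : : "r"(va) : "memory");
    }

    // Not-present -> present: no invalidation needed on Intel (and in practice on
    // x86-64 generally), because valid translations are only cached after a
    // successful walk; the next access simply walks the tables again.
    void map_page(volatile uint64_t *pte, uint64_t new_pte)
    {
        *pte = new_pte;            // e.g. frame | present | writable bits
    }

    // Present -> not-present, or changing the frame or permissions: the stale
    // translation may still be in the TLB or in cached upper-level entries,
    // so it must be invalidated (plus an IPI-based shootdown on other cores).
    void unmap_page(volatile uint64_t *pte, void *va)
    {
        *pte = 0;
        invlpg(va);
    }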


(Historical note: Andy Glew's answer to a duplicate of this question over on electronics.SE says that in P5 and earlier, hardware page-walk loads bypassed the internal L1 cache (it was usually write-through, so this made pagewalk coherent with stores). IIRC, my Pentium MMX motherboard had L2 cache on the mobo, perhaps as a memory-side cache. Andy also confirms that P6 and later do load from the normal L1d cache.)

That other answer has some interesting links at the end, too, including the paper I linked at the end of the last paragraph. It also seems to think the OS might update the TLB itself, rather than just the page table, on a page fault (when the HW pagewalk doesn't find an entry), and wonders if HW page walking can be disabled on x86. (But actually the OS just modifies the page table in memory, and returning from #PF re-runs the faulting instruction, so the HW pagewalk will succeed this time.) Perhaps the paper is thinking of ISAs like MIPS where software TLB management / miss-handling is possible.

I don't think it's actually possible to disable HW pagewalk on P5 (or any other x86). That would require a way for software to update TLB entries with a dedicated instruction (there isn't one), or with wrmsr or an MMIO store. Confusingly, Andy says (in a thread I quoted below) that software TLB handling was faster on P5. I think he meant it would have been faster if it had been possible. He was working at Imagination (on MIPS) at the time, where SW page walk is an option (sometimes the only option), unlike x86.

Or perhaps he meant using MSRs to set up TLB entries ahead of time in cases where you expect there not to already be one, avoiding some page walks. Apparently 386 / 486 had TLB-entry query / set access via special registers: https://retrocomputing.stackexchange.com/questions/21963/how-did-the-test-registers-work-on-the-i386-and-the-i486 But there's probably no P5 MSR equivalent for that 386/486 functionality.
AFAIK, there wasn't a way to have a TLB miss trap to a software function (with paging disabled?) even on 386/486, so you couldn't fully avoid the HW page walker; you could only prime the TLB to avoid some TLB misses.


As Paul Clayton points out (on another question about TLB misses), the big advantage of hardware page-walks is that TLB misses don't necessarily stall the CPU. (Out-of-order execution proceeds normally, until the re-order buffer fills because the load/store can't retire. Retirement happens in-order, because the CPU can't officially commit anything that shouldn't have happened if a previous instruction faulted.)

BTW, it would probably be possible to build an x86 CPU that handles TLB misses by trapping to microcode instead of having a dedicated hardware state machine. This would be (much?) less performant, and maybe not worth triggering speculatively (since issuing uops from microcode means you can't be issuing instructions from the code that's running).

Microcoded TLB handling could in theory be non-terrible if you run those uops in a separate hardware thread (interesting idea), SMT-style. You'd need it to have much less start/stop overhead than normal Hyperthreading has for switching from single-thread to both logical cores active (which has to wait for things to drain until it can partition the ROB, store queue, and so on), because it will start/stop extremely often compared to a usual logical core. But that may be possible if it's not really a fully separate thread but just some separate retirement state, so cache misses in it don't block retirement of the main code, and it uses a couple of hidden internal registers for temporaries. The code it has to run is chosen by the CPU designers, so the extra HW thread doesn't need anywhere near the full architectural state of an x86 core. It rarely has to do any stores (maybe just for the accessed flags in PTEs?), so it wouldn't be bad to let those stores use the same store queue as the main thread. You'd just partition the front-end to mix in the TLB-management uops and let them execute out of order with the main thread. If you could keep the number of uops per pagewalk small, it might not suck.

No CPUs that I'm aware of actually do "HW" page-walks with microcode in a separate HW thread, but it is a theoretical possibility.


Software TLB handling: some RISCs are like this, not x86

In some RISC architectures (like MIPS), the OS kernel is responsible for handling TLB misses. TLB misses result in execution of the kernel's TLB-miss interrupt handler. This means the OS is free to define its own page table format on such architectures. I guess marking a page as dirty after a write also requires a trap to an OS-provided routine, if the CPU doesn't know about the page table format.

This chapter from an operating systems textbook explains virtual memory, page tables, and TLBs. It describes the difference between software-managed TLBs (MIPS, SPARCv9) and hardware-managed TLBs (x86). The paper A Look at Several Memory Management Units, TLB-Refill Mechanisms, and Page Table Organizations shows some example code from what it says is the TLB miss handler in Ultrix, if you want a real example.
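For flavor, here is roughly what such a refill handler does, as C pseudocode. Everything here is a stand-in, not real MIPS code: tlb_write_random models the tlbwr instruction, faulting_vaddr models the BadVAddr register, and the flat page_table array is just one possible OS-chosen format (real handlers are a handful of assembly instructions in a dedicated exception vector):

    #include <stdint.h>

    #define PAGE_SHIFT 12

    extern uint64_t page_table[];         // OS-defined: one entry per virtual page
    extern uint32_t faulting_vaddr;       // stand-in for the BadVAddr CP0 register
    extern void tlb_write_random(uint32_t vpn, uint64_t entry);  // stand-in for tlbwr
    extern void page_fault(uint32_t vaddr);                      // slow path: allocate / swap in

    // Entered directly by the CPU on a TLB miss; the OS owns the page-table format.
    void tlb_refill_handler(void)
    {
        uint32_t vaddr = faulting_vaddr;
        uint32_t vpn   = vaddr >> PAGE_SHIFT;
        uint64_t entry = page_table[vpn];

        if (entry & 1)                        // valid bit set: just install it
            tlb_write_random(vpn, entry);     // write the translation into a TLB slot
        else
            page_fault(vaddr);                // no mapping: take the full page-fault path
    }

A dirty-bit update works the same way on such machines: on MIPS, for example, a store to a page whose TLB entry isn't marked writable-dirty traps, and the handler sets the dirty bit in the OS page table and rewrites the TLB entry.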


Other links


Comments about TLB coherency from Andy Glew, one of the architects of Intel's P6 (Pentium Pro / II / III), who later worked at AMD.

The main reason Intel started running the page table walks through the cache, rather than bypassing the cache, was performance. Prior to P6, page table walks were slow, not benefitting from cache, and were non-speculative. Slow enough that software TLB miss handling was a performance win¹. P6 sped TLB misses up by doing them speculatively, using the cache, and also by caching intermediate nodes like page directory entries.

By the way, AMD was reluctant to do TLB miss handling speculatively. I think because they were influenced by DEC VAX Alpha architects. One of the DEC Alpha architects told me rather emphatically that speculative handling of TLB misses, such as P6 was doing, was incorrect and would never work. When I arrived at AMD circa 2002 they still had something called a "TLB Fence" - not a fence instruction, but a point in the rop or microcode sequence where TLB misses either could or could not be allowed to happen - I am afraid that I do not remember exactly how it worked.

so I think that it is not so much that Bulldozer abandoned TLB and page table walking coherency, whatever that means, as that Bulldozer may have been the first AMD machine to do moderately aggressive TLB miss handling.

recall that when P6 was started P5 was not shipping: the existing x86es all did cache bypass page table walking in-order, non-speculatively, no asynchronous prefetches, but on write through caches. I.e. They WERE cache coherent, and the OS could rely on deterministic replacement of TLB entries. IIRC I wrote those architectural rules about speculative and non-deterministic cacheability, both for TLB entries and for data and instruction caches. You can't blame OSes like Windows and UNIX and Netware for not following page table and TLB management rules that did not exist at the time.


Footnote 1: This is the surprising claim I mentioned earlier, possibly referring to using MSRs to prime the TLB to hopefully avoid some page walks.


More from Andy Glew from the same thread, because these comments deserve to be in a full answer somewhere.

(2) one of my biggest regrets wrt P6 is that we did not provide Intra-instruction TLB consistency support. Some instructions access the same page more than once. It was possible for different uops in the same instruction to get different translations for the same address. If we had given microcode the ability to save a physical address translation, and then use that, things would have been better IMHO.

(2a) I was a RISC proponent when I joined P6, and my attitude was "let SW (microcode) do it".

(2a') one of the most embarrassing bugs was related to add-with-carry to memory. In early microcode. The load would go, the carry flag would be updated, and the store could fault - but the carry flag had already been updated, so the instruction could not be restarted. // it was a simple microcode fix, doing the store before the carry flag was written - but one extra uop was enough to make that instruction not fit in the "medium speed" ucode system.

(3) Anyway - the main "support" P6 and its descendants gave to handling TLB coherency issues was to rewalk the page tables at retirement before reporting a fault. This avoided confusing the OS by reporting a fault when the page tables said there should not be one.

(4) meta comment: I don't think that any architecture has properly defined rules for caching of invalid TLB entries. // AFAIK most processors do not cache invalid TLB entries - except possibly Itanium with its NAT (Not A Thing) pages. But there's a real need: speculative memory accesses may be to wild addresses, miss the TLB, do an expensive page table walk, slowing down other instructions and threads - and then doing it over and over again because the fact that "this is a bad address, no need to walk the page tables" is not remembered. // I suspect that DOS attacks could use this.

(4') worse, OSes may make implicit assumptions that invalid translations are never cached, and therefore not do a TLB invalidation or MP TLB shoot down when transitioning from invalid to valid. // Worse^2: imagine that you are caching interior nodes of the page table cache. Imagine that the PD contains all invalid PDEs; worse^3, that the PD contains valid PDEs that point to PTs that are all invalid. Are you still allowed to cache those PDEs? Exactly when does the OS need to invalidate an entry?

(4'') because MP TLB shoot downs using interprocessor interrupts were expensive, OS performance guys (like I used to be) are always making arguments like "we don't need to invalidate the TLB after changing a PTE from invalid to valid" or "from valid read-only to valid writable with a different address". Or "we don't need to invalidate the TLB after changing a PDE to point to a different PT whose PTEs are exactly the same as the original PT...". // Lots of great ingenious arguments. Unfortunately not always correct.

Some of my computer architect friends now espouse coherent TLBs: TLBs that snoop writes just like data caches. Mainly to allow us to build even more aggressive TLBs and page table caches, if both valid and invalid entries of leaf and interior nodes. And not to have to worry about OS guys' assumptions. // I am not there yet: too expensive for low end hardware. But might be worth doing at high end.

me: Holy crap, so that's where that extra ALU uop comes from in memory-destination ADC, even on Core2 and SnB-family? Never would have guessed, but had been puzzled by it.

Andy: often when you "do the RISC thing" extra instructions or micro instructions are required, in a careful order. Whereas if you have "CISCy" support, like special hardware support so that a single instruction is a transaction, either all done or all not done, shorter code sequences can be used.

Something similar applies to self modifying code: it was not so much that we wanted to make self modifying code run fast, as that trying to make the legacy mechanisms for self modifying code - draining the pipe for serializing instructions like CPUID - were slower than just snooping the Icache and pipeline. But, again, this applies to a high end machine: on a low end machine, the legacy mechanisms are fast enough and cheap.

Ditto memory ordering. High end snooping faster; low end draining cheaper.

It is hard to maintain this dichotomy.

It is pretty common that a particular implementation has to implement rules compatible with but stronger than the architectural statement. But not all implementations have to do it the same way.

This comment thread was on Andy's answer to a question about self-modifying code and seeing stale instructions; another case where real CPUs go above and beyond the requirements on paper, because it's actually easier to always snoop for stores near EIP/RIP than to re-sync only on branch instructions if you didn't keep track of what happened between branches.

Tega answered 27/8, 2015 at 20:29 Comment(19)
Good answer. Usually calling the OS to do a pagewalk is very unfriendly for performance, so most architectures keep that for special cases such as page faults. – Poulos
Hi, thank you! Could you just clarify what you mean by "reads the page tables itself"? As opposed to what? Thanks. – Englebert
@user997112: As opposed to architectures like MIPS, described in the 2nd paragraph, where an OS function updates the TLB. See stackoverflow.com/questions/29565312/… for a more detailed explanation. – Tega
@Leeor: MIPS apparently has software TLB updates. See my comments at the above link, and the post I left them under. – Tega
Sorry guys, let me reword my question. Are you saying that the CPU contains circuitry to walk the page table, as opposed to loading software instructions from the L1 instruction cache to do it? So it's a case of circuitry vs. L1 instruction cache instructions. And this circuitry can still find the page table within the L1/L2/L3 cache / main memory? – Englebert
@user997112: On x86, it's definitely all internal to the CPU. What I don't know is whether Intel's current designs run uops from microcode on the usual execution ports, or whether there's separate hardware that can work on walking the page tables after a TLB miss while non-memory instructions execute completely unhindered. Either way, the L1 instruction cache certainly won't be involved. The loads it does as it reads page table data probably use the normal cache hierarchy, so code that TLB misses a lot will probably have most of the internally-generated page table loads hit in L3, L2, or even L1D cache. – Tega
@user997112: This looks relevant: stackoverflow.com/questions/9338236/…. Internally-generated page table lookups can definitely hit in the data caches. I also found lwn.net/Articles/379748, which focuses on hugepages, but goes into some details about TLB performance (although some PPC, not just x86). – Tega
@PeterCordes: you said that one advantage of a HW page table walker is that it can run at the same time as other code from the same program, versus SW or microcode which would have to stop the original code. This is true on all current machines I am familiar with, but it doesn't need to be: consider handling the TLB miss in a different HW thread. – Scornik
@KrazyGlew: Neat idea, added a paragraph about that. The extra HW thread wouldn't need anywhere near as much state as a full x86 core, and only has to do this, so you wouldn't want to partition the ROB or store queue, just the front-end. – Tega
@PeterCordes this touches on something I don't know. My guess is that an I/STLB miss condition is also handled silently by the PMH stuffing loads in the MOB (which must have opcodes that use physical addresses to directly access L1d). If it needs to write A/D bits it needs a microcode assist for some reason, which will flag on the current retiring uop. I would say it then uses the IQ to replay macro-ops, while flushing and resteering the pre-IQ pipeline to the uop that failed to retire, but I'm not sure how that's implemented now that some of the uops can be from the uop cache. May replay from the IDQ. – Diver
@LewisKelsey: My mental model (based on guesswork) was that HW page walk accessed L1d cache separately from the load execution units, not ordered wrt. loads/stores done by the program. Possibly with another read port for L1d, or by competing with load execution units for access to actual L1d cache read ports. (Or with the read port used for transfers to L2, if that's separate; might be a better choice.) So no need to touch the MOB. But I guess it makes sense that stores to modify A/D bits need more ordering or more control, or just that they didn't want to multiplex the cache write port. – Tega
@PeterCordes 'a unique microop communicates event info from "early" units in the pipeline (e.g., IFU and ID) to the fault info field of the ROB. Units in the in-order part of the processor report events by signaling the ID to insert the sig-- event uop into the micro op stream. For example, upon detection of a page fault the IFU causes the ID to insert the sig-- event uop into the instr stream including fault info indicating the nature of the fault. Similarly, if the ID receives an illegal instruction, it inserts into the instr stream a sig-- event uop specifying the nature of the fault.' hmm – Diver
@LewisKelsey: Interesting. I guess that makes sense as a mechanism to trigger a #PF or #UD once this still-speculative code fetch reaches retirement. But remember that a page-fault can't be detected until after a page-walk completes (because TLBs don't do negative caching), and that speculative early page-walk is very much allowed and encouraged, so this doesn't (to me) seem to conflict with what I suggested for HW page walk. I'm pretty sure HW page walk doesn't involve normal uops that show up in uops_executed.any or other normal counters, or even uops_dispatched_port.port_2 or 3. – Tega
@PeterCordes yes I would agree. Only page faults during the walk cause an exception (which sets up the call to the software handler), and maybe also when it needs to set accessed / dirty bits. – Diver
@LewisKelsey: page tables use physical addresses; you can't fault during a walk (except for needing to trigger an assist to set an A bit, and maybe a D bit for stores). The result of a walk could be that there's no valid mapping, so the load, store, or code-fetch that triggered the walk should fault (if it turns out to be on the true path of execution). For code-fetch, the front-end can't do anything else while waiting for a demand-miss page walk; it could just wait and insert a uop at that point if the page-walk result comes back invalid (or as needing an assist). This is my guesswork. – Tega
That's what I meant. The result is there's no valid mapping in the page table, i.e. a page fault. The front end stalls and the fault uop obviously propagates to retirement. This might be part of a misspeculated path, so it may never have to be done, which is the benefit of not triggering the fault immediately. But also, so it is associated with the correct instruction in order. Also, unrelated: I think I worked out how the BOB works, finally. https://mcmap.net/q/14272/-what-branch-misprediction-does-the-branch-target-buffer-detect (I just rewrote this to the new model). – Diver
@Noah: no, you need an access to a deleted page to fault, not a store to a page that you've now given to another process or used for the pagecache. Adding a new mapping doesn't need invlpg because the x86 ISA manuals basically promise that the hardware won't do "negative caching", or at least that you don't need invlpg there, so any negative caching would require snooping to behave as if each access will check the actual page tables again if the previous one faulted. – Tega
@PeterCordes: Some 80x86 CPUs do "negative caching" (specifically old Cyrix chips). Intel promises that Intel CPUs won't do "negative caching"; but Intel (and Intel's manuals) don't/can't speak for other vendors (AMD, VIA, Cyrix, IBM, SiS, NexGen, ...). – Deterrent
Thanks. I had hoped that Intel's TLB guarantees were also the common subset for x86 CPUs (and when I originally wrote this I probably assumed that). Maybe that is the case for x86-64. But yeah, store atomicity is another clear example of Intel's guarantees being stronger than the common subset that "x86" software can depend on, something I've come to understand since writing this. – Tega
