When to do or not do INVLPG, MOV to CR3 to minimize TLB flushing

Asked 7/2, 2015 at 16:0 Answered 7/2, 2015 at 17:1

Solved x86 paging x86-64 virtual-memory tlb

Prologue

I am an operating system hobbyist, and my kernel runs on 80486+, and already supports virtual memory.

Starting from 80386, the x86 processor family by Intel and various clones thereof has supported virtual memory with paging. It is well known that when the PG bit in CR0 is set, the processor uses virtual address translation. Then, the CR3 register points to the top-level page directory, that is the root for 2-4 levels of page table structures that map the virtual addresses to physical addresses.

The processor does not consult these tables for each virtual address generated; instead it caches translations in a structure called the Translation Lookaside Buffer, or TLB. However, when changes to the page tables are made, the TLB needs to be flushed. On 80386 processors, this flush would be done by reloading (MOV) CR3 with the top-level page directory address, or by a task switch. This supposedly unconditionally flushes all the TLB entries. As I understand it, it would be perfectly valid for a virtual memory system to always reload CR3 after any change.

This is wasteful, since the TLB would now throw out completely good entries, thus in 80486 processors the INVLPG instruction was introduced. INVLPG will invalidate the TLB entry matching the source operand address.
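A minimal sketch of the two primitives described above, assuming GCC/Clang inline assembly and ring-0 execution. The names `tlb_flush_page`, `tlb_flush_all_nonglobal`, `set_pte_and_flush` and the `KERNEL_BUILD` macro are hypothetical; the hosted stand-ins exist only so the logic can be exercised outside a kernel.

```c
#include <stdint.h>

#ifdef KERNEL_BUILD  /* hypothetical macro: defined only when building ring-0 code */

/* Invalidate the TLB entry for the page containing va (80486 and later). */
static inline void tlb_flush_page(const void *va)
{
    __asm__ volatile("invlpg (%0)" : : "r"(va) : "memory");
}

/* Reload CR3 with its current value: flushes all non-global TLB entries. */
static inline void tlb_flush_all_nonglobal(void)
{
    uintptr_t cr3;
    __asm__ volatile("mov %%cr3, %0" : "=r"(cr3));
    __asm__ volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
}

#else  /* hosted stand-ins that only record what would have been executed */

static unsigned page_flushes;
static unsigned full_flushes;
static const void *last_flushed_page;

static inline void tlb_flush_page(const void *va)
{
    last_flushed_page = va;
    page_flushes++;
}

static inline void tlb_flush_all_nonglobal(void)
{
    full_flushes++;
}

#endif

/* Write one page table entry, then drop any stale translation for va. */
static inline void set_pte_and_flush(volatile uint32_t *pte, uint32_t value,
                                     const void *va)
{
    *pte = value;
    tlb_flush_page(va);
}
```

On an 80386 only the CR3 reload is available, so a kernel targeting it would have to fall back to `tlb_flush_all_nonglobal` for every change.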

Yet starting with Pentium Pro, we also have global pages that are not flushed by moves to CR3 or by a task switch, and the AMD x86-64 ISA says that some upper-level page table structures might be cached and not invalidated by INVLPG. To get a coherent picture of what is and is not needed on each ISA, one would really need to download the 1000-page datasheets for the multitude of ISAs released since the 80s just to read a couple of pages in each, and even then the documents seem particularly vague about TLB invalidation and about what happens if the TLB is not properly invalidated.

Question

For simplicity, one can assume that we are talking about a uniprocessor system. It can also be assumed that no task switch is required after changing the page structures (thus INVLPG is supposedly always at least as good a choice as reloading the CR3 register).

The base assumption is that one would need to reload CR3 after each change to page tables and page directories, and such a system would be correct. However, if one wants to avoid flushing the TLB needlessly, one needs answers to the 2 questions:

Provided that INVLPG is supported on the ISA, after what kind of changes can one safely use it instead of reloading the CR3? E.g. "If one unmaps one page frame (set the corresponding table entry to not present), one can always use INVLPG"?
What kind of changes can one make to the tables and directories without either reloading CR3 or executing INVLPG? E.g. "If a page is not mapped at all (not present), one can write a PTE with Present=1 for it without flushing the TLB at all"?

Even after reading quite a load of ISA documents and everything related to INVLPG here on Stack Overflow, I am not personally sure about either of the examples I presented above. Indeed, one notable post stated it right away: "I don't know exactly when you should use it and when you shouldn't." Thus any certain, correct examples, preferably documented, for either IA32 or x86-64, are appreciated.

Spahi answered 7/2, 2015 at 16:0 Comment(2)

related: some x86 microarchitectures guarantee coherent page walks for changing mappings for valid pages that aren't in the TLB. e.g. on Intel SnB-family CPUs, speculative TLB loads are shot down if a change to that PTE happens before the insn that would use it. Apparently Win95 depended on this, but AMD Bulldozer-family doesn't do this. – Unilateral 10/4, 2016 at 3:22

@PeterCordes you could add some of that as an answer – Whatever 10/4, 2016 at 7:14

In the simplest possible terms, the requirement is that anything the CPU's TLB could have remembered, and that has changed, has to be invalidated before anything that relies on the change happens.

The things that the CPU could have remembered include:

the final permissions for the page (the combination of read/write/execute permissions from the page table entry, page directory entry, etc); including whether the page is present or not (see the warning below)
the physical address of the page
the "accessed" and "dirty" flags
the flags that affect caching
whether it's a normal page or a large (2 or 4 MiB) page or a huge (1 GiB) page

WARNING: Because Intel CPUs don't remember "not present" pages, documentation from Intel may say that you don't need to invalidate when changing a page from "not present" to "present". Intel's documentation is only correct for Intel CPUs. It is not correct for all 80x86 CPUs. Some CPUs (mostly Cyrix) do remember when a page was "not present" and because of those CPUs you do have to invalidate when changing a page from "not present" to "present".

Note that due to speculative execution you cannot cut corners. For example, if you know a page has never been accessed, you can't assume it's not in the TLB, because a TLB entry for it may have been loaded speculatively.

I have chosen the words "before anything that relies on the change happens" very carefully. Modern CPUs (especially for long mode) do cache the higher level paging structures (e.g. PDPT entries) and not just the final pages. This means that if you change a higher level paging structure but the page table entries themselves remain the same, you still need to invalidate.

It also means that it is possible to skip the invalidation if nothing relies on the change. A simple example of this is the accessed and dirty flags: if you aren't relying on these flags (to determine "least recently used" and which pages to send to swap space) then it doesn't matter much if the CPU doesn't realise that you've changed them. It is also possible (not recommended for single-CPU but very recommended for multi-CPU) to skip the TLB invalidation in cases where you'd get a page fault if the CPU uses the old/stale TLB information, and have the page fault handler invalidate if and only if it's actually necessary.
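The lazy approach could be sketched roughly as follows. This is a simplified illustration, not a complete handler: it assumes 32-bit paging, ignores the reserved-bit and instruction-fetch error code bits and CR0.WP, and takes a hypothetical `effective_pte` argument that already combines the permissions of every paging level. `handle_spurious_fault` and the counting `tlb_flush_page` stand-in are made-up names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bits of a 32-bit page table entry. */
#define PTE_PRESENT 0x001u
#define PTE_WRITE   0x002u
#define PTE_USER    0x004u

/* Architectural bits of the page-fault error code. */
#define PF_ERR_PROT  0x1u  /* 0: page not present, 1: protection violation */
#define PF_ERR_WRITE 0x2u  /* faulting access was a write */
#define PF_ERR_USER  0x4u  /* faulting access came from user mode */

/* Hosted stand-in: a kernel would execute INVLPG for va here. */
static unsigned invlpg_count;
static void tlb_flush_page(uintptr_t va)
{
    (void)va;
    invlpg_count++;
}

/*
 * Returns true if the current page tables already allow the faulting
 * access, meaning the fault can only have come from a stale TLB entry.
 * In that case the entry is invalidated and the access can be retried.
 * Returns false for a genuine fault that needs normal handling.
 */
static bool handle_spurious_fault(uint32_t effective_pte, uint32_t err,
                                  uintptr_t va)
{
    if (!(effective_pte & PTE_PRESENT))
        return false;                   /* really not mapped */
    if ((err & PF_ERR_WRITE) && !(effective_pte & PTE_WRITE))
        return false;                   /* really read-only */
    if ((err & PF_ERR_USER) && !(effective_pte & PTE_USER))
        return false;                   /* really supervisor-only */

    tlb_flush_page(va);                 /* stale entry: invalidate lazily */
    return true;
}
```

This only works for changes that make a mapping more permissive; a stale entry that is more permissive than the new tables never faults, so those changes still need an eager invalidation.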

In addition, "anything the CPU's TLB could have remembered" is a little tricky. Often an OS will map the paging structures themselves into the virtual address space to allow fast/easy access to them (e.g. the common "recursive mapping" trick where you pretend the page directory is a page table). In this case, when you change a page directory entry you need to invalidate the affected normal pages (as you'd expect), but you also need to invalidate anything the change affected in those mappings of the paging structures.
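To make the recursive-mapping case concrete, here is a small sketch of the address arithmetic. It assumes plain 32-bit paging (no PAE) with the last page directory slot (index 1023) pointing back at the page directory; that slot choice, the helper names, and the counting `tlb_flush_page` stand-in are assumptions for illustration.

```c
#include <stdint.h>

#define RECURSIVE_SLOT 1023u
/* All page tables appear as one 4 MiB window at 0xFFC00000. */
#define PT_WINDOW_BASE ((uint32_t)RECURSIVE_SLOT << 22)
/* The page directory itself appears as the last page of that window. */
#define PD_WINDOW_BASE (PT_WINDOW_BASE + (RECURSIVE_SLOT << 12))

/* Virtual address at which the PTE for va can be read or written. */
static uint32_t pte_vaddr(uint32_t va)
{
    return PT_WINDOW_BASE + (va >> 12) * 4u;
}

/* Virtual address at which the PDE for va can be read or written. */
static uint32_t pde_vaddr(uint32_t va)
{
    return PD_WINDOW_BASE + (va >> 22) * 4u;
}

/* Virtual page through which the page table of a given PDE is visible. */
static uint32_t pt_window_page(uint32_t pde_index)
{
    return PT_WINDOW_BASE + (pde_index << 12);
}

/* Hosted stand-in: a kernel would execute INVLPG for va here. */
static unsigned invlpg_count;
static uint32_t last_invlpg;
static void tlb_flush_page(uint32_t va)
{
    last_invlpg = va;
    invlpg_count++;
}

/* After changing one PDE: flush the 1024 pages it covers, plus the
 * window page that exposed the old page table. */
static void flush_after_pde_change(uint32_t pde_index)
{
    uint32_t base = pde_index << 22;
    for (uint32_t i = 0; i < 1024u; i++)
        tlb_flush_page(base + (i << 12));
    tlb_flush_page(pt_window_page(pde_index));
}
```

For example, the PTE for virtual address 0x00400000 is visible at 0xFFC01000, which is the window page that also needs invalidating when PDE 1 changes.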

For which to use (INVLPG or reloading CR3) there are several issues. For a single page, INVLPG will be faster. If you change a page directory entry (affecting 1024 pages or 512 pages, depending on which flavour of paging) then using INVLPG in a loop may or may not be more expensive than just reloading CR3 (it depends on CPU/hardware, and on the access patterns of the code following the invalidation).
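One way to express that trade-off is a range flush with a cut-over point, sketched below. The threshold value is a made-up placeholder that would have to be measured on the target CPUs, and `tlb_flush_range` plus the counting stand-ins are hypothetical names. The `may_contain_global` parameter reflects that a CR3 reload leaves global entries in place, so such ranges have to be flushed page by page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u
#define FULL_FLUSH_THRESHOLD 32u /* hypothetical tuning value, not measured */

/* Hosted stand-ins: a kernel would execute INVLPG / reload CR3 here. */
static unsigned invlpg_count;
static unsigned cr3_reload_count;

static void tlb_flush_page(uintptr_t va)
{
    (void)va;
    invlpg_count++;
}

static void tlb_flush_all_nonglobal(void)
{
    cr3_reload_count++;
}

/* Invalidate npages pages starting at start, picking the cheaper method. */
static void tlb_flush_range(uintptr_t start, size_t npages,
                            bool may_contain_global)
{
    /* A CR3 reload is only an option if no global entries are involved. */
    if (!may_contain_global && npages > FULL_FLUSH_THRESHOLD) {
        tlb_flush_all_nonglobal();
        return;
    }
    for (size_t i = 0; i < npages; i++)
        tlb_flush_page(start + i * PAGE_BYTES);
}
```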

There are 2 more issues that come into this. The first is task switching. When switching between tasks that use different virtual address spaces you have to change CR3. This means that if you change something that affects a large area (e.g. a page directory) you can improve overall performance by doing a task switch early, rather than reloading CR3 now (for invalidation) and then reloading CR3 soon after (for the task switch). Basically, it's a "kill 2 birds with one stone" optimisation.

The other thing is "global pages". Typically there are pages that are the same in all virtual address spaces (e.g. the kernel). When you reload CR3 (e.g. during a task switch) you don't want the TLB entries for the pages that remain the same to be invalidated for no reason, because that would hurt performance more than necessary. To fix that and improve performance, (for Pentium Pro and later) there's a feature called "global pages" where you get to mark these common pages as global and they are not invalidated when you reload CR3. In that case, if you need to invalidate global pages you need to use either INVLPG or change CR4 (e.g. disable and then re-enable the global pages feature). For larger areas (e.g. changing a page directory and not just one page) it's the same as before (messing with CR4 may be faster or slower than INVLPG in a loop).
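The CR4 approach could look roughly like this. It is a sketch in which `read_cr4`, `write_cr4` and `reload_cr3` are hosted stand-ins for the privileged register accesses and `tlb_flush_everything` is a hypothetical name; the only architectural fact relied on is that PGE is bit 7 of CR4 and that changing it invalidates all TLB entries, global ones included.

```c
#include <stdint.h>

#define CR4_PGE (1u << 7) /* CR4 bit 7: page global enable */

/* Hosted stand-ins: a kernel would use MOV to/from CR4 and CR3 here. */
static uint32_t fake_cr4 = CR4_PGE;
static unsigned cr4_writes;
static unsigned cr3_reloads;

static uint32_t read_cr4(void)
{
    return fake_cr4;
}

static void write_cr4(uint32_t value)
{
    fake_cr4 = value;
    cr4_writes++;
}

static void reload_cr3(void)
{
    cr3_reloads++;
}

/* Flush every TLB entry, including those marked global. */
static void tlb_flush_everything(void)
{
    uint32_t cr4 = read_cr4();

    if (cr4 & CR4_PGE) {
        write_cr4(cr4 & ~CR4_PGE); /* clearing PGE flushes global entries too */
        write_cr4(cr4);            /* restore the original setting */
    } else {
        reload_cr3();              /* no global pages in use */
    }
}
```

A real kernel would probably want interrupts disabled around the two CR4 writes, so that nothing runs while global pages are temporarily switched off.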

Inflect answered 7/2, 2015 at 17:1 Comment(1)

Excellent, the Cyrix part especially is exactly why I think this is a good thing to ask on Stack Overflow – Whatever 7/2, 2015 at 17:9

To your first question:

You can always use INVLPG and you can do any change possible. Use of INVLPG is always safe.
Reloading CR3 does not invalidate global pages in the TLB. So sometimes you must use INVLPG as reloading CR3 has no effect.
INVLPG must be used for every page involved. If you are changing multiple pages at a time then there comes a point where reloading CR3 is faster than a multitude of INVLPG calls.
Don't forget the Address Space Identifier extension on modern CPUs.

To your second question:

A page that is not mapped can not be cached in the TLB (assuming you invalidated it correctly when you unmapped it previously). So any change from not-present does not need INVLPG or CR3 reloading.

Outland answered 7/2, 2015 at 16:11 Comment(2)

Oh and don't forget about SMP where you have to shootdown the entry in other cores too. – Outland 7/2, 2015 at 16:12

I added the constraint of single processor system, so no need to consider SMP now. – Whatever 7/2, 2015 at 16:17
