How do x86 page tables work?
B

1

27

I'm familiar with the MIPS architecture, which has a software-managed TLB. So how and where you (the operating system) want to store the page tables and the page table entries is completely up to you. For example, I did a project with a single inverted page table; I saw others using 2-level page tables per process.

But what's the story with x86? From what I know, the TLB is hardware-managed. Does x86 basically tell you, "Hey, this is where the page table entries you're currently using need to go [physical address range]"? But wait, I've always thought x86 uses multi-level page tables, so would it tell you where to put the 1st level or something...? I'm confused.

Thanks for any help.

Branca answered 20/5, 2012 at 5:55 Comment(3)
Don't be confused. Read the docs. The official CPU documentation from Intel and AMD describes page tables pretty well.Gales
Intel® 64 and IA-32 Architectures Software Developer ManualsOrlan
This may help.Raye
M
35

Once paging is enabled, the CR3 register points to a "page directory" (you can put it anywhere you want in physical memory, as long as it's page-aligned, before you turn paging on), which is a page of memory (remember, a "small" page is 4 KiB, and a "large" page is 4 MiB) with 1024 page directory entries (PDEs) that point to "page tables". Each entry is the top 20 bits of a pointer (the physical address of the page table), plus a bunch of flags that make up the bottom 12 bits of the entry (present, permission, dirty, etc.).

(The 1024 just comes from the fact that a page is 4096 bytes and a pointer is 4 bytes.)

Each "page table" is itself a page of 1024 "page table entries" (PTEs), which point to physical pages in memory, along with a bunch of (almost the same) flags.
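To make the entry layout concrete, here is a minimal Python sketch of how a 32-bit PDE/PTE packs a page-aligned physical address together with its flags. The bit positions are the ones documented for 32-bit paging in Intel's manual; the helper names (`make_entry`, `entry_address`, `entry_flags`) are made up for illustration and are not part of any real API.

```python
# Sketch of the 32-bit (non-PAE) PDE/PTE layout: address in bits 31:12,
# flags in bits 11:0. Helper names are hypothetical.
PAGE_SIZE = 4096
ENTRY_SIZE = 4                                # a 32-bit entry is 4 bytes
ENTRIES_PER_TABLE = PAGE_SIZE // ENTRY_SIZE   # 1024

PRESENT = 1 << 0    # entry is valid
WRITABLE = 1 << 1   # writes allowed
USER = 1 << 2       # user-mode access allowed
ACCESSED = 1 << 5   # set by the CPU when the entry is used
DIRTY = 1 << 6      # set by the CPU on a write through a PTE
LARGE = 1 << 7      # in a PDE: maps a 4 MiB page instead of a page table

FLAG_MASK = 0x00000FFF
ADDR_MASK = 0xFFFFF000


def make_entry(phys_addr, flags):
    """Combine a 4 KiB-aligned physical address with flag bits."""
    if phys_addr & FLAG_MASK:
        raise ValueError("physical address must be 4 KiB aligned")
    return (phys_addr & ADDR_MASK) | (flags & FLAG_MASK)


def entry_address(entry):
    """Top 20 bits: physical address of the page table or page."""
    return entry & ADDR_MASK


def entry_flags(entry):
    """Bottom 12 bits: the flags."""
    return entry & FLAG_MASK


pte = make_entry(0x00123000, PRESENT | WRITABLE)
print(hex(pte))                 # 0x123003
print(hex(entry_address(pte)))  # 0x123000
```

Because tables and pages are always 4 KiB aligned, the low 12 bits of their addresses are always zero, which is why the hardware can reuse them for flags.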

So, to translate a 32-bit virtual address, you take the top 10 bits of the pointer as an index into the table at CR3 (since there are 2^10 entries), and -- if that PDE is further subdivided (meaning it isn't a "large" page, which you can figure out from the flags) -- you take the top 20 bits of the PDE, look up the page table at that address, and index into it with the virtual address's next-topmost 10 bits. The topmost 20 bits of that PTE then refer you to the physical page, assuming the flags in its bottom 12 bits tell you the page is actually present, and the bottom 12 bits of the virtual address are the offset within that page.
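The walk described above can be sketched in a few lines of Python. This is only a software model of what the MMU does in hardware: `phys_mem` is a hypothetical dict standing in for physical memory (table address to list of 1024 entries), and the physical addresses used are arbitrary. It ignores permission checks and the extra address bits that PSE-36 puts in large-page PDEs.

```python
# Toy model of a 32-bit, two-level x86 page walk (no PAE).
PRESENT = 1 << 0
LARGE = 1 << 7  # PDE maps a 4 MiB page (real hardware also needs CR4.PSE set)


class PageFault(Exception):
    """Raised where the real CPU would deliver a page-fault exception."""


def split(vaddr):
    """Split a virtual address into (PD index, PT index, offset): 10/10/12 bits."""
    return (vaddr >> 22) & 0x3FF, (vaddr >> 12) & 0x3FF, vaddr & 0xFFF


def translate(phys_mem, cr3, vaddr):
    pd_index, pt_index, offset = split(vaddr)
    pde = phys_mem[cr3 & 0xFFFFF000][pd_index]
    if not pde & PRESENT:
        raise PageFault(hex(vaddr))
    if pde & LARGE:
        # 4 MiB page: top 10 bits from the PDE, low 22 bits from the address.
        return (pde & 0xFFC00000) | (vaddr & 0x003FFFFF)
    pte = phys_mem[pde & 0xFFFFF000][pt_index]
    if not pte & PRESENT:
        raise PageFault(hex(vaddr))
    # 4 KiB page: top 20 bits from the PTE, low 12 bits from the address.
    return (pte & 0xFFFFF000) | offset


page_dir = [0] * 1024
page_table = [0] * 1024
phys_mem = {0x1000: page_dir, 0x2000: page_table}

page_dir[1] = 0x2000 | PRESENT              # PDE 1 -> page table at 0x2000
page_table[3] = 0x5000 | PRESENT            # PTE 3 -> physical page 0x5000
page_dir[2] = 0x00C00000 | PRESENT | LARGE  # PDE 2 -> 4 MiB page at 0xC00000

print(hex(translate(phys_mem, 0x1000, 0x00403ABC)))  # 0x5abc
print(hex(translate(phys_mem, 0x1000, 0x00812345)))  # 0xc12345
```

Any address whose PDE or PTE lacks the present flag raises `PageFault` here, which is the point where a real OS would get control and decide what to do.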

If you're using Physical Address Extension (PAE), then you get another level in the hierarchy at the very top.
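As a rough illustration of that extra level: with PAE on a 32-bit CPU, entries grow to 8 bytes, so each table holds 512 entries instead of 1024, and the virtual address is split 2/9/9/12 instead of 10/10/12. The function name `split_pae` below is just for this sketch.

```python
# Sketch: how a 32-bit virtual address is carved up under PAE.
PAGE_SIZE = 4096
PAE_ENTRY_SIZE = 8                                    # 64-bit entries
PAE_ENTRIES_PER_TABLE = PAGE_SIZE // PAE_ENTRY_SIZE   # 512


def split_pae(vaddr):
    """Return (PDPT index, PD index, PT index, offset) = 2/9/9/12 bits."""
    return (
        (vaddr >> 30) & 0x3,    # selects one of 4 page-directory-pointer entries
        (vaddr >> 21) & 0x1FF,  # selects one of 512 PDEs
        (vaddr >> 12) & 0x1FF,  # selects one of 512 PTEs
        vaddr & 0xFFF,          # byte offset inside the 4 KiB page
    )


print(split_pae(0xC0403ABC))  # (3, 2, 3, 2748)
```

The top-level table has only four entries, which is why the new level costs just 2 bits of the address.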

Note: for your own sanity (and maybe the CPU's), you'd probably want to map the page directory and the page table to themselves, otherwise things get confusing fast. :)
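One common way to do this self-mapping is the "recursive mapping" trick: point one PDE (often the last one, index 1023) at the page directory itself. That choice of slot is an assumption of this sketch, not something the hardware requires, and the function names are made up. With that in place, every page table and the directory itself show up at predictable virtual addresses:

```python
# Sketch: virtual addresses of the paging structures under a recursive
# mapping, assuming PDE 1023 points at the page directory (32-bit, no PAE).
RECURSIVE_SLOT = 1023


def page_table_vaddr(pd_index):
    """Virtual address where the page table for PDE `pd_index` appears."""
    return (RECURSIVE_SLOT << 22) | (pd_index << 12)


def page_directory_vaddr():
    """The directory is 'the page table' of the recursive slot itself."""
    return page_table_vaddr(RECURSIVE_SLOT)


def pte_vaddr(vaddr):
    """Virtual address of the PTE that maps `vaddr` (entries are 4 bytes)."""
    return (RECURSIVE_SLOT << 22) + (vaddr >> 12) * 4


def pde_vaddr(vaddr):
    """Virtual address of the PDE that covers `vaddr`."""
    return page_directory_vaddr() + (vaddr >> 22) * 4


print(hex(page_table_vaddr(1)))     # 0xffc01000
print(hex(page_directory_vaddr()))  # 0xfffff000
print(hex(pte_vaddr(0x00403ABC)))   # 0xffc0100c
print(hex(pde_vaddr(0x00403ABC)))   # 0xfffff004
```

The cost is 4 MiB of virtual address space (the top one, with this slot); the benefit is that the kernel can edit any entry through an ordinary pointer without mapping tables in by hand.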

The TLB is hardware-managed -- so the caching of the page tables is transparent -- but there is an instruction, INVLPG, that invalidates the TLB entry for a given page for you. (I don't know exactly when you should use it and when you shouldn't.)
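As a rough mental model of why such an instruction exists: the CPU does not notice when the OS edits a page table entry in memory, so a translation that is already cached keeps being used until it is invalidated. The toy class below is purely illustrative (a dict-backed single-level "page table" and hypothetical method names), not how a real TLB is organized.

```python
# Toy model of a TLB caching translations from a page table held in memory.
class ToyTLB:
    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical page number
        self.cache = {}               # translations the "hardware" has cached

    def translate(self, vaddr):
        vpn, offset = vaddr >> 12, vaddr & 0xFFF
        if vpn not in self.cache:          # TLB miss: walk the page table
            self.cache[vpn] = self.page_table[vpn]
        return (self.cache[vpn] << 12) | offset

    def invlpg(self, vaddr):
        """Drop the cached translation for the page containing vaddr."""
        self.cache.pop(vaddr >> 12, None)

    def flush(self):
        """Drop everything, roughly what reloading CR3 does for non-global pages."""
        self.cache.clear()


page_table = {0x403: 0x5}
tlb = ToyTLB(page_table)

first = tlb.translate(0x403ABC)   # miss, then cached
page_table[0x403] = 0x9           # the OS remaps the page in memory
stale = tlb.translate(0x403ABC)   # still the old translation
tlb.invlpg(0x403ABC)
fresh = tlb.translate(0x403ABC)   # re-walks and sees the new mapping

print(hex(first), hex(stale), hex(fresh))  # 0x5abc 0x5abc 0x9abc
```

So the usual rule of thumb is: after changing or removing a mapping that may already have been used, invalidate that page (or flush the whole TLB) before relying on the new mapping.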

Source: http://wiki.osdev.org/Paging

Marlowe answered 20/5, 2012 at 6:2 Comment(16)
excellent summary, thanks! That CR3 register provides the critical hardware support needed to get started on the translation and allows more flexibility to the programmer. However on translating the virtual address I think it works as such: top 10 bits, page directory entry, OK (tells you which page table). But then next 10 bits would tell you which entry in that particular page table you're looking at. This gives you a page table entry (PTE) where the top 20 bits are the physical page number; then take the original vaddr's offset (bottom 12 bits, for 2^12=4K pages) and voila you're done.Branca
(Because I ran out of text) - Maybe we're saying the same thing here. The perspective I came from is as such: eecs.harvard.edu/~mdw/course/cs161/sp07/notes/paging.pdf - See slide 26Branca
@YoungMoney: Yeah it seems like we're saying the same thing here... but in any case I think what you said is correct. :)Marlowe
OK and I had one more question - this multi-level page table would be for each process, yes? Does that mean on every context switch we need to re-point the CR3 register?Branca
@YoungMoney: It depends on the particular OS. The part of the virtual address space that is process-specific would indeed need to change, but the part that is global doesn't need to. (In Windows, the kernel is mapped to the same location in every process, so its page table/directory entries don't need to change.) I don't know if CR3 is actually modified, specifically (it's an implementation detail IMO), but the bottom line is that only the part of the page tables that is different needs to change.Marlowe
Sorry for being a little late to the party! Just wanted to know: with multi-level (or hierarchical) page tables, the page tables themselves can be paged, right? So during a context switch, the scheduler (or whichever entity controls process creation and switching to it) loads CR3 with the `virtual address' of the page directory. So what happens if that VA itself is NOT mapped? Will it recurse? If yes, where does it end? Or when you said they must be "mapped to themselves", is it this problem that can be avoided by that?Beg
@HighOnMeat: CR3 is the physical address of the page directory; therefore its target (which is only one page of memory -- not a lot) must always be present and cannot "page fault". The page directory entry then again contains the physical address of the page table. However, it has a Present bit that can be false, and in that case you will get a page fault during translation and hence you can handle it by loading the page table into memory, setting the Present bit to true, and retrying translation. Same deal happens at the next layer with the page table entry. Does that answer your question?Marlowe
It cleared some of my doubts. I've got some more questions here though: 1.You say that CR3 contains physical address of the PGD(page directory) and "cannot page fault".So are you implying that whenever a new process is forked, an empty physical page frame is locked down in RAM for the PGD? If yes could you please pass me a reference? 2.You say that even PGD entries contain either invalid data or the physical address of page table. In case the entry is invalid, a physical page frame is allocated and corresponding PGD entry updated by MMU? Questions continued in next commentBeg
3. Is every entry in every table level, be it PGD or PMD(for more layers in case of say x86-64) actually a physical address to the next level table? If yes then why do we need the valid bit for? Does it indicate that the physical page frame at that address actually contains stuff that we (this process) needs?Beg
@HighOnMeat: (1) I don't have a reference (no time to find one sorry) but I'm not quite sure what you mean by "an empty physical page frame (what's a page frame?) is locked down". You could just as well modify the original page directory, if that's what you mean. Only the active one needs to actually be in memory, obviously -- the inactive ones can be paged out. (2) The entries are updated by the OS, not the MMU... I feel like one of us is very confused. (3) Yes, and if by "valid" you mean "present", I think the answer you're looking for is "large page support" (e.g. 4MB pages).. look that up.Marlowe
@Mehrdad: (1) When you say the active one needs to actually be in memory, do you mean that the active PGD is always present in memory, so the only thing that the OS needs to do (after a context switch) is pop the previous CR3 off the stack and voila!!? (2) Yes, my bad, what I meant is the MMU walks the page table (in architectures supporting a hardware page walk) and whenever it encounters an invalid entry, it "page faults" into the OS, which then brings the page into memory, updates the entry and restarts the translation. Correct? (3) Doesn't "valid" mean present for 4KB pages as well?Beg
@HighOnMeat: (1) I don't know what "stack" you're talking about... I'm saying the page directory that CR3 points to must be in memory and cannot be paged out. (2) Yup. (3) Yeah and I'm still confused what you're trying to say.Marlowe
@Mehrdad: (1) What I'm asking is, how does the OS ensure that this page directory that CR3 points to does NOT get paged out? Regarding (3), as I understand, a valid PTE (page table entry) means the virtual page address corresponding to the PTE is backed by a physical page in RAM. But by your answer above you implied (or rather I understood it as implying) that this was only true for "large pages" aka >4KB. So all I am asking is: does "valid" for a PTE corresponding to 4KB pages have a different meaning than this? I hope I've been clear this time around.Beg
@HighOnMeat: The OS is the entity responsible for paging out pages. So if it doesn't want to page something out, then it simply won't. As for the second part, I think I didn't address your question well. What I was trying to say was that there is a hierarchy of page tables here: every table entry either points to another table or points to the actual data page (distinguished by a single bit). The sooner you reach an actual page, the larger it is; the later you reach it, the smaller (more fine-grained) it is. Valid simply means whatever you point to is valid, whether data or another table.Marlowe
Does anyone know what the flags are? I can see (I think) that the entries in CR3 and the tables are pointers to pages (not to bytes) and are therefore 32-12=20 bits (in 32-bit mode), leaving 12 bits for flags, of which one was mentioned, the valid bit. Is this correct? And what are the other flags?Enface
@ctrl-alt-delor: Sorry for the late reply, but you can check what the other bits mean in Intel's Software Developer's Manual. There are a few tables that begin with "Use of CR3...". For example this one: Use of CR3 with 32-Bit Paging.Marlowe
