Why in x86-64 are virtual addresses 4 bits shorter than physical addresses (48 bits vs. 52)?

In the book "Low-Level Programming: C, Assembly, and Program Execution on Intel® 64 Architecture" I read:

Each virtual 64-bit address (e.g., ones we are using in our programs) consists of several fields. The address itself is in fact only 48 bits wide; it is sign-extended to a 64-bit canonical address. Its characteristic is that its 17 left bits are equal. If the condition is not satisfied, the address gets rejected immediately when used. Then 48 bits of virtual address are transformed into 52 bits of physical address with the help of special tables.

Why is there a difference of 4 bits between the virtual address and the physical address?
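(To make the canonicity rule concrete, here is a minimal C sketch of the check described in the quote; my own illustration, not code from the book:)

    #include <stdbool.h>
    #include <stdint.h>

    /* A 64-bit value is a canonical 48-bit address when its top 17 bits
     * (bits 63..47) are all equal, i.e. copies of bit 47. */
    static bool is_canonical(uint64_t addr)
    {
        uint64_t top = addr >> 47;           /* the 17 high bits */
        return top == 0 || top == 0x1FFFF;   /* all zeros or all ones */
    }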

Antilog asked 1/10, 2017 at 4:17 Comment(4)
Counterquestion: Why should virtual and physical addresses have the same size? The 8-bit computers of the 1980s using more than 48k of memory also used "memory banking", which more or less means that there were more physical address bits than virtual ones.Marzipan
@MartinRosenau I'm sorry you think my question implies that I think virtual and physical addresses should have the same size. My intention was just to ask why there is a difference in this particular case. I was looking for something like what you wrote in your comment, but related to "the modern PC" and 64-bit addressing.Antilog
Fun fact: If you want to use the high 16 for tagged pointers, you could shl rax,16 / sar rax,16 before using it, to redo the sign extension. (Or better, have your program only allocate tagged pointers in the low half of the canonical range, so you can just use and or BMI2 andn to make addresses canonical.) Or even better, allocate only in the low 4G of virtual address space, so you can use address-size (0x67) prefixes to ignore high garbage, or use 32-bit operand size when manipulating pointers to zero-extend them for free.Febrifacient
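(A rough C rendering of that shl/sar tagging trick, assuming the current 48-bit canonical form; the helper names are made up for illustration:)

    #include <stdint.h>

    /* Stash a 16-bit tag in the currently-unused top bits of a pointer,
     * then strip it and redo the sign extension before dereferencing,
     * the same effect as the shl/sar-by-16 pair mentioned above. */
    static uint64_t tag_ptr(const void *p, uint16_t tag)
    {
        return ((uint64_t)(uintptr_t)p & 0x0000FFFFFFFFFFFFULL)
               | ((uint64_t)tag << 48);
    }

    static void *untag_ptr(uint64_t tagged)
    {
        uint64_t low48 = tagged & 0x0000FFFFFFFFFFFFULL;
        if (low48 & (1ULL << 47))              /* re-extend bit 47 */
            low48 |= 0xFFFF000000000000ULL;
        return (void *)(uintptr_t)low48;
    }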
I guess that if/when hardware support for wider virtual addresses happens, there might be a mmap(MAP_48BIT) flag equivalent to the current mmap(MAP_32BIT) so programs that want to use the high 16 for their own purposes can keep doing so. Using only the high byte might be safer for longer, since extending virtual far beyond physical is less likely, even with memory-mapped non-volatile storage becoming a thing. (e.g. faster-than-flash on DIMMs.)Febrifacient

I believe you are talking about x86-64; my answer is based on that architecture.


When operating in 64-bit mode the CPU translates virtual addresses into physical addresses using a revamped version of a feature known as PAE - Physical Address Extension.
Originally invented to break the 4GiB limit while still using 32-bit pointers, this feature involves the use of 4 levels of tables.
Each table gives a pointer to the next table, down to the rightmost one that gives the upper bits of the physical address. To get an idea look at this picture from the AMD64 Architecture Programmer's Manual:

4-Level paging, PAE, in long mode

The rationale behind all those tables is sparsity: the metadata for translating virtual addresses into physical addresses is huge - if we were to use 4KiB pages only, we'd need 2^(64 - 12) = 2^52 entries to cover the whole 64-bit address space.
Tables allow for a sparse approach: only the entries that are actually needed are populated in memory.
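(Back-of-the-envelope: with 8-byte entries, a flat table of 2^52 entries would occupy 2^55 bytes = 32 PiB of translation metadata per address space, which is why a sparse, multi-level structure is needed.)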

This design is reflected in how the virtual address is divided (and thus, indirectly, in the number of levels): runs of 9 bits are used to index the tables at each level.
Starting from bit 12 (inclusive), this gives: level 1 -> bits 12-20, level 2 -> bits 21-29, level 3 -> bits 30-38, level 4 -> bits 39-47.
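(As a rough illustration, not taken from any manual, here is how a canonical virtual address splits into those fields in C; the example address is made up:)

    #include <stdint.h>
    #include <stdio.h>

    /* Decompose a 48-bit virtual address into the four 9-bit table
     * indexes plus the 12-bit page offset (4KiB pages, 4-level paging). */
    int main(void)
    {
        uint64_t vaddr = 0x00007f1234567890ULL;    /* made-up user-space address */

        unsigned offset = vaddr         & 0xFFF;   /* bits 0-11           */
        unsigned l1     = (vaddr >> 12) & 0x1FF;   /* bits 12-20, level 1 */
        unsigned l2     = (vaddr >> 21) & 0x1FF;   /* bits 21-29, level 2 */
        unsigned l3     = (vaddr >> 30) & 0x1FF;   /* bits 30-38, level 3 */
        unsigned l4     = (vaddr >> 39) & 0x1FF;   /* bits 39-47, level 4 */

        printf("level4=%u level3=%u level2=%u level1=%u offset=0x%X\n",
               l4, l3, l2, l1, offset);
        return 0;
    }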

This explains the current implementation limit of only 48 bits of virtual address space.
Note that at the instruction level, where logical addresses are used, we have full support for 64-bit addresses.
Full support is also available at the segmentation level, the part that translates logical addresses into linear addresses.
So the limitation comes from PAE.

My personal opinion is that AMD rushed to be the first to ship an x86 CPU with 64-bit support and reused PAE, patching it with a new level of indirection to translate up to 48 bits.
Note that both Intel and AMD allow a future implementation to use 64 bits for the virtual address (probably with more tables).

However, both companies set a hard limit of 52 bits for the physical address. Why?

The answer can still be found in how paging works.
In 32-bit mode, each entry in each table is 32 bits wide; the low bits are used as flags (since the alignment requirements make them useless for the translation process) but the higher bits are all used for the translation, giving a 32/32 virtual/physical translation.
It's important to stress that all 32 bits are used; some of the lower bits, while not used as flags, are marked by Intel as "Ignored" or "Available", meaning that the OS is free to use them.

When Intel introduced PAE, they needed 4 more bits (PAE was 36 bits back then) and the logical thing to do was to double the size of each entry, since this creates a more efficient layout than, say, a 40-bit table entry.
This gave Intel a lot of spare space, which they marked as reserved (this can be seen more clearly in older versions of the Intel SDM, like this one).

With time, new attributes were needed in an entry, the most famous one being the XD/NX bit.
Protection keys are also a relatively new feature that takes space in an entry. This shows that a full 64/64-bit virtual/physical translation is no longer possible with the current ISA.

For a visual reference, here is the format of the 64-bit PAE table entries:

Intel 64-bit PAE table entries

It shows that a 64-bit physical address is not possible (for huge pages there is still a way to fix this, but given the layout of the bits that seems unlikely), but it doesn't explain why AMD set the limit to 52 bits.
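(To see where the 52-bit figure falls out of the entry format, here is a rough C sketch of extracting the fields of a 64-bit entry for a 4KiB page; the bit positions follow the figure above, but treat it as an illustration, not a complete decoder:)

    #include <stdbool.h>
    #include <stdint.h>

    /* The physical frame number occupies at most bits 12-51, which is why
     * the architectural physical-address limit is 52 bits. Bit 63 is XD/NX
     * and bits 59-62 hold the protection key on CPUs that support it. */
    static bool     pte_present(uint64_t pte)  { return pte & 1; }
    static bool     pte_no_exec(uint64_t pte)  { return pte >> 63; }
    static unsigned pte_prot_key(uint64_t pte) { return (pte >> 59) & 0xF; }

    static uint64_t pte_phys_addr(uint64_t pte)
    {
        /* keep bits 12-51: drop the low flag bits and the high attribute bits */
        return pte & 0x000FFFFFFFFFF000ULL;
    }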

Well, it's hard to say.
Certainly, the size of the physical address space has some hardware cost associated with it: more pins (though with the integrated memory controller, this is mitigated as the DDR specs multiplex a lot of signals) and more space in the caches/TLBs.
In this question (similar, but not enough to make this a duplicate) an answer cites Wikipedia, which in turn allegedly cites AMD, claiming that AMD's engineers set the limit to 52 bits after due consideration of the benefits and costs.

I share what Hans Passant wrote more than 6 years ago: the current paging mechanisms are not suitable for full 64-bit physical addressing, and that's probably the reason why both Intel and AMD never bothered keeping the high bits in each entry reserved.

Both companies know that by the time technology approaches the 52-bit limit it will also be very different from its current form.
By then they will have designed a better mechanism for memory in general, so they avoided over-engineering the existing one.

Thermel answered 1/10, 2017 at 9:9 Comment(13)
Thanks a lot for your reply! It's amazing!! You said: "only runs of 9 bits are used to index the tables at each level" and then later kind of describe the virtual address components. But the author said that 12 bits (adding up to 48 bits) are used for each table index instead of 9 bits. I'm just saying it in case something good comes from this observation. The other bits are one sign bit and another 17 bits (to add up to 64 bits) that have to be equal for the address not to be discarded, as is said in my book quote. The author describes the architecture as "Intel 64 architecture: also known as x86_64 and AMD64"Antilog
@Margaret: Hans was only saying that 4k pages are too small. If huge memory spaces for non-volatile storage start becoming a thing, I suspect that TLBs will start getting more entries for 1G hugepages (current = 4x 1G entries fully assoc in Skylake), and OSes will let user-space map non-volatile storage with 1G hugepages. I'd guess that most database processes will want one or two huge contiguous mappings, and a 2-level page table (the effective depth with a 1G hugepage instead of a PDPTE) is fine for that, right? As I understand it, having more levels mostly helps when mappings are sparse.Febrifacient
Even 2MB pages aren't terrible; Hans even suggested that 4M might be ok. (That's the x86-32 hugepage size.) And BTW, only serious high-performance software like databases will want to map non-volatile-storage DIMMs into its own virtual address space for the equivalent of direct-IO. Everything else will go through the filesystem. Or if it's fast enough (or DRAM is limited / non-existent), the OS could satisfy mmap(PROT_READ|PROT_EXEC) requests by mapping the non-volatile storage directly with 1G/2M/4k pages. It would be a perf win to limit your mapping to a 2M-aligned multiple of 2M.Febrifacient
@PeterCordes, Yes, indeed the 4KiB one is the problematic one. Huge pages are definitely the only way to go. It's hard to say what the future will be when we hit the 2^52 limit; sizes like GiB may even be considered small. Personally, I believe that software-walked tables with a directly accessible TLB would be a better approach (like it happens in some MIPS implementations, IIRC)Thermel
@gsi-frank Maybe the author was referring to the 32-bit paging? That uses 12-bit indexes.Thermel
@MargaretBloom: Hmm, maybe a hybrid approach would be possible (with SW page walks for regions selected by a bit in their PML4E). Unless we get rid of 4k/2M pages altogether, hardware speculative page walks happening in parallel with other work are too valuable. Current x86 implementations have dedicated page-walk hardware; it's not a microcode-assisted thing that takes over the pipeline, so it can happen as part of next-page prefetch while looping over big arrays. See stackoverflow.com/questions/32256250/….Febrifacient
Any kind of context-switch is vastly more expensive than HW page walk. Maybe there could be a way around with a kernel-supplied function pointer that executes in a special mode that can only read memory and update a TLB. (It can't use user-space memory for anything, otherwise another thread could attack it and lead to priv. escalation. Denying memory write entirely might make it plausible to let out-of-order execution mix this in with user-space uops)Febrifacient
@PeterCordes Good point about speculative and current HW page walk. Teasing: What if in the future we'll have so much memory that we could afford to assign a process a few TiB or PiB unconditionally without worrying whether it's effectively used or not? An enlarged TLB combined with very huge pages (talking about TiB or more) would allow the OS to allocate all the memory of a process at creation time. Pages would rarely be swapped out. Just teasing of course :)Thermel
@MargaretBloom: Yeah, if the hardware had mode where 4k pages didn't exist, the penalty for 4k-split loads/stores might disappear (and only happen when splitting across two of the smallest possible pages). Currently, a 4k-split has the same penalty on Skylake (and presumably everything else) even if it's fully contained in a 2M hugepage vs. split across a 2M boundary or the boundary between two actual 4k pages. (With TLB and L1D hot for all cases.) i.e. HW detection of a 4k-split happens without waiting for a TLB check to finish.Febrifacient
@MargaretBloom The author describes the architecture as "Intel 64 architecture: also known as x86_64 and AMD64".Antilog
@MargaretBloom My bad, I misinterpreted it. It is 9 bits as you said: 4*9 bits (each index) + 12 bits (offset) = 48 bits.Antilog
Update: Intel has published a 5-level page table extension en.wikipedia.org/wiki/Intel_5-level_paging that extends virtual addresses to 57 bits, but the last 4 levels are the same (so phys addr width = 52 still). Sometimes called PML5: software.intel.com/sites/default/files/managed/2b/80/… has full details. CPUs that support it will also support standard PML4 tables; you have to set a bit in CR4 to enable PML5 mode (before entering long mode).Febrifacient
AMD rushed to be the first to ship an x86 CPU with 64-bit - reusing PAE is also yet another case of AMD enabling hardware similarity between modes, e.g. the page-walker hardware only needs to handle legacy vs. PAE, not also a 3rd format, to fully / efficiently support legacy mode and long mode. And also rushing in the sense of trying to make things as similar as possible for OSes, making it easier for existing OSes to use existing code that knows about that page-table format. (And in other design choices, for compilers to use similar code-gen, all insns work the same way, even ugly ones)Febrifacient

The earlier answer says

Certainly, the size of the physical address space has some hardware cost associated with it: more pins [...] and more space in the caches/TLBs.

which suggests a misconception in the author's mind: that x86-64 CPUs actually have enough pins to address 2^52 bytes of RAM.

In reality, no CPU ever released has had close to that much physical address space. The sockets don't support it, and (therefore) they don't need bits in the cache or TLB for it either.

The only sense in which the address space is 52 bits is that some of the bits in the page table entries are marked as reserved (meaning the OS must set them to 0) while others are marked as ignored (meaning the OS can use them for its own purposes). There are just enough reserved bits to extend the physical address space to 2^52 bytes in the future—though they could also be assigned other roles, in principle.

The tradeoff of assigning bits as reserved/ignored is:

  • Fewer ignored bits means that OSes can store less information there, which might make them slower in the present day.

  • Fewer reserved bits means that the page table entry format might have to be changed yet again when the limit of physical address space is hit, years down the line.

32-bit x86 CPUs for years had a 36-bit physical address space, so it is possible to have a physical address space larger than the virtual, but it is awkward at the OS level. I don't believe there are any plans to release an x86-64 CPU with a physical address space larger than the virtual. Intel recently introduced 5-level paging, which increases the virtual address space to 2^57 bytes. Their white paper says, of the physical address size of Intel processors as returned by CPUID with EAX=80000008h:

Processors that support Intel 64 architecture have enumerated at most 46 for this value. Processors that support 5-level paging are expected to enumerate higher values, up to 52.

I gather from this that they have no plans at the moment to change the page table format to support more than 2^52 bytes of RAM, and they also have no plans to support a physical address space larger than 1/4 of the virtual address space. The latter makes sense because only half of the virtual address space is intended for the kernel, and having that be entirely filled with RAM would probably be inconvenient.
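(A quick way to see the gap between the architectural ceiling and what a given CPU actually implements is to read that CPUID leaf yourself; a small sketch using the GCC/Clang cpuid.h wrapper, x86 only:)

    #include <stdio.h>
    #include <cpuid.h>

    /* Query CPUID leaf 0x80000008: EAX[7:0] is the implemented physical
     * address width, EAX[15:8] the linear (virtual) address width.
     * Typical current parts report 39-48 physical bits and 48 or 57
     * linear bits, well below the 52-bit architectural ceiling. */
    int main(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 0x80000008 not supported");
            return 1;
        }
        printf("physical address bits: %u\n", eax & 0xFF);
        printf("linear address bits:   %u\n", (eax >> 8) & 0xFF);
        return 0;
    }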

AMD's architecture manual, volume 2, rev. 3.38 (November 2021) says

[T]he page-translation mechanism can be extended to support 52-bit physical addresses. [...] Currently, the AMD64 architecture supports 40-bit addresses in this mode, allowing up to 1 terabyte of physical-address space to be supported.

AMD doesn't appear to have 5-level paging yet.

Monitorial answered 25/7, 2022 at 19:49 Comment(1)
Indeed, I don't expect it would be popular with OS developers (notably Linus Torvalds who has ranted about how much PAE sucks) to return to an era of kernels not having enough virtual address space to map all of physical RAM. Why 4-level paging can only cover 64 TiB of physical address - Linux wants 4x as much virtual address space as DRAM, and doesn't support HIGHMEM for 64-bit mode.Febrifacient
