Do x86 instructions require their own encoding as well as all of their arguments to be present in memory at the same time?
I am trying to figure out whether it is possible to run a Linux VM whose RAM is only backed by a single physical page.

To simulate this, I modified the nested page fault handler in KVM to remove the present bit from all nested page table (NPT) entries, except the one corresponding to the currently processed page fault.

While trying to start a Linux guest, I observed that assembly instructions that use memory operands, like

add [rbp+0x820DDA], ebp

lead to a page fault loop until I restore the present bit for the page containing the instruction as well as for the page referenced in the operand (in this example [rbp+0x820DDA]).

I am wondering why this is the case. Shouldn't the CPU access the memory pages sequentially, i.e. first read the instruction and then access the memory operand? Or does x86 require that the instruction page as well as all operand pages are accessible at the same time?

I am testing on AMD Zen 1.

Substratosphere answered 1/4, 2020 at 10:25 Comment(4)
Why would you want to do this? – Amused
Just out of technical interest :) – Substratosphere
This is insane on the level of "boot Linux on a 486 emulator running in JavaScript in the browser". I love it. – Etom
Heh, apparently I took this question to the same logical conclusion you were already thinking, about the minimum working set for guaranteed forward progress. I had already answered that before you added that new first paragraph to the question. :P I added some links and more details in a few spots (e.g. the page-walker is allowed to cache some guest page-directory entries internally) since this question is getting way more attention than I expected thanks to somehow making it to HNQ. – Cordero

Yes, they do require the machine code and all memory operands.

Shouldn't the CPU access the memory pages sequentially, i.e. first read the instruction and then access the memory operand?

Yes that's logically what happens, but a page-fault exception interrupts that 2-step process and discards any progress. The CPU doesn't have any way to remember what instruction it was in the middle of when a page-fault occurred.

When a page-fault handler returns after handling a valid page fault, RIP = the address of the faulting instruction, so the CPU retries executing it from scratch.

It would be legal for the OS to modify the machine code of the faulting instruction and expect it to execute a different instruction after iret from the page-fault handler (or any other exception or interrupt handler). So AFAIK it's architecturally required that the CPU redoes code-fetch from CS:RIP in the case you're talking about. (Assuming it even does return to the faulting CS:RIP instead of scheduling another process while waiting for disk on hard page fault, or delivering a SIGSEGV to a signal handler on an invalid page fault.)
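That re-fetch requirement can be illustrated with a small toy model in Python (all names and the fault mechanism here are invented for illustration, not real CPU or KVM interfaces): the "CPU" keeps no decode state across a fault, so a handler that patches the faulting bytes changes what ultimately executes on retry.

```python
class PageFault(Exception):
    """Simulated #PF: aborts the whole fetch/execute step."""
    pass

code = bytearray(b"\x90")    # one hypothetical instruction: NOP
code_page_present = [False]  # the code page starts out not-present

def fetch_execute(rip):
    # No decode progress survives a fault; fetch restarts from CS:RIP.
    if not code_page_present[0]:
        raise PageFault
    return code[rip]

def pf_handler():
    # The OS is allowed to patch the faulting instruction before iret;
    # the retried fetch must see the new bytes.
    code[0] = 0xCC           # replace the NOP with INT3
    code_page_present[0] = True

while True:
    try:
        opcode = fetch_execute(0)
        break
    except PageFault:
        pf_handler()

print(hex(opcode))  # 0xcc: the patched byte executes, not the original 0x90
```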

It's probably also architecturally required for hypervisor entry/exit. And even if it's not explicitly forbidden on paper, it's not how CPUs work.

@torek comments that some (CISC) microprocessors partially decode instructions and dump micro-register state on a page fault, but x86 is not like that. m68k does do that, for example, letting it page-fault in the middle of a memory-indirect store after some but not all bytes of a misaligned store have made it to memory, overwriting the pointer the memory-indirect addressing mode used. (x86 doesn't have memory-indirect addressing, and faults are taken at instruction boundaries.)


A few instructions are interruptible and can make partial progress, like rep movs (memcpy in a can) and other string instructions, or gather loads/scatter stores. But the only mechanism is updating architectural registers like RCX / RSI / RDI for string ops, or the destination and mask registers for gathers / scatters (such as AVX2 vpgatherdd). Not keeping the opcode / decode results in some hidden internal register and restarting it after iret from a page fault handler. These are instructions that do multiple separate data accesses.

The semantics for those instructions updating the architectural register state when a fault happens are such that re-executing from scratch with the new reg values produces the same final result as if it had executed the whole way through without faulting.
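A sketch of that "restartable" register-update contract as a Python model (the function name, register dict, and page-presence set are all invented for illustration): the instruction commits its progress to the architectural registers as it goes, so re-executing from scratch after the fault is handled finishes the copy correctly.

```python
class PageFault(Exception):
    pass

PAGE = 4096

def rep_movsb(mem, regs, present):
    """Toy model of rep movsb: copy RCX bytes from [RSI] to [RDI],
    updating the architectural registers after every byte."""
    while regs["rcx"] > 0:
        if regs["rsi"] // PAGE not in present or regs["rdi"] // PAGE not in present:
            raise PageFault  # regs already reflect the bytes copied so far
        mem[regs["rdi"]] = mem[regs["rsi"]]
        regs["rsi"] += 1
        regs["rdi"] += 1
        regs["rcx"] -= 1

mem = {0x1000 + i: i for i in range(8)}      # 8-byte source on page 1
regs = {"rsi": 0x1000, "rdi": 0x2FFC, "rcx": 8}  # destination splits pages 2/3
present = {1, 2}                              # page 3 (0x3000) not yet resident

try:
    rep_movsb(mem, regs, present)
except PageFault:
    present.add(3)                # "handler" maps the missing page
    rep_movsb(mem, regs, present) # re-execute from scratch with updated regs

assert [mem[0x2FFC + i] for i in range(8)] == list(range(8))
```

The first run copies four bytes, faults with RSI/RDI/RCX already advanced, and the retry only does the remaining four — the same final result as an uninterrupted execution.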

Also keep in mind that x86 (like most ISAs) guarantees that instructions are atomic wrt. interrupts and also exceptions: they either fully happen, or don't happen at all, before an interrupt (see also: Interrupting an assembly instruction while it is operating). So for example add [mem], reg would be required to discard the load if the store part faulted, even without a lock prefix.

(ISAs like m68k that save/restore microarchitectural state on the stack for partial progress still only take external interrupts at instruction boundaries, but not exceptions like page faults. Apparently the worst-case number of pages a single m68k instruction can touch is 16, but they don't all have to be resident at once, so it's a bit closer to an x86 gather or rep movs where partial progress gets saved.)


Minimum number of resident pages to ensure forward progress

The worst case number of guest user-space pages present to make forward progress might be 6 (plus separate guest-kernel page-table subtrees for each one):

  • movsq or movsw 2-byte instruction spanning a page boundary, so both pages are needed for it to decode.
  • qword source operand [rsi] also a page-split
  • qword destination operand [rdi] also a page-split

If any of these 6 pages fault, we're back to square one.
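The 6-page count above can be checked with a short Python sketch (the addresses are made up; the page size is the usual 4 KiB):

```python
PAGE = 4096

def pages(addr, length):
    """Set of page numbers touched by an access of `length` bytes at `addr`."""
    return set(range(addr // PAGE, (addr + length - 1) // PAGE + 1))

instr = 0x7FFF   # 2-byte movsq whose bytes split a page boundary
src   = 0x1FFC   # qword [rsi], also a page-split
dst   = 0x3FFC   # qword [rdi], also a page-split

touched = pages(instr, 2) | pages(src, 8) | pages(dst, 8)
print(len(touched))  # 6: all six pages must be resident at once for progress
```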

rep movsd is also a 2-byte instruction, and making progress on one step of it would have the same requirement. Similar cases like push [mem] or pop [mem] could be constructed with a misaligned stack.

One reason for (or side benefit of) making gather loads / scatter stores "interruptible" (updating the mask vector with their progress) is to avoid increasing this minimum footprint for executing a single instruction. It also improves the efficiency of handling multiple faults during one gather or scatter.
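A toy model of that mask-based progress for a gather (hypothetical and heavily simplified; real vpgatherdd operates on vector registers, and the page-presence set here stands in for the MMU): each completed element clears its mask bit, so a retry after a fault only re-gathers the elements still masked on.

```python
class PageFault(Exception):
    pass

PAGE = 4096

def gather_dd(mem, base, indices, mask, dest, present):
    """Per-element gather of 32-bit values: completed elements clear their
    mask bit, so partial progress survives a fault."""
    for i, idx in enumerate(indices):
        if not mask[i]:
            continue              # already gathered on a previous attempt
        addr = base + 4 * idx
        if addr // PAGE not in present:
            raise PageFault       # dest/mask already hold partial progress
        dest[i] = mem[addr]
        mask[i] = 0

mem = {0x1000: 11, 0x1004: 22, 0x5000: 33}
indices = [0, 1, 0x1000]          # third element lands on page 5 (0x5000)
mask, dest = [1, 1, 1], [0, 0, 0]
present = {1}                     # page 5 not resident yet

try:
    gather_dd(mem, 0x1000, indices, mask, dest, present)
except PageFault:
    present.add(5)                # fault handler maps the missing page
    gather_dd(mem, 0x1000, indices, mask, dest, present)

print(dest, mask)  # [11, 22, 33] [0, 0, 0]
```

Only one page ever needs to become resident per retry, instead of all the pages touched by the whole gather at once.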

As pointed out in comments, the iret in the guest kernel's page-fault handler presumably also has to be present, but if we're talking about nested page tables and the hypervisor playing tricks, it could swap out a guest-kernel page in favour of the 6 guest pages needed for rep movsd itself to make progress. Those page-faults are to the hypervisor, not to the guest kernel that thinks it has all the pages resident plus its own kernel memory.


@Brandon points out in comments that a guest will need its page tables in memory, and the user-space page splits can also be 1GiB splits so the two sides are in different sub-trees of the top level PML4. HW page walk will need to touch all of these guest page-table pages to make progress. A situation this pathological is unlikely to happen by chance.

The TLB (and page-walker internals) are allowed to cache some of the page-table data, and aren't required to restart page-walk from scratch unless the OS did invlpg or set a new CR3 top-level page directory. Neither of these are necessary when changing a page from not-present to present; x86 on paper guarantees that it's not needed (so "negative caching" of not-present PTEs isn't allowed, at least not visible to software). So the CPU might not VMexit even if some of the guest-physical page-table pages are not actually present.
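The "no negative caching" rule can be modeled as a TLB that is only allowed to cache present translations (a toy Python model; the dict-based page table and PTE fields are invented for illustration):

```python
class PageFault(Exception):
    pass

tlb = {}  # virtual page number -> physical frame number, present entries only

def translate(page_table, vpn):
    """Only present translations may be cached; a not-present PTE must be
    re-walked on every access, so flipping it to present takes effect
    without any invlpg or CR3 reload."""
    if vpn in tlb:
        return tlb[vpn]
    pte = page_table.get(vpn)
    if pte is None or not pte["present"]:
        raise PageFault  # not-present results are never cached
    tlb[vpn] = pte["pfn"]
    return pte["pfn"]

page_table = {7: {"present": False, "pfn": 42}}

try:
    translate(page_table, 7)
except PageFault:
    page_table[7]["present"] = True  # OS marks the page present...

print(translate(page_table, 7))  # ...and the very next walk succeeds: 42
```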

PMU performance counters can be enabled and configured such that executing the instruction also requires a perf event to write a sample into a PEBS buffer. With a counter's mask configured to count only user-space instructions, not kernel, it could well keep trying to overflow the counter and store a sample in the buffer every time you return to user-space, producing a page fault.

Cordero answered 1/4, 2020 at 10:41 Comment(9)
Worst case for a single instruction might be something like push dword [foo] (or even just call [foo]) with everything misaligned across a "page directory pointer table boundary" (adding up to 6 pages, 6 page tables, 6 page directories, 6 PDPTs and one PML4); with the CPU's "precise event based sampling with PEBS buffer" feature enabled and configured so that the push causes performance monitoring data to be added to the PEBS buffer. For a conservative "minimum pages provided by host so guest can make progress in pathological cases" I'd want at least 16 pages. – Coypu
Note that this sort of thing has always been common in CISC-y architectures. Some microprocessors partially decode instructions and dump microregister state on a page fault, but others don't and/or require that address operands for "loop-y" instructions (DBRA on m68k, MOVC3/MOVC5 on VAX, etc.) be in registers, similar to your REP MOVS example. – Lith
@Brendan: someone counted a worst case on a VAX instruction as about 50 pages. I forget the details, but you'd obviously put the instruction itself on a page boundary, use something like the translate-table lookup with the table spanning a page boundary, use (rX)[rY] with the indirects at page boundaries, and so on. The hairiest instructions took up to 6 operands (loading them into r0-r5) and all six could be double indirects, I think. – Lith
The OS could change the instruction, but it can also change EIP. So there's a logical follow-up question: what's the minimum number of pages needed, assuming an intelligent instruction-patch scheme? E.g. copy the unaligned value to an aligned scratch buffer, emulate the instruction, and IRET to the next instruction. – Lieselotteliestal
The page containing the OS's iret instruction also needs to be in memory. This is a one-byte instruction, so one extra page. The page-fault handler's interrupt address also needs to be in memory, but that can be the same page as above. – Changchangaris
@StigHemmer: If there's no #PF exception, the hypervisor can have the guest kernel's pages not present. A successful page walk doesn't involve the kernel at all, just the page-table data structures. And if we're not talking about a guest under a hypervisor, then normally key parts of kernel memory aren't paged at all, i.e. are pinned. But sure, we could imagine a tiny kernel that fits its code into one 4k page, including the GDT, IDT (interrupt descriptor table) and #PF handler. – Cordero
I must admit I fell off the x86 train before the hypervisor arrived. I'll take your word for it. – Changchangaris
@StigHemmer: I think the question reduces to how many pages have to be present for an instruction to make progress without page-faulting, so not counting any (guest-)kernel stuff other than page-table data. You do raise a good point about the actual practical footprint of (guest-)physical pages including the kernel, though. Guest physical memory is virtualized by a hypervisor, with HW support from nested page tables. I don't usually worry about that; I run Linux on bare metal on my desktop because hypervisors don't expose HW perf counters well / at all. – Cordero
@StigHemmer: A 64-bit OS normally will use iretq (at least when returning to 64-bit user-space), which needs a REX.W prefix. Unfortunately(?) that's not the default operand-size. (This is relevant if we are talking about the non-hypervisor case and #PF to the OS, not to the hypervisor.) It's hypothetically possible a kernel will have one of its few iretq instructions split across a page boundary. – Cordero

© 2022 - 2024 — McMap. All rights reserved.