Observing stale instruction fetching on x86 with self-modifying code
I've been told and have read from Intel's manuals that it is possible to write instructions to memory, but the instruction prefetch queue has already fetched the stale instructions and will execute those old instructions. I have been unsuccessful in observing this behavior. My methodology is as follows.

The Intel Software Developer's Manual states in section 11.6 that:

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction.

It looks like, if I hope to execute stale instructions, I need two different linear addresses referring to the same physical page. So I memory-map a file to two different addresses.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <assert.h>

static uint8_t zeros[0x1000];   /* one page of zeroes to size the file */

int fd = open("code_area", O_RDWR | O_CREAT, S_IRWXU | S_IRWXG | S_IRWXO);
assert(fd >= 0);
write(fd, zeros, 0x1000);       /* grow the file to one page */
uint8_t *a1 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_FILE | MAP_SHARED, fd, 0);
uint8_t *a2 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_FILE | MAP_SHARED, fd, 0);
assert(a1 != a2);               /* two linear addresses, one physical page */

I have an assembly function that takes a single argument, a pointer to the instruction I want to change.

fun:
    push %rbp
    mov %rsp, %rbp

    xorq %rax, %rax # Return value 0

# A far jump simulated with a far return
# Push the current code segment %cs, then the address we want to far jump to

    xorq %rsi, %rsi
    mov %cs, %rsi
    pushq %rsi
    leaq copy(%rip), %r15
    pushq %r15
    lretq

copy:
# Overwrite the two nops below with `inc %eax'. We will notice the change if the
# return value is 1 instead of 0. The pointer passed in %rdi points to the same
# physical memory as fun_ins, but through a different linear address.
    movw $0xc0ff, (%rdi)

fun_ins:
    nop   # Two NOPs give enough space for the inc %eax (opcode FF C0)
    nop
    pop %rbp
    ret
fun_end:
    nop

In C, I copy the code to the memory mapped file. I invoke the function from linear address a1, but I pass a pointer to a2 as the target of the code modification.

extern char fun[], fun_ins[], fun_end[];     /* labels defined in the asm file */

#define DIFF(a, b) ((long)(b) - (long)(a))
long sz = DIFF(fun, fun_end);
memcpy(a1, fun, sz);                         /* copy the code into the shared mapping */
void *tochange = a2 + DIFF(fun, fun_ins);    /* fun_ins's offset, but in the a2 mapping */
int val = ((int (*)(void*))a1)(tochange);    /* execute via a1, modify via a2 */

If the CPU picked up the modified code, val==1. Otherwise, if the stale instructions were executed (two nops), val==0.

I've run this on a 1.7GHz Intel Core i5 (2011 MacBook Air) and an Intel(R) Xeon(R) CPU X3460 @ 2.80GHz. Every time, however, I see val==1, indicating the CPU always notices the new instruction.

Does anyone have experience with the behavior I want to observe? Is my reasoning correct? I'm a little confused about the manual mentioning P6 and Pentium processors, and about what the lack of any mention of my Core i5 means. Perhaps something else is going on that causes the CPU to flush its instruction prefetch queue? Any insight would be very helpful!

Gentianaceous answered 30/6, 2013 at 22:52 Comment(4)
What is the manual you used (check the "order number" on the first page and write it here)?Winwaloe
Also check "8.1.3 Handling Self- and Cross-Modifying Code" section of instruction manual - download.intel.com/products/processor/manual/325462.pdfWinwaloe
Hmm, try to unset PROT_EXEC from a2... This can affect some Intel AtomsWinwaloe
"Pentium" or P5 is a particular implemention, sold 1993-1996. P6 is another implementation and was marketed/sold as "PentiumPro", 1995-1999. You'll have a tough time finding either today.Mischance

I think you should check the MACHINE_CLEARS.SMC performance counter (part of the MACHINE_CLEARS event) of the CPU. It is available on Sandy Bridge, which is used in your Air powerbook, and also on your Xeon, which is Nehalem (search the event lists for "smc"). You can use oprofile, perf or Intel's VTune to find its value:

http://software.intel.com/sites/products/documentation/doclib/iss/2013/amplifier/lin/ug_docs/GUID-F0FD7660-58B5-4B5D-AA9A-E1AF21DDCA0E.htm

Machine Clears

Metric Description

Certain events require the entire pipeline to be cleared and restarted from just after the last retired instruction. This metric measures three such events: memory ordering violations, self-modifying code, and certain loads to illegal address ranges.

Possible Issues

A significant portion of execution time is spent handling machine clears. Examine the MACHINE_CLEARS events to determine the specific cause.

SMC: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/win_reference/snb/events/machine_clears.html

MACHINE_CLEARS, Event Code: 0xC3, SMC Mask: 0x04

Self-modifying code (SMC) detected.

Number of self-modifying-code machine clears detected.

Intel also says the following about SMC at http://software.intel.com/en-us/forums/topic/345561 (linked from the Intel Performance Bottleneck Analyzer's taxonomy):

This event fires when self-modifying code is detected. This can be typically used by folks who do binary editing to force it to take certain path (e.g. hackers). This event counts the number of times that a program writes to a code section. Self-modifying code causes a severe penalty in all Intel 64 and IA-32 processors. The modified cache line is written back to the L2 and LLC caches. Also, the instructions would need to be re-loaded hence causing performance penalty.

I think you will see some such events. If there are any, then the CPU was able to detect the act of self-modifying code and raised a "Machine Clear" - a full restart of the pipeline. The first stages are Fetch, and they will ask the L2 cache for the new opcode. I'm very interested in the exact count of SMC events per execution of your code - this will give us some estimate of the latencies... (SMC is counted in units where 1 unit is assumed to be 1.5 CPU cycles - B.6.2.6 of the Intel optimization manual.)
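On Linux, perf should be able to read this counter; these command lines are illustrative (the symbolic event name assumes a perf build that knows it for your microarchitecture; otherwise the raw event code and umask quoted above can be given directly):

perf stat -e machine_clears.smc ./a.out
perf stat -e cpu/event=0xc3,umask=0x04/ ./a.out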

We can see that Intel says "restarted from just after the last retired instruction.", so I think the last retired instruction will be the mov, and your nops will already be in the pipeline. But the SMC machine clear will be raised at the mov's retirement, and it will kill everything in the pipeline, including the nops.

This SMC-induced pipeline restart is not cheap; Agner Fog has some measurements in Optimizing_assembly.pdf - "17.10 Self-modifying code (All processors)" (I think any Core2/Core iX behaves like PM here):

The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for P1, 31 for PMMX, and 150-300 for PPro, P2, P3, PM. The P4 will purge the entire trace cache after self-modifying code. The 80486 and earlier processors require a jump between the modifying and the modified code in order to flush the code cache. ...

Self-modifying code is not considered good programming practice. It should be used only if the gain in speed is substantial and the modified code is executed so many times that the advantage outweighs the penalties for using self-modifying code.

Using different linear addresses to defeat the SMC detector was recommended here: https://mcmap.net/q/14534/-how-is-x86-instruction-cache-synchronized - I'll try to find the actual Intel documentation... I can't actually answer your real question yet.

There may be some hints here: Optimization manual, 248966-026, April 2012 "3.6.9 Mixing Code and Data":

Placing writable data in the code segment might be impossible to distinguish from self-modifying code. Writable data in the code segment might suffer the same performance penalty as self-modifying code.

and the next section:

Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.

So, there is possibly some circuitry that tracks intersections between writable and executable subpages.
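To make those granularities concrete, here is a small helper (my own sketch; only the 1-KByte and 2-KByte numbers come from the quoted manual text) that flags a store/fetch pair falling into the risky subpage overlap:

#include <stdint.h>

/* Nonzero means a write at w risks the SMC condition against code
   executing/fetching at x, per the subpage rule quoted above. */
static int smc_subpage_conflict(const void *w, const void *x) {
    uintptr_t wa = (uintptr_t)w, xa = (uintptr_t)x;
    return (wa >> 10) == (xa >> 10)    /* write in the executing 1K subpage */
        || (wa >> 11) == (xa >> 11);   /* fetch in the written 2K subpage  */
}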

You can try to do the modification from another thread (cross-modifying code) -- but very careful thread synchronization and pipeline flushing are needed (you may want to include some brute-forced delays in the writer thread; a CPUID just after the synchronization is desirable). But you should know that they already fixed this using "nukes" - check patent US6857064.
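A minimal sketch of that cross-modifying setup (my own illustration, not a tuned reproducer: it uses a single RWX mapping, and the crude join-based synchronization stands in for the careful synchronization described above; build with gcc -pthread):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <sys/mman.h>

static uint8_t *code;                 /* one RWX page, for simplicity */

static void serialize(void) {         /* CPUID is a serializing instruction */
    uint32_t a = 0, b, c, d;
    __asm__ volatile("cpuid" : "+a"(a), "=b"(b), "=c"(c), "=d"(d) :: "memory");
    (void)b; (void)c; (void)d;
}

static void *writer(void *arg) {
    (void)arg;
    code[1] = 1;                      /* patch imm32 of mov eax,0 to mov eax,1 */
    serialize();                      /* writer-side serialization */
    return NULL;
}

int main(void) {
    code = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(code != MAP_FAILED);
    memcpy(code, "\xb8\x00\x00\x00\x00\xc3", 6);   /* mov eax,0 ; ret */

    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    pthread_join(t, NULL);            /* crude sync: join before executing */
    serialize();                      /* executor-side serialization, as above */

    printf("%d\n", ((int (*)(void))code)());   /* 1 => the patched code ran */
    return 0;
}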

I'm a little confused about the manual mentioning P6 and Pentium processors

This is possible if you had fetched, decoded and executed some stale version of Intel's instruction manual. You can reset the pipeline and check the current version: Order Number 325462-047US, June 2013, "11.6 SELF-MODIFYING CODE". This version still says nothing about newer CPUs, but it mentions that when you modify code using different virtual addresses, the behavior may not be compatible between microarchitectures (it may work on your Nehalem/Sandy Bridge and may not work on... Skymont).

11.6 SELF-MODIFYING CODE

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. The latter behavior means that programs that self-modify code can cause severe degradation of performance when run on the Pentium 4 and Intel Xeon processors.

In practice, the check on linear addresses should not create compatibility problems among IA-32 processors. Applications that include self-modifying code use the same linear address for modifying and fetching the instruction.

Systems software, such as a debugger, that might possibly modify an instruction using a different linear address than that used to fetch the instruction, will execute a serializing operation, such as a CPUID instruction, before the modified instruction is executed, which will automatically resynchronize the instruction cache and prefetch queue. (See Section 8.1.3, “Handling Self- and Cross-Modifying Code,” for more information about the use of self-modifying code.)

For Intel486 processors, a write to an instruction in the cache will modify it in both the cache and memory, but if the instruction was prefetched before the write, the old version of the instruction could be the one executed. To prevent the old instruction from being executed, flush the instruction prefetch unit by coding a jump instruction immediately after any write that modifies an instruction.

REAL update: I googled for "SMC Detection" (with quotes), and there are some details on how modern Core2/Core iX CPUs detect SMC, as well as many errata lists of Xeons and Pentiums hanging in the SMC detector:

1. http://www.google.com/patents/US6237088 - System and method for tracking in-flight instructions in a pipeline (2001)

2. DOI 10.1535/itj.1203.03 (google for it; there is a free version at citeseerx.ist.psu.edu) - the "INCLUSION FILTER" was added in Penryn to lower the number of false SMC detections; the "existing inclusion detection mechanism" is pictured in Fig. 9

3. http://www.google.com/patents/US6405307 - an older patent on SMC detection logic

According to patent US6237088 (FIG. 5, summary) there is a "line address buffer" (holding many linear addresses, one per fetched instruction -- in other words, a buffer full of fetched IPs with cache-line granularity). Every store, or more exactly the "store address" phase of every store, is fed into a parallel comparator to check whether the store intersects any of the currently executing instructions.

Neither patent clearly says whether physical or logical addresses are used in the SMC logic... The L1i in Sandy Bridge is VIPT (virtually indexed, physically tagged: virtual address for the index and physical address in the tag) according to http://nick-black.com/dankwiki/index.php/Sandy_Bridge, so we have the physical address by the time the L1 cache returns data. I think Intel may use physical addresses in the SMC detection logic.

Even more, http://www.google.com/patents/US6594734 @ 1999 (published 2003; just remember that the CPU design cycle is around 3-5 years) says in the "Summary" section that SMC detection is now in the TLB and uses physical addresses (in other words - please don't try to fool the SMC detector):

Self modifying code is detected using a translation lookaside buffer .. [which] has physical page addresses stored therein over which snoops can be performed using the physical memory address of a store into memory. ... To provide finer granularity than a page of addresses, FINE HIT bits are included with each entry in the cache associating information in the cache to portions of a page within memory.

(a portion of a page, referred to as a quadrant in patent US6594734, sounds like the 1-KByte subpages, doesn't it?)

Then they say:

Therefore snoops, triggered by store instructions into memory, can perform SMC detection by comparing the physical address of all instructions stored within the instruction cache with the address of all instructions stored within the associated page or pages of memory. If there is an address match, it indicates that a memory location was modified. In the case of an address match, indicating an SMC condition, the instruction cache and instruction pipeline are flushed by the retirement unit and new instructions are fetched from memory for storage into the instruction cache.

Because snoops for SMC detection are physical and the ITLB ordinarily accepts as an input a linear address to translate into a physical address, the ITLB is additionally formed as a content-addressable memory on the physical addresses and includes an additional input comparison port (referred to as a snoop port or reverse translation port)

-- So, to detect SMC, they force stores to forward the physical address back to the instruction buffer via a snoop (similar snoops are delivered from other cores/CPUs, or from DMA writes into our caches...). If the snoop's physical address conflicts with cache lines stored in the instruction buffer, the pipeline is restarted via the SMC signal delivered from the iTLB to the retirement unit. One can imagine how many CPU clocks are wasted in such a snoop loop from the dTLB via the iTLB to the retirement unit (it can't retire the next "nop" instruction, although it executed earlier than the mov and has no side effects). But WAT? The iTLB has a physical-address input and a second CAM (big and hot) just to support and defend against crazy, cheating self-modifying code.

PS: And what if we work with huge pages (4M, or maybe 1G)? The L1 TLB has huge-page entries, and there may be a lot of false SMC detections for a quarter of a 4-MB page...

PPS: There is also the possibility that the erroneous handling of SMC with different linear addresses was present only in the early P6/PPro/P2...

Winwaloe answered 30/6, 2013 at 22:52 Comment(3)
+1 for throwing "air powerbook" into such a precise discussion of Intel minutiae :)Endocardium
If you look further, you will see patents relating to SMC on which I am an inventor. AFAIK I invented the I$ and ITLB inclusion mechanisms of P6 to snoop "instructions in flight". // I consider these to be mistakes. I think that it would have been easier to create a fully associative CAM with the instruction blocks of all instructions in the pipeline, physical. With a bloom filter if you want to save power. // I think they were mistakes (a) because complicated and hard to get correct, even though they saved a lot of gates, and (b) glass jaws in performance.Informative
What if the code you modified was always, say, 1k bytes ahead of the current execution address? Would that avoid the problem?Bandoleer

I've been told and have read from Intel's manuals that it is possible to write instructions to memory, but the instruction prefetch queue has [may have] already fetched the stale instructions and will [may] execute those old instructions. I have been unsuccessful in observing this behavior.

Yes, you would be.

All or almost all modern Intel processors are stricter than the manual:

They snoop the pipeline based on physical address, not just linear.

Processor implementations are allowed to be stricter than the manuals.

They may choose to be so because they have encountered code that does not adhere to the rules in the manuals, that they do not want to break.

Or... because the easiest way to adhere to the architectural specification (which in the case of SMC used to be officially "up until the next serializing instruction" but in practice, for legacy code, was "up until the next taken branch that is more than ??? bytes away") might be to be stricter.

Informative answered 22/8, 2013 at 18:55 Comment(21)
Another interesting example of CPUs going above and beyond the requirements specified in the x86 ISA manual: Coherent page-walks for TLB entries that could only have been speculatively loaded, to avoid breaking Win9x. AMD dropped coherency with Bulldozer, since I guess they decided Win9x and other software that depended on this was no longer relevant.Motherland
I've been meaning to elaborate @PeterCordes' comment about "coherent" page table walking for TLB misses, but I will be quick: (1) the main reason Intel started running the page table walks through the cache, rather than bypassing the cache, was performance. Prior to P6 page table walks were slow, not benefitting from cache, and were non-speculative. Slow enough that software TLB miss handling was a performance win. P6 sped TLB misses up by doing them speculatively, using the cache, and also by caching intermediate nodes like page directory entries.Informative
(1a) By the way, AMD was reluctant to do TLB miss handling speculatively. I think because they were influenced by DEC VAX Alpha architects. One of the DEC Alpha architects told me rather emphatically that speculative handling of TLB misses, such as P6 was doing, was incorrect and would never work. When I arrived at AMD circa 2002 they still had something called a "TLB Fence" - not a fence instruction, but a point in the uop or microcode sequence where TLB misses either could or could not be allowed to happen - I am afraid that I do not remember exactly how it worked.Informative
(1a') so I think that it is not so much that Bulldozer abandoned TLB and page table walking coherency, whatever that means, as that Bulldozer may have been the first AMD machine to do moderately aggressive TLB miss handling.Informative
(1b) recall that when P6 was started P5 was not shipping: the existing x86es all did cache bypass page table walking in-order, non-speculatively, no asynchronous prefetches, but on write through caches. I.e. They WERE cache coherent, and the OS could rely on deterministic replacement of TLB entries. IIRC I wrote those architectural rules about speculative and non-deterministic cacheability, both for TLB entries and for data and instruction caches. You can't blame OSes like Windows and UNIX and Netware for not following page table and TLB management rules that did not exist at the time.Informative
(2) one of my biggest regrets wrt P6 is that we did not provide Intra-instruction TLB consistency support. Some instructions access the same page more than once. It was possible for different uops in the same instruction to get different translations for the same address. If we had given microcode the ability to save a physical address translation, and then use that, things would have been better IMHO.Informative
(2a) I was a RISC proponent when I joined P6, and my attitude was "let SW (microcode) do it".Informative
(2a') one of the most embarrassing bugs was related to add-with-carry to memory, in early microcode. The load would go, the carry flag would be updated, and the store could fault - but the carry flag had already been updated, so the instruction could not be restarted. // It was a simple microcode fix, doing the store before the carry flag was written - but one extra uop was enough to make that instruction not fit in the "medium speed" ucode system.Informative
(3) Anyway - the main "support" P6 and its descendants gave to handling TLB coherency issues was to rewalk the page tables at retirement before reporting a fault. This avoided confusing the OS by reporting a fault when the page tables said there should not be one.Informative
(4) meta comment: I don't think that any architecture has properly defined rules for caching of invalid TLB entries. // AFAIK most processors do not cache invalid TLB entries - except possibly Itanium with its NAT (Not A Thing) pages. But there's a real need: speculative memory accesses may be to wild addresses, miss the TLB, do an expensive page table walk, slowing down other instructions and threads - and then doing it over and over again because the fact that "this is a bad address, no need to walk the page tables" is not remembered. // I suspect that DOS attacks could use this.Informative
(4') worse, OSes may make implicit assumptions that invalid translations are never cached, and therefore not do a TLB invalidation or MP TLB shoot down when transitioning from invalid to valid. // Worse^2: imagine that you are caching interior nodes of the page table cache. Imagine that a PD contains all invalid PDEs; worse^3, that the PD contains valid PDEs that point to PTs that are all invalid. Are you still allowed to cache those PDEs? Exactly when does the OS need to invalidate an entry?Informative
(4'') because MP TLB shoot downs using interprocessor interrupts were expensive, OS performance guys (like I used to be) are always making arguments like "we don't need to invalidate the TLB after changing a PTE from invalid to valid" or "from valid read-only to valid writable with a different address". Or "we don't need to invalidate the TLB after changing a PDE to point to a different PT whose PTEs are exactly the same as the original PT...". // Lots of great ingenious arguments. Unfortunately not always correct.Informative
Some of my computer architect friends now espouse coherent TLBs: TLBs that snoop writes just like data caches. Mainly to allow us to build even more aggressive TLBs and page table caches, of both valid and invalid entries of leaf and interior nodes. And not to have to worry about OS guys' assumptions. // I am not there yet: too expensive for low end hardware. But might be worth doing at high end.Informative
Thanks, Andy. This is some great history! I feel like it belongs in an answer somewhere, either this one as a giant aside, or maybe a self-answered Q&A if we can think of a good "question" that has this as the answer :P Holy crap, so that's where that extra ALU uop comes from in memory-destination ADC, even on Core2 and SnB-family? Never would have guessed, but had been puzzled by it.Motherland
Yep: often when you "do the RISC thing" extra instructions or micro instructions are required, in a careful order. Whereas if you have "CISCy" support, like special hardware support so that a single instruction is a transaction, either all done or all not done, shorter code sequences can be used.Informative
Something similar applies to self modifying code: it was not so much that we wanted to make self modifying code run fast, as that trying to make the legacy mechanisms for self modifying code - draining the pipe for serializing instructions like CPUID - were slower than just snooping the Icache and pipeline. But, again, this applies to a high end machine: on a low end machine, the legacy mechanisms are fast enough and cheap.Informative
Ditto memory ordering. High end snooping faster; low end draining cheaper.Informative
It is hard to maintain this dichotomy.Informative
It is pretty common that a particular implementation has to implement rules compatible with but stronger than the architectural statement. But not all implementations have to do it the same way.Informative
Cute: what I said about "DOS Attacks" relating to caching of invalid translations can go both ways. I meant that malware running on one thread could try to deny service to a different thread on the same CPU. // But also: the recent IoT DOS attack on dyn.com might have been helped by DNS servers caching, not just a single invalid hostname, but something like a Bloom filter where valid addresses were concentrated in a few bits, and the invalid host names that the attackers generated would hopefully map to zero bits in the Bloom filter.Informative
I quoted your comments into an old answer of mine about TLB miss handling, where maybe more people will see this interesting stuff. Thanks again for the look behind the scenes of why stuff was designed the way it is, and the history.Motherland

Sandybridge-family (at least Skylake) still has the same behaviour, apparently snooping on physical address.

Your test is somewhat overcomplicated, though. I don't see the point of the far jump, and if you assemble (and link if necessary) the SMC function into a flat binary you can just open + mmap it twice. Make a1 and a2 function pointers, then main can return a1(a2) after mapping.

Here's a simple test harness, in case anyone wants to try on their own machine: (The open/assert/mmap block was copied from the question, thanks for the starting point.)

(Downside: you have to rebuild the SMC flat binary every time, because mapping it with MAP_SHARED actually modifies it. IDK how to get two mappings of the same physical page that won't modify the underlying file; writing to a MAP_PRIVATE mapping would COW it to a different physical page. So writing the machine code to a file and then mapping it makes sense now that I realize this. But my asm is still a lot simpler. See the memfd_create sketch after the test output below for one possible workaround.)

// smc-stale.c
#include <sys/mman.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>

typedef int (*intfunc_t)(void *);   // __attribute__((sysv_abi))  // in case you're on Windows.

int main() {
    int fd = open("smc-func", O_RDWR);

    assert(fd>=0);
    intfunc_t a1 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_FILE | MAP_SHARED, fd, 0);
    intfunc_t a2 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_FILE | MAP_SHARED, fd, 0);
    assert(a1 != a2);
    return a1(a2);
}

NASM source for the test function:

(See How to generate plain binaries like nasm -f bin with the GNU GAS assembler? for an as+ld alternative to nasm -f bin)

;;build with nasm smc-func.asm     -fbin is the default.
bits 64
entry:   ; rdi = another mapping of the same page that's executing
    mov  byte [rdi+dummy-entry], 0xcc       ; trigger any copy-on-write page fault now

    mov  r8, rbx    ; CPUID steps on call-preserved RBX
    cpuid               ; serialize for good measure
    mov  rbx, r8
;    mfence
;    lfence

    mov   dword [rdi + retmov+1 - entry],  0       ; return 0 for snooping
retmov:
    mov   eax, 1      ; opcode + imm32             ; return 1 for stale
    ret

dummy:  dd 0xcccccccc

On an i7-6700k running Linux 4.20.3-arch1-1-ARCH, we do not observe stale code fetch. The mov that overwrote the immediate 1 with a 0 did modify that instruction before it ran.

peter@volta:~/src/experiments$ gcc -Og -g smc-stale.c
peter@volta:~/src/experiments$ nasm smc-func.asm && ./a.out; echo $?
0
# remember to rebuild smc-func every time, because MAP_SHARED modifies it
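One way around the rebuild (and the on-disk modification through MAP_SHARED) could be Linux's memfd_create(2): copy the machine code into an anonymous in-memory file and map that twice. A minimal sketch (my addition, assuming glibc 2.27+ for the memfd_create wrapper):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>

/* Copy len bytes of machine code into an anonymous memfd, then map it
   twice: both views share one physical page and nothing touches the disk. */
static void *map_twice(const void *code, size_t len, void **second) {
    int fd = memfd_create("smc", 0);
    assert(fd >= 0 && len <= 0x1000);
    assert(ftruncate(fd, 0x1000) == 0);
    void *a1 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
                    MAP_SHARED, fd, 0);
    void *a2 = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
                    MAP_SHARED, fd, 0);
    assert(a1 != MAP_FAILED && a2 != MAP_FAILED);
    memcpy(a1, code, len);    /* write through one view, visible in both */
    close(fd);                /* the mappings keep the pages alive */
    *second = a2;
    return a1;
}

The harness above would then read the smc-func bytes once and call this helper instead of its open/mmap block.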
Motherland answered 8/2, 2019 at 6:32 Comment(1)
This was initially a little confusing to me because in the original question and code, a return value of 0 indicates the stale instruction was executed, but you have flipped it in your code so that 1 indicates the stale instruction, 0 indicates that the instruction was modified before execution. However, the conclusion is the same: no stale instruction execution is observed. Thanks for the contribution!Viridity

It targets a much older CPU (the Intel 8088), but the 4-channel music player at the end of the 8088 MPH demo not only executes stale instructions but depends on them being stale! https://www.reenigne.org/blog/8088-pc-speaker-mod-player-how-its-done/

Acariasis answered 11/12, 2022 at 17:46 Comment(1)
Fun fact: stale instruction fetch was one way to detect 8086 vs. 8088, as mentioned in an answer on Why did x86 support self-modifying code in the 80s and 90s?. 8086 had a 6 byte prefetch queue, vs. 4 on 8088.Motherland
