X64 instructions that behave differently on different CPUs

Asked 15/11, 2016 at 23:56 Answered 16/11, 2016 at 2:59

During an interview I was asked whether I knew of any x64 instructions that behave differently depending on the CPU used. I couldn't find any documentation on that anywhere. Does anyone know what these instructions are, and why this is the case?

Alkyne answered 15/11, 2016 at 23:56 Comment(3)

@PeterCordes The answer to this question depends much on the precise definition of "instruction" and "CPU". For instance, cpuid is a dead giveaway, but on P4 it controversially exposed the serial number. – Sebrinasebum 16/11, 2016 at 0:13

There are not a lot of companies left that bake x86 and x64 processor chips, AMD and Intel are the only real survivors. x64 was invented by AMD, later adopted by Intel. They did not always agree how to extend the architecture, 3DNow and MMX are pretty dead. Having learned from their mistakes, they did get their act together in userland code. But protected mode and virtualization, the kind of thing that an OS cares about, is still quite different. Yes, you could have easily answered cpuid :) – Suh 16/11, 2016 at 0:37

MMX is I think a required baseline part of x86-64, since x86-64 requires SSE and SSE2, and some SSE instructions operate on MMX registers. So we can expect CPUs to keep supporting it basically forever, but even on Skylake the MMX versions of some instructions have lower throughput than their XMM equivalents. MMX support in future CPUs could become even more vestigial, and be micro-coded or something. (Contrast with 3DNow, which is already dead and completely unsupported by most CPUs, including even AMD's newer CPUs.) – Latham 16/11, 2016 at 3:2

There are some that leave a register or some flags with undefined values. Intel and AMD may differ there.

In some cases, the actual behaviour of real hardware for these undefined cases preserves backwards compatibility for some old software that relies on it. For example, BSF with input=0 sets ZF and leaves the destination register unmodified. (That's on both current Intel and AMD hardware. IDK if any old Intel hardware was ever different; if not, bsf/bsr isn't really an example of an instruction that executes differently, just a lack of documented guarantees of being future-proof.)

But the difference is that Intel documents it as leaving the destination register with "undefined" contents. AMD's manuals explicitly document and guarantee that AMD CPUs will leave the destination unmodified in that case.

AMD's AMD64 manual (March 2017) for bsr/bsf:
If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register

So it's not guaranteed on paper that it's safe to emulate tzcnt / implement std::countr_zero as mov eax, 32 / bsf eax, edx, even though that works in practice and will likely continue working on future CPUs. (This is why bsf / bsr have an output dependency.) Intel might eventually document this behaviour, in which case compilers will be able to use it for a more efficient countr_zero / countl_zero without BMI1. Intel did recently document that AVX implied 16-byte aligned loads / stores were atomic on Intel CPUs, so it's not unprecedented for a vendor to document something that their CPUs have been doing for years.
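A minimal sketch of the `mov eax, 32` / `bsf eax, edx` idea in C, assuming GCC or Clang inline asm. The names `tzcnt32_via_bsf` and `tzcnt32_spot_check` are made up for illustration. The x86 path leans on exactly the behaviour Intel does not guarantee on paper, so treat it as a demonstration rather than something to ship:

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zeros of a 32-bit value, returning 32 for x == 0.
 * ASSUMPTION: BSF leaves its destination unmodified when the source is 0.
 * AMD documents this; Intel's manual calls the destination "undefined". */
static uint32_t tzcnt32_via_bsf(uint32_t x)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    uint32_t result = 32;          /* the "mov eax, 32" step */
    __asm__("bsfl %1, %0"          /* the "bsf eax, edx" step (AT&T operand order) */
            : "+r"(result)         /* read-write operand: the old value may survive */
            : "r"(x)
            : "cc");               /* BSF writes ZF */
    return result;
#else
    /* Portable fallback with the same results, for other targets. */
    uint32_t n = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
#endif
}

/* Spot checks: the zero case is the one that depends on the assumption. */
static void tzcnt32_spot_check(void)
{
    assert(tzcnt32_via_bsf(0) == 32);
    assert(tzcnt32_via_bsf(8) == 3);
}
```

The `"+r"` constraint is the C-level counterpart of the output dependency mentioned above: the instruction has to be treated as reading its destination, because the old value is what comes out when the input is zero.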

If performance differences count, there are many (see links in the x86 tag wiki)!

You're not just talking about unsupported instructions, are you? Like LAHF/SAHF being unsupported in long mode on some very early x86-64 CPUs? Or CMPXCHG16B also unsupported on early K8.

The most interesting case of unsupported instructions is that LZCNT decodes as BSR on CPUs that don't support it, the REP prefix being ignored. Even for non-zero inputs, they return opposite results. (_lzcnt_u32(x) == 31-bsr(x)). TZCNT similarly decodes as (REP) BSF on CPUs that don't support it, but they do the same thing except when input = 0. I didn't mention this earlier, because running the same machine-code differently is not the same thing as running the same instruction differently, but it sounds like this is the kind of thing you're asking for.
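To make the "opposite results" point concrete, here is a small portable model in C. It does not execute the real instructions; the `model_*` functions are hypothetical software stand-ins that only mirror the architectural results described above for 32-bit operands:

```c
#include <assert.h>
#include <stdint.h>

/* BSR model: index of the highest set bit. Input 0 is rejected because the
 * destination is then unmodified/undefined, as discussed earlier. */
static unsigned model_bsr32(uint32_t x)
{
    assert(x != 0);
    unsigned idx = 31;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        idx--;
    }
    return idx;
}

/* LZCNT model: number of leading zero bits, 32 for input 0. */
static unsigned model_lzcnt32(uint32_t x)
{
    unsigned n = 0;
    while (n < 32 && (x & (0x80000000u >> n)) == 0)
        n++;
    return n;
}

/* BSF model: index of the lowest set bit, input 0 rejected as for BSR. */
static unsigned model_bsf32(uint32_t x)
{
    assert(x != 0);
    unsigned idx = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        idx++;
    }
    return idx;
}

/* TZCNT model: number of trailing zero bits, 32 for input 0. */
static unsigned model_tzcnt32(uint32_t x)
{
    unsigned n = 0;
    while (n < 32 && (x & (1u << n)) == 0)
        n++;
    return n;
}
```

For x = 1, the BSR model gives 0 while the LZCNT model gives 31, so code that assumes LZCNT but actually gets REP BSR is wrong for every input. The TZCNT and BSF models agree for every non-zero x and only part ways at x = 0.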

Are we talking only about un-privileged instructions? There are probably many more differences in the behaviour of privileged instructions. For example, Intel and AMD both have different bugs in SYSRET that Linux has to work around to avoid malicious user-space being able to cause a kernel hang.

Another case that I'm not sure counts: PREFETCHW runs on Intel CPUs from at least Core2 to Haswell as a NOP, but on AMD CPUs (and Intel since Broadwell) as an actual prefetch.

So some CPUs run it as a NOP and some run it as a prefetch (no architecturally visible effect either way), except on ancient CPUs where it faults as an illegal instruction. 64-bit Windows 8.1 apparently requires that PREFETCHW can run without faulting (which stops it from running on (some?) 64-bit Pentium 4 CPUs).

Latham answered 16/11, 2016 at 0:12 Comment(6)

Aside from the obvious cpuid, you could chip in that tzcnt executes as rep bsf and lzcnt as rep bsr on CPUs < Haswell because f3 in f3 0f bc /r is interpreted as rep on those CPUs, thus behaving architecturally differently. Those are the three instances I know of architecturally-different behaviour on x64; The co-opting of f2/f3 for the xacquire/xrelease HLE prefixes arguably also counts. – Sebrinasebum 16/11, 2016 at 1:8

I think this is the kind of instruction he was referring to. We were discussing DRM: having code that runs a certain way only on specific CPU models is one way to prevent sharing the unpacked binaries. – Alkyne 16/11, 2016 at 1:37

About bsf/bsr, AMD seems to actually document that the register is unchanged if the input is zero, i.e., aligned with the actual Intel behavior. At least according to this source. – Putdown 16/11, 2016 at 2:2

@IwillnotexistIdonotexist: I didn't realize the OP was interested in the same machine code decoding differently on different CPUs, when the question only talked about the same instruction having different behaviour. There are CPUID flags you can check for LZCNT and BMI1 to let you detect whether you'll get REP BS* or *CNT. – Latham 16/11, 2016 at 2:54
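A rough sketch of that CPUID check, assuming GCC or Clang and their `<cpuid.h>` helpers (`__get_cpuid_count` needs a reasonably recent version of either). The bit positions are the ones I believe the vendor manuals give: LZCNT is CPUID.80000001H:ECX bit 5, and BMI1 (which includes TZCNT) is CPUID.(EAX=07H,ECX=0):EBX bit 3. The function names are made up for illustration:

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header, x86 only */
#endif

/* Extract bit n of a CPUID output register. */
static int cpuid_bit(unsigned reg, int n)
{
    assert(n >= 0 && n < 32);
    return (int)((reg >> n) & 1u);
}

/* 1 if F3 0F BD decodes as LZCNT, 0 if it would run as (REP) BSR. */
static int cpu_has_lzcnt(void)
{
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
#if defined(__x86_64__) || defined(__i386__)
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;                 /* extended leaf not available */
#endif
    (void)eax; (void)ebx; (void)edx;
    return cpuid_bit(ecx, 5);     /* on non-x86 builds ecx stays 0 */
}

/* 1 if F3 0F BC decodes as TZCNT (BMI1), 0 if it would run as (REP) BSF. */
static int cpu_has_bmi1(void)
{
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
#if defined(__x86_64__) || defined(__i386__)
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;                 /* leaf 7 not available */
#endif
    (void)eax; (void)ecx; (void)edx;
    return cpuid_bit(ebx, 3);     /* on non-x86 builds ebx stays 0 */
}
```

Code that wants the *CNT semantics would branch on these once at startup and fall back to a BSF/BSR-based path, with explicit handling of zero, when they return 0.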

isn't LZCNT == 31 - BSR instead of 32-BSR? – Jural 16/11, 2016 at 3:16

@LưuVĩnhPhúc: thanks, I didn't bother to check or think it through carefully, so I'm not surprised I got it wrong :P Fixed now, since wrong information is a Bad Thing. – Latham 16/11, 2016 at 3:19

There are a lot of differences between Intel and AMD:

Intel 64's BSF and BSR instructions act differently than AMD64's when the source is zero and the operand size is 32 bits. The processor sets the zero flag and leaves the upper 32 bits of the destination undefined.

Intel 64 lacks some MSRs that are considered architectural in AMD64. These include SYSCFG, TOP_MEM, and TOP_MEM2.

Intel 64 allows SYSCALL/SYSRET only in 64-bit mode (not in compatibility mode),[33] and allows SYSENTER/SYSEXIT in both modes.[34] AMD64 lacks SYSENTER/SYSEXIT in both sub-modes of long mode.[35]

In 64-bit mode, near branches with the 66H (operand size override) prefix behave differently. Intel 64 ignores this prefix: the instruction has 32-bit sign extended offset, and instruction pointer is not truncated. AMD64 uses 16-bit offset field in the instruction, and clears the top 48 bits of instruction pointer.

AMD processors raise a floating point Invalid Exception when performing an FLD or FSTP of an 80-bit signalling NaN, while Intel processors do not.

When returning to a non-canonical address using SYSRET, AMD64 processors execute the general protection fault handler in privilege level 3, while on Intel 64 processors it is executed in privilege level 0.[38][39]
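The 66H near-branch item in the list above is easy to model in C. This is only a sketch of the two behaviours as the list describes them, using a `66 E9` JMP as the example encoding. The function names are hypothetical, and nothing here executes the real instruction:

```c
#include <assert.h>
#include <stdint.h>

/* Intel 64 model: the 66H prefix is ignored, so the bytes decode as
 * 66 E9 rel32 (6 bytes in total) and RIP is not truncated. */
static uint64_t jmp66_target_intel(uint64_t rip, int32_t rel32)
{
    uint64_t next = rip + 6;                  /* prefix + opcode + 4-byte offset */
    return next + (uint64_t)(int64_t)rel32;   /* sign-extended 32-bit offset */
}

/* AMD64 model: the bytes decode as 66 E9 rel16 (4 bytes in total) and the
 * top 48 bits of the instruction pointer are cleared. */
static uint64_t jmp66_target_amd(uint64_t rip, int16_t rel16)
{
    uint64_t next = rip + 4;                  /* prefix + opcode + 2-byte offset */
    return (next + (uint64_t)(int64_t)rel16) & 0xFFFFu;
}

/* Spot checks for a jump at 0x400000 with offset 0x10. */
static void jmp66_spot_check(void)
{
    assert(jmp66_target_intel(0x400000u, 0x10) == 0x400016u);
    assert(jmp66_target_amd(0x400000u, 0x10) == 0x0014u);
}
```

Besides landing at different addresses, the two vendors consume a different number of bytes for the same instruction, so the same byte stream can desynchronise from that point on.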

https://en.wikipedia.org/wiki/X86-64#Differences_between_AMD64_and_Intel_64

Jural answered 16/11, 2016 at 2:59 Comment(0)
