All the following instructions do the same thing: set %eax
to zero. Which way is optimal (requiring fewest machine cycles)?
xorl %eax, %eax
mov $0, %eax
andl $0, %eax
All the following instructions do the same thing: set %eax
to zero. Which way is optimal (requiring fewest machine cycles)?
xorl %eax, %eax
mov $0, %eax
andl $0, %eax
TL;DR summary: xor same, same
is the best choice for all CPUs. No other method has any advantage over it, and it has at least some advantage over any other method. It's officially recommended by Intel and AMD, and what compilers do. In 64-bit mode, still use xor r32, r32
, because writing a 32-bit reg zeros the upper 32. xor r64, r64
is a waste of a byte, because it needs a REX prefix.
Even worse than that, Silvermont only recognizes xor r32,r32
as dep-breaking, not 64-bit operand-size. Thus even when a REX prefix is still required because you're zeroing r8..r15, use xor r10d,r10d
, not xor r10,r10
.
GP-integer examples:
xor eax, eax ; RAX = 0. Including AL=0 etc.
xor r10d, r10d ; R10 = 0. Still prefer 32-bit operand-size.
xor edx, edx ; RDX = 0
; small code-size alternative: cdq ; zero RDX if EAX is already zero
; SUB-OPTIMAL
xor rax,rax ; waste of a REX prefix, and extra slow on Silvermont
xor r10,r10 ; bad on Silvermont (not dep breaking), same as r10d on other CPUs because a REX prefix is still needed for r10d or r10.
mov eax, 0 ; doesn't touch FLAGS, but not faster and takes more bytes
and eax, 0 ; false dependency. (Microbenchmark experiments might want this)
sub eax, eax ; same as xor on most but not all CPUs; bad on Silvermont for example.
xor cl, cl ; false dep on some CPUs, not a zeroing idiom. Use xor ecx,ecx
mov cl, 0 ; only 2 bytes, and probably better than xor cl,cl *if* you need to leave the rest of ECX/RCX unmodified
Zeroing a vector register is usually best done with pxor xmm, xmm
. That's typically what gcc does (even before use with FP instructions).
xorps xmm, xmm
can make sense. It's one byte shorter than pxor
, but xorps
needs execution port 5 on Intel Nehalem, while pxor
can run on any port (0/1/5). (Nehalem's 2c bypass delay latency between integer and FP is usually not relevant, because out-of-order execution can typically hide it at the start of a new dependency chain).
On SnB-family microarchitectures, neither flavour of xor-zeroing even needs an execution port. On AMD, and pre-Nehalem P6/Core2 Intel, xorps
and pxor
are handled the same way (as vector-integer instructions).
Using the AVX version of a 128b vector instruction zeros the upper part of the reg as well, so vpxor xmm, xmm, xmm
is a good choice for zeroing YMM(AVX1/AVX2) or ZMM(AVX512), or any future vector extension. vpxor ymm, ymm, ymm
doesn't take any extra bytes to encode, though, and runs the same on Intel, but slower on AMD before Zen2 (2 uops). The AVX512 ZMM zeroing would require extra bytes (for the EVEX prefix), so XMM or YMM zeroing should be preferred.
XMM/YMM/ZMM examples
# Good:
xorps xmm0, xmm0 ; smallest code size (for non-AVX)
pxor xmm0, xmm0 ; costs an extra byte, runs on any port on Nehalem.
xorps xmm15, xmm15 ; Needs a REX prefix but that's unavoidable if you need to use high registers without AVX. Code-size is the only penalty.
# Good with AVX:
vpxor xmm0, xmm0, xmm0 ; zeros X/Y/ZMM0
vpxor xmm15, xmm0, xmm0 ; zeros X/Y/ZMM15, still only 2-byte VEX prefix
#sub-optimal AVX
vpxor xmm15, xmm15, xmm15 ; 3-byte VEX prefix because of high source reg
vpxor ymm0, ymm0, ymm0 ; decodes to 2 uops on AMD before Zen2
# Good with AVX512
vpxor xmm15, xmm0, xmm0 ; zero ZMM15 using an AVX1-encoded instruction (2-byte VEX prefix).
vpxord xmm30, xmm30, xmm30 ; EVEX is unavoidable when zeroing zmm16..31, but still prefer XMM or YMM for fewer uops on probable future AMD. May be worth using only high regs to avoid needing vzeroupper in short functions.
# Good with AVX512 *without* AVX512VL (e.g. KNL / Xeon Phi)
vpxord zmm30, zmm30, zmm30 ; Without AVX512VL you have to use a 512-bit instruction.
# sub-optimal with AVX512 (even without AVX512VL)
vpxord zmm0, zmm0, zmm0 ; EVEX prefix (4 bytes), and a 512-bit uop. Use AVX1 vpxor xmm0, xmm0, xmm0 even on KNL to save code size.
See Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm? and
What is the most efficient way to clear a single or a few ZMM registers on Knights Landing?
Semi-related: Fastest way to set __m256 value to all ONE bits and
Set all bits in CPU register to 1 efficiently also covers AVX512 k0..7
mask registers. SSE/AVX vpcmpeqd
is dep-breaking on many (although still needs a uop to write the 1s), but AVX512 vpternlogd
for ZMM regs isn't even dep-breaking. Inside a loop consider copying from another register instead of re-creating ones with an ALU uop, especially with AVX512.
But zeroing is cheap: xor-zeroing an xmm reg inside a loop is usually as good as copying, except on some AMD CPUs (Bulldozer and Zen) which have mov-elimination for vector regs but still need an ALU uop to write zeros for xor-zeroing.
Some CPUs recognize sub same,same
as a zeroing idiom like xor
, but all CPUs that recognize any zeroing idioms recognize xor
. Just use xor
so you don't have to worry about which CPU recognizes which zeroing idiom.
xor
(being a recognized zeroing idiom, unlike mov reg, 0
) has some obvious and some subtle advantages (summary list, then I'll expand on those):
mov reg,0
. (All CPUs)Smaller machine-code size (2 bytes instead of 5) is always an advantage: Higher code density leads to fewer instruction-cache misses, and better instruction fetch and potentially decode bandwidth.
The benefit of not using an execution unit for xor on Intel SnB-family microarchitectures is minor, but saves power. It's more likely to matter on SnB or IvB, which only have 3 ALU execution ports. Haswell and later have 4 execution ports that can handle integer ALU instructions, including mov r32, imm32
, so with perfect decision-making by the scheduler (which doesn't always happen in practice), HSW could still sustain 4 uops per clock even when they all need ALU execution ports.
See my answer on another question about zeroing registers for some more details.
Bruce Dawson's blog post that Michael Petch linked (in a comment on the question) points out that xor
is handled at the register-rename stage without needing an execution unit (zero uops in the unfused domain), but missed the fact that it's still one uop in the fused domain. Modern Intel CPUs can issue & retire 4 fused-domain uops per clock. That's where the 4 zeros per clock limit comes from. Increased complexity of the register renaming hardware is only one of the reasons for limiting the width of the design to 4. (Bruce has written some very excellent blog posts, like his series on FP math and x87 / SSE / rounding issues, which I do highly recommend).
On AMD Bulldozer-family CPUs, mov immediate
runs on the same EX0/EX1 integer execution ports as xor
. mov reg,reg
can also run on AGU0/1, but that's only for register copying, not for setting from immediates. So AFAIK, on AMD the only advantage to xor
over mov
is the shorter encoding. It might also save physical register resources, but I haven't seen any tests.
Recognized zeroing idioms avoid partial-register penalties on Intel CPUs which rename partial registers separately from full registers (P6 & SnB families).
xor
will tag the register as having the upper parts zeroed, so xor eax, eax
/ inc al
/ inc eax
avoids the usual partial-register penalty that pre-IvB CPUs have. Even without xor
, IvB and later only needs a merging uop when the high 8bits (AH
) are modified and then the whole register is read. (Agner incorrectly states that Haswell removes AH merging penalties.)
From Agner Fog's microarch guide, pg 98 (Pentium M section, referenced by later sections including SnB):
The processor recognizes the XOR of a register with itself as setting it to zero. A special tag in the register remembers that the high part of the register is zero so that EAX = AL. This tag is remembered even in a loop:
; Example 7.9. Partial register problem avoided in loop
xor eax, eax
mov ecx, 100
LL:
mov al, [esi]
mov [edi], eax ; No extra uop
inc esi
add edi, 4
dec ecx
jnz LL
(from pg82): The processor remembers that the upper 24 bits of EAX are zero as long as you don't get an interrupt, misprediction, or other serializing event.
pg82 of that guide also confirms that mov reg, 0
is not recognized as a zeroing idiom, at least on early P6 designs like PIII or PM. I'd be very surprised if they spent transistors on detecting it on later CPUs.
xor
sets flags, which means you have to be careful when testing conditions. Since setcc
is unfortunately only available with an 8-bit destination (until APX extension1), you usually need to take care to avoid partial-register penalties.
It would have been nice if x86-64 repurposed one of the removed opcodes (like AAM) for a 16/32/64 bit setcc r/m
, with the predicate encoded in the source-register 3-bit field of the r/m field (the way some other single-operand instructions use them as opcode bits). But they didn't do that, and that wouldn't help for x86-32 anyway.
Ideally, you should use xor
/ set flags / setcc
/ read full register:
...
call some_func
xor ecx,ecx ; zero *before* setting FLAGS
cmp eax, 42
setnz cl ; ecx = cl = (some_func() != 42)
add ebx, ecx ; no partial-register penalty here
This has optimal performance on all CPUs (no stalls, merging uops, or false dependencies). (If the condition was ebx += (eax != 0)
, there are tricks like cmp eax, 1; sbb ebx, -1
using the carry flag with adc
or sbb
to add or subtract it directly, instead of materializing it as a 0/1 integer, as @l4m2 pointed out in comments. It might even be worth it to do sub eax, 42
(or LEA into another reg) / cmp eax,1
/ sbb
. Especially if it's hard to arrange to xor-zero before setting FLAGS, since cmp
/setcc
/movzx
/add
has all 4 operations on the critical path for latency.)
Things are more complicated when you don't want to xor before a flag-setting instruction. e.g. you want to branch on one condition and then setcc on another condition from the same flags. e.g. cmp/jle
, sete
, and you either don't have a spare register, or you want to keep the xor
out of the not-taken code path altogether.
There are no recognized zeroing idioms that don't affect flags, so the best choice depends on the target microarchitecture. On Core2, inserting a merging uop might cause a 2 or 3 cycle stall. It's cheaper on SnB, like 1 cycle at worst, and Haswell and later don't rename partial registers separately from full regs. Using mov reg, 0
/ setcc
is probably best on recent CPUs, but would have a significant penalty on older Intel CPUs (Nehalem and earlier). On newer CPUs it's close to as good as xor-zeroing, but has worse code-size than movzx
.
Using setcc
/ movzx r32, r8
is probably the best alternative for Intel P6, if you can't xor-zero ahead of the flag-setting instruction. That should be better than repeating the test after an xor-zeroing. (Don't even consider sahf
/ lahf
or pushf
/ popf
). IvB and later (except for Ice Lake) can eliminate movzx r32, r8
(i.e. handle it with register-renaming with no execution unit or latency, like xor-zeroing). AMD Zen family can only eliminate regular mov
instructions, so movzx
takes an execution unit and has non-zero latency, making test/setcc
/movzx
worse than xor
/test/setcc
.
Also worse than test/mov r,0
/setcc
(but much better on older Intel CPUs with partial-register stalls).
Using setcc
/ movzx
with no zeroing first is bad on AMD/P4/Silvermont, because they don't track deps separately for sub-registers. There would be a false dep on the old value of the register. Using mov reg, 0
/setcc
for zeroing / dependency-breaking is probably the best alternative when xor
/test/setcc
isn't an option. At least for "hot" code where this is part of an important latency chain. Otherwise go for movzx
to save a bit of code size.
Of course, if you don't need setcc
's output to be wider than 8 bits, you don't need to zero anything. However, beware of false dependencies on CPUs other than P6 / SnB if you pick a register that was recently part of a long dependency chain. (And beware of causing a partial reg stall or extra uop if you call a function that might save/restore the register you're using part of.)
and
with an immediate zero isn't special-cased as independent of the old value on any CPUs I'm aware of, so it doesn't break dependency chains. It has no advantages over xor
and many disadvantages.
It's useful only for writing microbenchmarks when you want a dependency as part of a latency test, but want to create a known value by zeroing and adding.
See http://agner.org/optimize/ for microarch details, including which zeroing idioms are recognized as dependency breaking (e.g. sub same,same
is on some but not all CPUs, while xor same,same
is recognized on all.) mov
does break the dependency chain on the old value of the register (regardless of the source value, zero or not, because that's how mov
works). xor
only breaks dependency chains in the special-case where src and dest are the same register, which is why mov
is left out of the list of specially recognized dependency-breakers. (Also, because it's not recognized as a zeroing idiom, with the other benefits that carries.)
Interestingly, the oldest P6 design (PPro through Pentium III) didn't recognize xor
-zeroing as a dependency-breaker, only as a zeroing idiom for the purposes of avoiding partial-register stalls, so in some cases it was worth using both mov
and then xor
-zeroing in that order to break the dep and then zero again + set the internal tag bit that the high bits are zero so EAX=AX=AL.
See Agner Fog's Example 6.17. in his microarch pdf. He says this also applies to P2, P3, and even (early?) PM. A comment on the linked blog post says it was only PPro that had this oversight, but I've tested on Katmai PIII, and @Fanael tested on a Pentium M, and we both found that it didn't break a dependency for a latency-bound imul
chain. This confirms Agner Fog's results, unfortunately.
Footnote 1: Intel Advanced Performance Extensions (APX) introduces REX2 and EVEX forms of integer instructions, for 32 GPRs and new 3-operand forms of common instructions. And finally a zero-extending ("zero-upper" aka ZU) form of setcc r64
. (Total instruction length of 6 bytes, using one of the spare bits in the EVEX prefix to encode legacy vs. zero-upper behaviour for register destinations.)
If it really makes your code nicer or saves instructions, then sure, zero with mov
to avoid touching the flags, as long as you don't introduce a performance problem other than code size. Avoiding clobbering flags is the only sensible reason for not using xor
, but sometimes you can xor-zero ahead of the thing that sets flags if you have a spare register.
mov
-zero ahead of setcc
is better for latency than movzx reg32, reg8
after (except on Intel when you can pick different registers), but worse code size.
mov reg, src
also breaks dep chains for OO CPUs (regardless of src being imm32, [mem]
, or another register). This dependency-breaking doesn't get mentioned in optimization manuals because it's not a special case that only happens when src and dest are the same register. It always happens for instructions that don't depend on their dest. (except for Intel's implementation of popcnt/lzcnt/tzcnt
having a false dep on the dest.) –
Frogfish mov
free, only zero latency. The "not taking an execution port" part usually isn't important. Fused-domain throughput can easily be the bottleneck, esp. with loads or stores in the mix. –
Frogfish xor r64, r64
does not just waste a byte. As you say xor r32, r32
is the best choice especially with KNL. See section 15.7 "Special cases of independence" in this micrarch manual if you want to read more. –
Eusporangiate xor r8d,r8d
). But then I got side-tracked setting up my new Skylake i7-6700k desktop with 16G of DDR4-2666 RAM :) Bit of an upgrade from 65nm Core2Duo, e.g. 8x faster video encoding with x264 (which has good tuning/optimization for Core2, unlike x265). I'll get back to that edit Real Soon Now, since I still have the text saved. –
Frogfish gcc
somewhat stupidly issuing a mov reg, 0
rather than an xor
for a simple function. Sure, it is probably doing that because it needs the flags preserved from the earlier cmp
, but it could have just swapped the order! clang
does fine, and icc
also uses an xor
but only gets part marks because it pointlessly includes a mov esi, esi
in the critical path. –
Buncombe cmp
can't be deferred. (e.g. if it wanted to result in esi
). In other cases of gcc making worse code (like setcc/movzx
instead of xor/setcc
), it usually looks like an idiom designed to reduce register pressure, used even when there isn't any. –
Frogfish setcc/movzx
as a single thing that stays together, vs. having to add stuff to the function's internal representation to express "ok, we need an xor-zeroed register before the flag-setting", and maybe have a fallback in case that's hard to do. (Although in most cases you'd expect it would just end up redoing a test
or cmp
). –
Frogfish mov
-immediate or a regular XOR, and probably less than any other instruction other than pause
. But still much more than sitting in low-power sleep. Modern digital logic is very far from the information-theoretic limits of energy per computation, and nothing they do internally has negative cost. –
Frogfish setcc
what does that have to do with zeroing? How does that add to the xor
-idiom? –
Ranchero imul eax, eax/xor eax, eax
, with the reasoning that if xor
is dep-breaking then the loop will be throughput bound and latency bound if it's not… –
Introspection xor
is dependency breaking. So it's clear that Agner is correct here in that xor
is not dependency breaking on Pentium II/III and Pentium M. It may have changed in Yonah, the last generation of Pentium M sold as Core Solo and Core Duo (note, not Core 2), but I don't have that hardware to test. –
Introspection xor
-zeroing. Is it mov
with immediate? And would you do them in the order mov reg, 0
\ xor reg, reg
, or the other way around? –
Excoriate mov reg, 0
, then zero it again and set the internal EAX = AX = AL tag bit (avoiding partial-register stalls) with xor reg,reg
. i.e. the "upper bits zeroed" tag. IDK why xor-zeroing didn't break the dependency; maybe recognizing that in the decoders or issue stage would have taken extra logic vs. handling it only in the boolean logic execution unit? But AFAIK it only worked with xor same,same
, not just any xor that happened to produce 0
, and the ALU doesn't know if its inputs came from a reg, mem, or immed. –
Frogfish ebx
by 1 iff eax
is not zero, is mov ecx, eax; add ecx, -1; adc ebx, 0
shorter and faster? –
Ewen cmp eax, 1; sbb ebx, -1
enough –
Ewen x == 4
or x & 0x88
that doesn't allow optimization to get the condition in CF for adc
or sbb
–
Frogfish and mem, dword 0
is 3 bytes shorter than mov mem, dword 0
and 1 byte shorter than xor eax, eax/mov mem, eax
, which may be useful to fit in small area –
Ewen and dword mem, 0
is also slower, with a false dependency on the old value of memory, so usually only useful for code-size optimization at the expense of speed. Post in Tips for golfing in x86 machine code if it's not there already. (Coding style: I think it makes a lot more sense to put the size override on the memory operand, not the immediate, especially to distinguish from things like and eax, strict dword 0
to request an imm32 encoding.) –
Frogfish setcc r64
, right? –
Excoriate setcc r/m16
wouldn't be very valuable, so APX providing setcc.zu full_reg
via an EVEX for the same opcodes solves the same problem for the only important case, at the cost of larger code-size than my proposed way. It would've been nice if the REX2 encoding defaulted to zero-extend so a full 4-byte EVEX prefix would only be needed to not do that w. r16-r31. –
Frogfish © 2022 - 2024 — McMap. All rights reserved.