Test whether a register is zero with CMP reg,0 vs OR reg,reg?

Is there any execution speed difference using the following code:

cmp al, 0
je done

and the following:

or al, al
jz done

I know that the JE and JZ instructions are the same, and also that using OR gives a size improvement of one byte. However, I am also concerned with code speed. It seems that logical operators will be faster than a SUB or a CMP, but I just wanted to make sure. This might be a trade-off between size and speed, or a win-win (of course the code will be more opaque).

Accommodative answered 15/11, 2015 at 15:8 Comment(2)
The Intel optimization manual says: "Use a TEST of a register with itself instead of a CMP of the register to zero"; this saves the need to encode the zero, so it's pretty much only a size difference. Macro-op fusion also applies to both. A quick glance at the Agner Fog tables suggests the speed is the same for CMP and OR on most CPUs.Circumscissile
@Jester: OR can't macro-fuse with anything. Older CPUs (Core2) can only macro-fuse signed-comparisons with test, but not cmp. AMD CPUs can only macro-fuse cmp and test, never an op that also writes a register.Spearing

It depends on the exact code sequence, which specific CPU it is, and other factors.

The main problem with or al, al is that it "modifies" EAX, which means that a subsequent instruction that uses EAX in some way may stall until this instruction completes. Note that the conditional branch (jz) also depends on the instruction, but CPU manufacturers do a lot of work (branch prediction and speculative execution) to mitigate that. Also note that in theory a CPU manufacturer could design a CPU that recognises EAX isn't changed in this specific case, but there are hundreds of these special cases and the benefit of recognising most of them would be too small.
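
For example (the add is a hypothetical follow-up instruction, just to show the dependency):

or al, al       ; rewrites AL with the same value
jz done
add bl, al      ; this read of AL now has to wait for the OR, not for whatever originally produced AL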

The main problem with cmp al,0 is that it's slightly larger, which might mean slower instruction fetch/more cache pressure, and (if it is a loop) might mean that the code no longer fits in some CPU's "loop buffer".

As Jester pointed out in comments, test al,al avoids both problems: it's no larger than cmp al,0 (and smaller for registers other than AL) and doesn't modify EAX.

Of course (depending on the specific sequence) the value in AL must've come from somewhere, and if it came from an instruction that set flags appropriately it might be possible to modify the code to avoid using another instruction to set flags again later.
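
For example (a hypothetical ASCII-digit check, just to illustrate): if the value was produced by an instruction that already set flags according to AL, the branch can consume those flags directly:

sub al, '0'     ; convert an ASCII digit; also sets ZF if the character was '0'
jz was_zero     ; no separate test/cmp needed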

Extrapolate answered 15/11, 2015 at 16:37 Comment(8)
The value in AL comes from a BIOS interrupt, so that doesn't qualify as 'setting flags appropriately'... iret would restore flags anyway. I also had in mind a print subroutine that used lodsb and checked for a null terminator; does lodsb alter flags based on what is in AL?Accommodative
@AnonymousShadow In that context the performance of your comparison instruction is insignificant and you shouldn't worry about it. A BIOS interrupt will take hundreds of cycles at minimum, up to billions of cycles for a slow I/O operation.Ampoule
@RossRidge what about using LODSB with a huge string? makes a difference size-wise anyway, might as well use it.Accommodative
@AnonymousShadow: Use lodsb if optimizing for code size. Otherwise, mov al, [esi] / inc esi decodes to only 2 uops instead of 3 on Intel CPUs (e.g. Haswell), so it potentially runs faster. Depending on your loop, you might be able to avoid the pointer increment with a more complex addressing mode (smaller code size, but 2-register addressing modes can't micro-fuse on Intel SnB-family). See my answer for why test is better for the same reason (fewer uops thanks to macro-fusion with a branch). If you're using setcc to consume the flags, rather than a branch, it's less important.Spearing
@Extrapolate Both test al,al and cmp al,0 occupy 2 bytes. It's only when you start using another register that the sizes differ.Momus
Update on my previous comment: mov al, [esi + edx] is always 1 uop. Pure loads don't need to micro-fuse. Indexed addressing modes are only an issue for stores or memory operands to ALU instructions.Spearing
update to previous comment #2: mov al, [esi + edx] isn't a pure load on Haswell and later: AL isn't renamed separately from EAX/RAX, so it takes an ALU uop to merge into the full register. But Haswell can keep that load+merge micro-fused. movzx eax, [esi + edx] is a pure load and is generally preferable for byte loads unless you're optimizing for code-size over speed.Spearing
@PeterCordes In your last example in this comment your movzx instruction should specify the source size with a byte keyword.Isodynamic

Yes, there is a difference in performance.

The best choice for comparing a register with zero is test reg, reg. It sets FLAGS the same way cmp reg,0 would, and is at least as fast¹ as any other way, with smaller code-size.
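
For the question's AL example, that looks like (same done label as in the question):

test al, al     ; sets ZF exactly as cmp al, 0 would
jz done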

(Even better is when ZF is already set appropriately by the instruction that set reg so you can just branch, setcc, or cmovcc directly. For example, the bottom of a normal loop often looks like dec ecx / jnz .loop_top. Most x86 integer instructions "set flags according to the result", including ZF=1 if the output was 0.)

or reg,reg can't macro-fuse with a JCC into a single uop on any existing x86 CPUs, and adds latency for anything that later reads reg because it rewrites the value into the register. cmp's downside is usually just code-size.


The FLAGS results of test reg,reg / and reg,reg / or reg,reg are identical to cmp reg, 0 in all cases (except for AF) because:

  • CF = OF = 0 because test/and always do that, and for cmp because subtracting zero can't overflow or carry.
  • ZF, SF, PF set according to the result (i.e. reg): reg&reg for test, or reg - 0 for cmp.

(AF is undefined after test, but set according to the result for cmp. I'm ignoring it because it's really obscure: the only instructions that read AF are the BCD-adjust instructions like AAS and DAA, plus lahf / pushf which save the whole FLAGS.)

You can of course check conditions other than reg == 0 (ZF), e.g. test for negative signed integers by looking at SF. But fun fact: jl, the signed less-than condition, is more efficient than js on some CPUs after a cmp. They're equivalent after compare against zero because OF=0 so the l condition (SF!=OF) is equivalent to SF.

Every CPU that can macro-fuse TEST/JL can also macro-fuse TEST/JS, even Core 2. But after CMP byte [mem], 0, always use JL not JS to branch on the sign bit because Core 2 can't macro-fuse that. (At least in 32-bit mode; Core 2 can't macro-fuse at all in 64-bit mode).

The signed-compare conditions also let you do stuff like jle or jg, looking at ZF as well as SF!=OF.
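
For example, branching on the sign of a value already in a register (hypothetical label):

test eax, eax       ; OF is cleared, SF is set from the result
jl  is_negative     ; same condition as js here (SF != OF with OF = 0), and fuses at least as widely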


test is shorter to encode than cmp with immediate 0, in all cases except the cmp al, imm8 special case which is still two bytes.
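
For reference, the encodings an assembler will normally pick:

test al, al         ; 84 C0       (2 bytes)
cmp  al, 0          ; 3C 00       (2 bytes: the AL, imm8 special case)
test eax, eax       ; 85 C0       (2 bytes)
cmp  eax, 0         ; 83 F8 00    (3 bytes: sign-extended imm8)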

Even then, test is preferable for macro-fusion reasons (with jle and similar on Core2), and because having no immediate at all can possibly help uop-cache density by leaving a slot that another instruction can borrow if it needs more space (SnB-family).


Macro-fusion of test/jcc into a single uop in the decoders

The decoders in Intel and AMD CPUs can internally macro-fuse test and cmp with some conditional branch instructions into a single compare-and-branch operation. This gives you a max throughput of 5 instructions per cycle when macro-fusion happens, vs. 4 without macro-fusion. (For Intel CPUs since Core2.)

Recent Intel CPUs can macro-fuse some instructions (like and and add/sub) as well as test and cmp, but or is not one of them. AMD CPUs can only merge test and cmp with a JCC. See x86_64 - Assembly - loop conditions and out of order, or just refer directly to Agner Fog's microarch docs for the details of which CPU can macro-fuse what. test can macro-fuse in some cases where cmp can't, e.g. with js.
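
A sketch of what that means for uop counts (using the question's done label):

; on CPUs that support macro-fusion, this pair decodes to a single compare-and-branch uop:
test ecx, ecx
jz   done

; this pair never fuses (always 2 uops), and the OR also rewrites ECX:
or   ecx, ecx
jz   done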

Almost all simple ALU ops (bitwise boolean, add/sub, etc.) run in a single cycle. They all have the same "cost" in tracking them through the out-of-order execution pipeline. Intel and AMD spend the transistors to make fast execution units to add/sub/whatever in a single cycle. Yes, bitwise OR or AND is simpler, and probably uses slightly less power, but still can't run any faster than one clock cycle.


or reg, reg adds another cycle of latency to the dependency chain for following instructions that need to read the register. It's an x |= x in the chain of operations that lead to the value you want later.

You might think that extra register write would also need an extra physical register-file (PRF) entry vs. test, but that's probably not the case. (See https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ for more about PRF capacity impact on out-of-order exec).

test has to produce its FLAGS output somewhere. On Intel Sandybridge-family CPUs at least, when an instruction produces a register and a FLAGS result, both of them are stored together in the same PRF entry. (Source: an Intel patent I think. This is from memory but seems like an obviously sane design since most x86 ALU instructions write FLAGs.)

An instruction like cmp or test that only produces a FLAGS result also needs a PRF entry for its output. Presumably this is slightly worse: the old physical register is still "alive", referenced as the holder of the value of the architectural register written by some older instruction. And now architectural EFLAGS (actually both the separately-renamed CF and SPAZO flag groups) point to this new physical register in the RAT (register allocation table) updated by the renamer. Of course, the next FLAGS-writing instruction will overwrite that, allowing that PR to be freed once all its readers have read it and executed. This is not something I think about when optimizing, and I don't think it tends to matter in practice.


Footnote 1: P6-family register-read stalls: possible upside to or reg,reg

This only applies to obsolete P6-family CPUs (Intel up to Nehalem, replaced by Sandybridge-family in 2011). On other CPUs, there's no benefit to rewriting a register with itself with or same,same instead of just reading it with test same,same.

P6-family CPUs (PPro / PII to Nehalem) have a limited number of register-read ports for the issue/rename stage to read "cold" values (not forwarded from an in-flight instruction) from the permanent register file, but recently-written values are available directly from the ROB. Rewriting a register unnecessarily can make it live in the forwarding network again to help avoid register-read stalls. (See Agner Fog's microarch pdf).

Re-writing a register with the same value on purpose to keep it "hot" can actually be an optimization for some cases of surrounding code, on P6. Early P6 family CPUs couldn't do macro-fusion at all, so you aren't even missing out on that by using and reg,reg instead of test. But Core 2 (in 32-bit mode) and Nehalem (in any mode) can macro-fuse test/jcc so you're missing out on that.

(and is equivalent to or for this purpose on P6-family, but less bad if your code ever runs on a Sandybridge-family CPU: it can macro-fuse and/jcc but not or/jcc. The extra cycle of latency in the dep-chain for the register is still a disadvantage on P6, especially if the critical path involving it is the main bottleneck.)

P6 family is very much obsolete these days (Sandybridge replaced it in 2011), and CPUs before Core 2 (Core, Pentium M, PIII, PII, PPro) are very obsolete and getting into retrocomputing territory, especially for anything where performance matters. You can ignore P6-family when optimizing unless you have a specific target machine in mind (e.g. if you have a crusty old Nehalem Xeon machine) or you're tuning a compiler's -mtune=nehalem settings for the few users still left.

If you're tuning something to be fast on Core 2 / Nehalem, use test unless profiling shows that register-read stalls are a big problem in a specific case, and using and actually fixes it.

On earlier P6-family, and reg,reg might be ok as your default code-gen choice when the value isn't part of a problematic loop-carried dep chain, but is read later. Or if it is, but there's also a specific register-read stall that you can fix with and reg,reg.

If you only want to test the low 8 bits of a full register, test al,al avoids writing a partial register, which on P6-family is renamed separately from the full EAX/RAX. or al,al is much worse if you later read EAX or AX: partial-register stall on P6-family. (Why doesn't GCC use partial registers?)
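
A sketch of the difference (hypothetical surrounding code):

or   al, al      ; writes AL, which P6-family renames separately from EAX
add  edx, eax    ; reading the full EAX afterwards stalls while AL is merged back in

test al, al      ; only reads AL and writes FLAGS: no partial-register write, no stall
add  edx, eax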


History of the unfortunate or reg,reg idiom

The or reg,reg idiom may have come from 8080 ORA A, as pointed out in a comment.

8080's instruction set doesn't have a test instruction, so your choices for setting flags according to a value included ORA A and ANA A. (Notice that the A register destination is baked into the mnemonic for both of those instructions, and there aren't instructions to OR into different registers: it's a 1-address machine except for mov, while 8086 is a 2-address machine for most instructions.)

8080 ORA A was the usual go-to way to do it, so presumably that habit carried over into 8086 assembly programming as people ported their asm sources. (Or used automatic tools; 8086 was intentionally designed for easy / automatic asm-source porting from 8080 code.)

This bad idiom continues to be blindly used by beginners, presumably taught by people who learned it back in the day and passed it on without thinking about the obvious critical path latency downside for out-of-order execution. (Or the other more subtle problems like no macro-fusion.)


Delphi's compiler reportedly uses or eax,eax, which was maybe a reasonable choice at the time (before Core 2), assuming that register-read stalls were more important than lengthening the dep chain for whatever reads it next. IDK if that's true or they were just using the ancient idiom without thinking about it.

Unfortunately, compiler-writers at the time didn't know the future, because and eax,eax performs exactly equivalently to or eax,eax on Intel P6-family, but is less bad on other uarches because and can macro-fuse on Sandybridge-family. (See the P6 section above).


Value in memory: maybe use cmp or load it into a reg.

To test a value in memory, you can cmp dword [mem], 0, but Intel CPUs can't macro-fuse flag-setting instructions that have both an immediate and a memory operand. If you're going to use the value after the compare in one side of the branch, you should mov eax, [mem] / test eax,eax or something. If not, either way is 2 front-end uops, but it's a tradeoff between code-size and back-end uop count.
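
A sketch of the two options (placeholder [mem] address and hypothetical label):

cmp dword [mem], 0   ; smaller code, but immediate + memory operand means no macro-fusion
jne nonzero

mov eax, [mem]       ; load first if you'll want the value anyway
test eax, eax        ; macro-fuses with the branch
jnz nonzero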

Although note that some addressing modes won't micro-fuse either on SnB-family: RIP-relative + immediate won't micro-fuse in the decoders, and an indexed addressing mode will un-laminate after the uop cache. Either way, that leads to 3 fused-domain uops for cmp dword [rsi + rcx*4], 0 / jne, or the same with [rel some_static_location].

On i7-6700k Skylake (tested with perf events uops_issued.any and uops_executed.thread):

  • mov reg, [mem] (or movzx) + test reg,reg / jnz: 2 uops in both fused and unfused domains, regardless of addressing mode or whether it's movzx instead of mov. Nothing to micro-fuse; does macro-fuse.
  • cmp byte [rip+static_var], 0 + jne: 3 fused, 3 unfused (front and back end). The RIP-relative + immediate combination prevents micro-fusion, and it also doesn't macro-fuse. Smaller code-size but less efficient.
  • cmp byte [rsi + rdi], 0 (indexed addr mode) / jne: 3 fused, 3 unfused. Micro-fuses in the decoders, but un-laminates at issue/rename. Doesn't macro-fuse.
  • cmp byte [rdi + 16], 0 + jne: 2 fused, 3 unfused uops. Micro-fusion of the cmp load+ALU did happen because of the simple addressing mode, but the immediate prevents macro-fusion. About as good as load + test + jnz: smaller code-size but 1 extra back-end uop.

If you have a 0 in a register (or a 1 if you want to compare a bool), you can cmp [mem], reg / jne for even fewer uops, as low as 1 fused-domain, 2 unfused. But RIP-relative addressing modes still don't macro-fuse.
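
For example (hypothetical register choice and label):

xor ecx, ecx         ; a zero to compare against, reusable across many compares
cmp [rdi + 16], ecx  ; register source: the load can micro-fuse and the cmp/jne can macro-fuse
jne nonzero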

Compilers tend to use load + test/jcc even when the value isn't used later.

You could also test a value in memory with test dword [mem], -1, but don't. Since test r/m16/32/64, sign-extended-imm8 isn't available, it's worse code-size than cmp for anything larger than a byte. (I think the design idea was that if you only want to test the low bit of a register, just test cl, 1 instead of test ecx, 1, and use cases like test ecx, 0xfffffff0 are rare enough that it wasn't worth spending an opcode. Especially since that decision was made for 8086 with 16-bit code, where it was only the difference between an imm8 and imm16, not imm32.)
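
To illustrate the encoding difference (32-bit operand size, hypothetical address register):

test dword [edi], -1   ; F7 07 FF FF FF FF   (6 bytes: TEST has no sign-extended imm8 form)
cmp  dword [edi], 0    ; 83 3F 00            (3 bytes: CMP does)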

(I wrote -1 rather than 0xFFFFFFFF so it would be the same with byte or qword. ~0 would be another way to write it.)


Spearing answered 15/11, 2015 at 20:42 Comment(10)
I usually think in terms of number of micro-ops instead of instructions. A folded instruction is really two operations with two micro-ops (that count as one micro-op). On Haswell I did six micro-ops (or operations)/clock cycle but five instructions/cycle. I don't know what the maximum micro-ops/clock cycle is, but it's at least six. I guess I mean the number of operations/cycle is more interesting. I'm not really disagreeing with anything you wrote.Erigena
@Zboson: I usually think in terms of fused-domain uops. I also consider execution ports when it's relevant, but if there are load/stores involved you're often limited by the frontend / pipeline width (4 uops / clock), not execution resources. (Assuming of course you're not limited by dep chains or cache misses.) I only pointed out instructions / clock as a way of explaining why getting macro-fusion to happen was important.Spearing
I think that the origins of OR AL,AL can be traced back to ORA A on the 8080. As the oldest part of the MSDOS API was modelled after that of CP/M to facilitate porting I can imagine lots of early DOS code was seriously influenced by code that started its existence on the 8080.Defilade
@fvu: Thanks! Yes, 8086 was specifically designed for easy asm source porting from 8080. I found this 8080 ISA ref and indeed, there's no TEST instruction, just ORA or ANA that are shorter than 2-byte CPI 0 to compare with an explicit immediate. Apparently most people picked OR instead of AND to set flags from A, probably because the mnemonic reads more easily.Spearing
@Zboson What documentation do you use to know which micro-ops to consider when thinking of a given instruction (or sequence)?Tweezers
@MikeB: uops.info is the best current source, with reliable automated testing. For older CPUs, Agner Fog's instruction tables are generally very good, and mostly free of typos... agner.org/optimize. For analyzing sequences of instructions, there's Intel's IACA (end-of-lifed) What is IACA and how do I use it?, and the open source LLVM-MCA llvm.org/docs/CommandGuide/llvm-mca.htmlSpearing
"Compilers tend to use load + test/jcc even when the )" -- This line is incomplete.Isodynamic
@ecm: thanks for proof reading! IIRC, I meant to say "even when the value isn't used later". Pesky ADHD, I bounced around a lot editing different parts of this answer instead of finishing a thought in one place :PSpearing
RE: Re-writing a register with the same value on purpose to keep it "hot" could you elaborate?Horlacher
@sarvel: I did, in the previous paragraph, which also linked Agner Fog's microarch guide for further details. He covers register-read stalls in various of the P6 chapters. The Core2/Nehalem version is agner.org/optimize/microarchitecture.pdf#page=112 . and eax,eax writes EAX, so later instructions that read EAX are reading the output of that instruction (which will be in the ROB for a few cycles at least), not whatever earlier instruction wrote EAX before that.Spearing