How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent

Asked 13/8, 2017 at 12:5 Answered 3/5, 2019 at 3:56

Solved assembly x86 intel cpu-architecture micro-optimization

This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on imul throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because setnz al has a dependency on the last imul.

; synthetic micro-benchmark to test partial-register renaming
    mov     ecx, 1000000000
.loop:                 ; do{
    imul    eax, eax     ; a dep chain with high latency but also high throughput
    imul    eax, eax
    imul    eax, eax

    dec     ecx          ; set ZF, independent of old ZF.  (Use sub ecx,1 on Silvermont/KNL or P4)
    setnz   al           ; ****** Does this depend on RAX as well as ZF?
    movzx   eax, al
    jnz  .loop         ; }while(ecx);

If setnz al depends on rax, the 3ximul/setcc/movzx sequence forms a loop-carried dependency chain. If not, each setcc/movzx/3ximul chain is independent, forked off from the dec that updates the loop counter. The 11c per iteration measured on HSW/SKL is perfectly explained by a latency bottleneck: 3x3c(imul) + 1c(read-modify-write by setcc) + 1c(movzx within the same register).

Off topic: avoiding these (intentional) bottlenecks

I was going for understandable / predictable behaviour to isolate partial-reg stuff, not optimal performance.

For example, xor-zero / set-flags / setcc is better anyway (in this case, xor eax,eax / dec ecx / setnz al). That breaks the dep on eax on all CPUs (except early P6-family like PII and PIII), still avoids partial-register merging penalties, and saves 1c of movzx latency. It also uses one fewer ALU uop on CPUs that handle xor-zeroing in the register-rename stage. See that link for more about using xor-zeroing with setcc.

Note that AMD, Intel Silvermont/KNL, and P4, don't do partial-register renaming at all. It's only a feature in Intel P6-family CPUs and its descendant, Intel Sandybridge-family, but seems to be getting phased out.

gcc unfortunately does tend to use cmp / setcc al / movzx eax,al where it could have used xor instead of movzx (Godbolt compiler-explorer example), while clang uses xor-zero/cmp/setcc unless you combine multiple boolean conditions like count += (a==b) | (a==~b).

The xor/dec/setnz version runs at 3.0c per iteration on Skylake, Haswell, and Core2 (bottlenecked on imul throughput). xor-zeroing breaks the dependency on the old value of eax on all out-of-order CPUs other than PPro/PII/PIII/early-Pentium-M (where it still avoids partial-register merging penalties but doesn't break the dep). Agner Fog's microarch guide describes this. Replacing the xor-zeroing with mov eax,0 slows it down to one per 4.78 cycles on Core2: 2-3c stall (in the front-end?) to insert a partial-reg merging uop when imul reads eax after setnz al.

Also, I used movzx eax, al which defeats mov-elimination, just like mov rax,rax does. (IvB, HSW, and SKL can rename movzx eax, bl with 0 latency, but Core2 can't). This makes everything equal across Core2 / SKL, except for the partial-register behaviour.

The Core2 behaviour is consistent with Agner Fog's microarch guide, but the HSW/SKL behaviour isn't. From section 11.10 for Skylake, and same for previous Intel uarches:

Different parts of a general purpose register can be stored in different temporary registers in order to remove false dependences.

He unfortunately doesn't have time to do detailed testing for every new uarch to re-test assumptions, so this change in behaviour slipped through the cracks.

Agner does describe a merging uop being inserted (without stalling) for high8 registers (AH/BH/CH/DH) on Sandybridge through Skylake, and for low8/low16 on SnB. (I've unfortunately been spreading mis-information in the past, and saying that Haswell can merge AH for free. I skimmed Agner's Haswell section too quickly, and didn't notice the later paragraph about high8 registers. Let me know if you see my wrong comments on other posts, so I can delete them or add a correction. I will try to at least find and edit my answers where I've said this.)

My actual questions: How exactly do partial registers really behave on Skylake?

Is everything the same from IvyBridge to Skylake, including the high8 extra latency?

Intel's optimization manual is not specific about which CPUs have false dependencies for what (although it does mention that some CPUs have them), and leaves out things like reading AH/BH/CH/DH (high8 registers) adding extra latency even when they haven't been modified.

If there's any P6-family (Core2/Nehalem) behaviour that Agner Fog's microarch guide doesn't describe, that would be interesting too, but I should probably limit the scope of this question to just Skylake or Sandybridge-family.

My Skylake test data, from putting %rep 4 short sequences inside a small dec ebp/jnz loop that runs 100M or 1G iterations. I measured cycles with Linux perf the same way as in my answer here, on the same hardware (desktop Skylake i7 6700k).

Unless otherwise noted, each instruction runs as 1 fused-domain uop, using an ALU execution port. (Measured with ocperf.py stat -e ...,uops_issued.any,uops_executed.thread). This detects (absence of) mov-elimination and extra merging uops.

The "4 per cycle" cases are an extrapolation to the infinitely-unrolled case. Loop overhead takes up some of the front-end bandwidth, but anything better than 1 per cycle is an indication that register-renaming avoided the write-after-write output dependency, and that the uop isn't handled internally as a read-modify-write.

Writing to AH only: prevents the loop from executing from the loopback buffer (aka the Loop Stream Detector (LSD)). Counts for lsd.uops are exactly 0 on HSW, and tiny on SKL (around 1.8k) and don't scale with the loop iteration count. Probably those counts are from some kernel code. When loops do run from the LSD, lsd.uops ~= uops_issued to within measurement noise. Some loops alternate between LSD or no-LSD (e.g when they might not fit into the uop cache if decode starts in the wrong place), but I didn't run into that while testing this.

repeated mov ah, bh and/or mov ah, bl runs at 4 per cycle. It takes an ALU uop, so it's not eliminated like mov eax, ebx is.
repeated mov ah, [rsi] runs at 2 per cycle (load throughput bottleneck).
repeated mov ah, 123 runs at 1 per cycle. (A dep-breaking xor eax,eax inside the loop removes the bottleneck.)
repeated setz ah or setc ah runs at 1 per cycle. (A dep-breaking xor eax,eax lets it bottleneck on p06 throughput for setcc and the loop branch.)

Why does writing ah with an instruction that would normally use an ALU execution unit have a false dependency on the old value, while mov r8, r/m8 doesn't (for reg or memory src)? (And what about mov r/m8, r8? Surely it doesn't matter which of the two opcodes you use for reg-reg moves?)
repeated add ah, 123 runs at 1 per cycle, as expected.
repeated add dh, cl runs at 1 per cycle.
repeated add dh, dh runs at 1 per cycle.
repeated add dh, ch runs at 0.5 per cycle. Reading [ABCD]H is special when they're "clean" (in this case, RCX is not recently modified at all).

Terminology: All of these leave AH (or DH) "dirty", i.e. in need of merging (with a merging uop) when the rest of the register is read (or in some other cases). i.e. that AH is renamed separately from RAX, if I'm understanding this correctly. "clean" is the opposite. There are many ways to clean a dirty register, the simplest being inc eax or mov eax, esi.

Writing to AL only: These loops do run from the LSD: uops_issue.any ~= lsd.uops.

repeated mov al, bl runs at 1 per cycle. An occasional dep-breaking xor eax,eax per group lets OOO execution bottleneck on uop throughput, not latency.
repeated mov al, [rsi] runs at 1 per cycle, as a micro-fused ALU+load uop. (uops_issued=4G + loop overhead, uops_executed=8G + loop overhead). A dep-breaking xor eax,eax before a group of 4 lets it bottleneck on 2 loads per clock.
repeated mov al, 123 runs at 1 per cycle.
repeated mov al, bh runs at 0.5 per cycle. (1 per 2 cycles). Reading [ABCD]H is special.
xor eax,eax + 6x mov al,bh + dec ebp/jnz: 2c per iter, bottleneck on 4 uops per clock for the front-end.
repeated add dl, ch runs at 0.5 per cycle. (1 per 2 cycles). Reading [ABCD]H apparently creates extra latency for dl.
repeated add dl, cl runs at 1 per cycle.

I think a write to a low-8 reg behaves as a RMW blend into the full reg, like add eax, 123 would be, but it doesn't trigger a merge if ah is dirty. So (other than ignoring AH merging) it behaves the same as on CPUs that don't do partial-reg renaming at all. It seems AL is never renamed separately from RAX?

inc al/inc ah pairs can run in parallel.
mov ecx, eax inserts a merging uop if ah is "dirty", but the actual mov is renamed. This is what Agner Fog describes for IvyBridge and later.
repeated movzx eax, ah runs at one per 2 cycles. (Reading high-8 registers after writing full regs has extra latency.)
movzx ecx, al has zero latency and doesn't take an execution port on HSW and SKL. (Like what Agner Fog describes for IvyBridge, but he says HSW doesn't rename movzx).
movzx ecx, cl has 1c latency and takes an execution port. (mov-elimination never works for the same,same case, only between different architectural registers.)

A loop that inserts a merging uop every iteration can't run from the LSD (loop buffer)?

I don't think there's anything special about AL/AH/RAX vs. B*, C*, DL/DH/RDX. I have tested some with partial regs in other registers (even though I'm mostly showing AL/AH for consistency), and have never noticed any difference.

How can we explain all of these observations with a sensible model of how the microarch works internally?

Related: Partial flag issues are different from partial register issues. See INC instruction vs ADD 1: Does it matter? for some super-weird stuff with shr r32,cl (and even shr r32,2 on Core2/Nehalem: don't read flags from a shift other than by 1).

See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs for partial-flag stuff in adc loops.

Harrison answered 13/8, 2017 at 12:5 Comment(3)

It was hard writing this up in a Q&A format. That took longer than the actual experimentation. But I think I managed to create something that could usefully be answered by someone else, with a question that isn't too simple. IDK if putting most of this into the answer would have been "better", but I wanted the question title to summarize the important part. – Harrison 13/8, 2017 at 18:7

Your labor of love was certainly useful, though. This cleared up some confusion I had. I was not aware that HSW/SKL no longer issued the merge uop after ALU ops writing partial registers. My copy of the manual from May 2020 states, "Beginning with Sandy Bridge microarchitecture and all subsequent generations of Intel Core microarchitecture, partial register access is handled in hardware by inserting a micro-op that merges the partial register with the full register in the following cases" (emphasis mine). It fails to clarify that this applies to MOV but not to other instructions. – Kokanee 11/2, 2021 at 17:3

Why does repeated mov al, 123 runs at 1 per cycle? but movl eax, 123 repeated runs at 4cycles / iteration? Nevermind, its because mov al, 123 is not dependency breaking. – Celt 1/6, 2021 at 21:16

Other answers welcome to address Sandybridge and IvyBridge in more detail. I don't have access to that hardware.

I haven't found any partial-reg behaviour differences between HSW and SKL. On Haswell and Skylake, everything I've tested so far supports this model:

AL is never renamed separately from RAX (or r15b from r15). So if you never touch the high8 registers (AH/BH/CH/DH), everything behaves exactly like on a CPU with no partial-reg renaming (e.g. AMD).

Write-only access to AL merges into RAX, with a dependency on RAX. For loads into AL, this is a micro-fused ALU+load uop that executes on p0156, which is one of the strongest pieces of evidence that it's truly merging on every write, and not just doing some fancy double-bookkeeping as Agner speculated.

Agner (and Intel) say Sandybridge can require a merging uop for AL, so it probably is renamed separately from RAX. For SnB, Intel's optimization manual (section 3.5.2.4 Partial Register Stalls) says

SnB (not necessarily later uarches) inserts a merging uop in the following cases:

After a write to one of the registers AH, BH, CH or DH and before a following read of the 2-, 4- or 8-byte form of the same register. In these cases a merge micro-op is inserted. The insertion consumes a full allocation cycle in which other micro-ops cannot be allocated.

After a micro-op with a destination register of 1 or 2 bytes, which is not a source of the instruction (or the register's bigger form), and before a following read of a 2-,4- or 8-byte form of the same register. In these cases the merge micro-op is part of the flow.

I think they're saying that on SnB, add al,bl will RMW the full RAX instead of renaming it separately, because one of the source registers is (part of) RAX. My guess is that this doesn't apply for a load like mov al, [rbx + rax]; rax in an addressing mode probably doesn't count as a source.

I haven't tested whether high8 merging uops still have to issue/rename on their own on HSW/SKL. That would make the front-end impact equivalent to 4 uops (since that's the issue/rename pipeline width).

There is no way to break a dependency involving AL without writing EAX/RAX. xor al,al doesn't help, and neither does mov al, 0.
movzx ebx, al has zero latency (renamed), and needs no execution unit. (i.e. mov-elimination works on HSW and SKL). It triggers merging of AH if it's dirty, which I guess is necessary for it to work without an ALU. It's probably not a coincidence that Intel dropped low8 renaming in the same uarch that introduced mov-elimination. (Agner Fog's micro-arch guide has a mistake here, saying that zero-extended moves are not eliminated on HSW or SKL, only IvB.)
movzx eax, al is not eliminated at rename. mov-elimination on Intel never works for same,same. mov rax,rax isn't eliminated either, even though it doesn't have to zero-extend anything. (Although there'd be no point to giving it special hardware support, because it's just a no-op, unlike mov eax,eax). Anyway, prefer moving between two separate architectural registers when zero-extending, whether it's with a 32-bit mov or an 8-bit movzx.
movzx eax, bx is not eliminated at rename on HSW or SKL. It has 1c latency and uses an ALU uop. Intel's optimization manual only mentions zero-latency for 8-bit movzx (and points out that movzx r32, high8 is never renamed).

High-8 regs can be renamed separately from the rest of the register, and do need merging uops.

Write-only access to ah with mov ah, reg8 or mov ah, [mem8] do rename AH, with no dependency on the old value, unlike with mov-immediate. These are both instructions that wouldn't normally need an ALU uop for the 32-bit version. (But mov ah, bl is not eliminated; it does need a p0156 ALU uop so that might be a coincidence).
a RMW of AH (like inc ah) dirties it (so it's renamed separately and needs a merging uop if AX, EAX, or RAX is read later, including as part of writing AX.)
setcc ah depends on the old ah, but still dirties it. I think mov ah, imm8 is the same, but haven't tested as many corner cases.

(Unexplained: a loop involving setcc ah can sometimes run from the LSD, see the rcr loop at the end of this post. Maybe as long as ah is clean at the end of the loop, it can use the LSD?).

If ah is dirty, setcc ah merges into the renamed ah, rather than forcing a merge into rax. e.g. %rep 4 (inc al / test ebx,ebx / setcc ah / inc al / inc ah) generates no merging uops, and only runs in about 8.7c (latency of 8 inc al slowed down by resource conflicts from the uops for ah. Also the inc ah / setcc ah dep chain).

I think what's going on here is that setcc r8 is always implemented as a read-modify-write. Intel probably decided that it wasn't worth having a write-only setcc uop to optimize the setcc ah case, since it's very rare for compiler-generated code to setcc ah. (But see the godbolt link in the question: clang4.0 with -m32 will do so.)

Reading AX, EAX, or RAX triggers a merge uop (which takes up front-end issue/rename bandwidth). Probably the RAT (Register Allocation Table) tracks the high-8-dirty state for the architectural R[ABCD]X, and even after a write to AH retires, the AH data is stored in a separate physical register from RAX. Even with 256 NOPs between writing AH and reading EAX, there is an extra merging uop. (ROB size=224 on SKL, so this guarantees that the mov ah, 123 was retired). Detected with uops_issued/executed perf counters, which clearly show the difference.
Read-modify-write of AL (e.g. inc al) merges for free, as part of the ALU uop. (Only tested with a few simple uops, like add/inc, not div r8 or mul r8). Again, no merging uop is triggered even if AH is dirty.
Write-only to EAX/RAX (like lea eax, [rsi + rcx] or xor eax,eax) clears the AH-dirty state (no merging uop).
Write-only to AX (mov ax, 1 or mov ax, bx) triggers a merge of AH first. I guess instead of special-casing this, it runs like any other RMW of AX/RAX.
xor ah,ah has 1c latency, is not dep-breaking, and still needs an execution port. mov ah,0 is the same; like with setcc, perhaps mov-immediate to 8-bit reg isn't special-cased for high-8 registers, unlike with mov reg,reg
Read and/or write of AL does not force a merge, so AH can stay dirty (and be used independently in a separate dep chain). (e.g. add ah, cl / add al, dl can run at 2 IPC, one pair per clock (bottlenecked on add latency).

Making AH dirty prevents a loop from running from the LSD (the loop-buffer), even when there are no merging uops. The LSD is when the CPU recycles uops in the queue that feeds the issue/rename stage. (Called the IDQ).

Inserting merging uops is a bit like inserting stack-sync uops for the stack-engine. Intel's optimization manual says that SnB's LSD can't run loops with mismatched push/pop, which makes sense, but it implies that it can run loops with balanced push/pop. That's not what I'm seeing on SKL: even balanced push/pop prevents running from the LSD (e.g. push rax / pop rdx / times 6 imul rax, rdx. (There may be a real difference between SnB's LSD and HSW/SKL: SnB may just "lock down" the uops in the IDQ instead of repeating them multiple times, so a 5-uop loop takes 2 cycles to issue instead of 1.25.) Anyway, it appears that HSW/SKL can't use the LSD when a high-8 register is dirty, or when it contains stack-engine uops.

This behaviour may be related to a an erratum in SKL:

SKL150: Short Loops Which Use AH/BH/CH/DH Registers May Cause Unpredictable System Behaviour

Problem: Under complex micro-architectural conditions, short loops of less than 64 instruction that use AH, BH, CH, or DH registers as well as their corresponding wider registers (e.g. RAX, EAX, or AX for AH) may cause unpredictable system behaviour. This can only happen when both logical processors on the same physical processor are active.

This may also be related to Intel's optimization manual statement that SnB at least has to issue/rename an AH-merge uop in a cycle by itself. That's a weird difference for the front-end.

My Linux kernel log says microcode: sig=0x506e3, pf=0x2, revision=0x84. Arch Linux's intel-ucode package just provides the update, you have to edit config files to actually have it loaded. So my Skylake testing was on an i7-6700k with microcode revision 0x84, which doesn't include the fix for SKL150. It matches the Haswell behaviour in every case I tested, IIRC. (e.g. both Haswell and my SKL can run the setne ah / add ah,ah / rcr ebx,1 / mov eax,ebx loop from the LSD). I have HT enabled (which is a pre-condition for SKL150 to manifest), but I was testing on a mostly-idle system so my thread had the core to itself.

With updated microcode, the LSD is completely disabled for everything all the time, not just when partial registers are active. lsd.uops is always exactly zero, including for real programs not synthetic loops. Hardware bugs (rather than microcode bugs) often require disabling a whole feature to fix. This is why SKL-avx512 (SKX) is reported to not have a loopback buffer. Fortunately this is not a performance problem: SKL's increased uop-cache throughput over Broadwell can almost always keep up with issue/rename.

Extra AH/BH/CH/DH latency:

Reading AH when it's not dirty (renamed separately) adds an extra cycle of latency for both operands. e.g. add bl, ah has a latency of 2c from input BL to output BL, so it can add latency to the critical path even if RAX and AH are not part of it. (I've seen this kind of extra latency for the other operand before, with vector latency on Skylake, where an int/float delay "pollutes" a register forever. TODO: write that up.)

This means unpacking bytes with movzx ecx, al / movzx edx, ah has extra latency vs. movzx/shr eax,8/movzx, but still better throughput.

Reading AH when it is dirty doesn't add any latency. (add ah,ah or add ah,dh/add dh,ah have 1c latency per add). I haven't done a lot of testing to confirm this in many corner-cases.

Hypothesis: a dirty high8 value is stored in the bottom of a physical register. Reading a clean high8 requires a shift to extract bits [15:8], but reading a dirty high8 can just take bits [7:0] of a physical register like a normal 8-bit register read.

Extra latency doesn't mean reduced throughput. This program can run at 1 iter per 2 clocks, even though all the add instructions have 2c latency (from reading DH, which is not modified.)

global _start
_start:
    mov     ebp, 100000000
.loop:
    add ah, dh
    add bh, dh
    add ch, dh
    add al, dh
    add bl, dh
    add cl, dh
    add dl, dh

    dec ebp
    jnz .loop

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)

 Performance counter stats for './testloop':

     48.943652      task-clock (msec)         #    0.997 CPUs utilized          
             1      context-switches          #    0.020 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
             3      page-faults               #    0.061 K/sec                  
   200,314,806      cycles                    #    4.093 GHz                    
   100,024,930      branches                  # 2043.675 M/sec                  
   900,136,527      instructions              #    4.49  insn per cycle         
   800,219,617      uops_issued_any           # 16349.814 M/sec                 
   800,219,014      uops_executed_thread      # 16349.802 M/sec                 
         1,903      lsd_uops                  #    0.039 M/sec                  

   0.049107358 seconds time elapsed

Some interesting test loop bodies:

%if 1
     imul eax,eax
     mov  dh, al
     inc dh
     inc dh
     inc dh
;     add al, dl
    mov cl,dl
    movzx eax,cl
%endif

Runs at ~2.35c per iteration on both HSW and SKL.  reading `dl` has no dep on the `inc dh` result.  But using `movzx eax, dl` instead of `mov cl,dl` / `movzx eax,cl` causes a partial-register merge, and creates a loop-carried dep chain.  (8c per iteration).


%if 1
    imul  eax, eax
    imul  eax, eax
    imul  eax, eax
    imul  eax, eax
    imul  eax, eax         ; off the critical path unless there's a false dep

  %if 1
    test  ebx, ebx          ; independent of the imul results
    ;mov   ah, 123         ; dependent on RAX
    ;mov  eax,0           ; breaks the RAX dependency
    setz  ah              ; dependent on RAX
  %else
    mov   ah, bl          ; dep-breaking
  %endif

    add   ah, ah
    ;; ;inc   eax
;    sbb   eax,eax

    rcr   ebx, 1      ; dep on  add ah,ah  via CF
    mov   eax,ebx     ; clear AH-dirty

    ;; mov   [rdi], ah
    ;; movzx eax, byte [rdi]   ; clear AH-dirty, and remove dep on old value of RAX
    ;; add   ebx, eax          ; make the dep chain through AH loop-carried
%endif

The setcc version (with the %if 1) has 20c loop-carried latency, and runs from the LSD even though it has setcc ah and add ah,ah.

00000000004000e0 <_start.loop>:
  4000e0:       0f af c0                imul   eax,eax
  4000e3:       0f af c0                imul   eax,eax
  4000e6:       0f af c0                imul   eax,eax
  4000e9:       0f af c0                imul   eax,eax
  4000ec:       0f af c0                imul   eax,eax
  4000ef:       85 db                   test   ebx,ebx
  4000f1:       0f 94 d4                sete   ah
  4000f4:       00 e4                   add    ah,ah
  4000f6:       d1 db                   rcr    ebx,1
  4000f8:       89 d8                   mov    eax,ebx
  4000fa:       ff cd                   dec    ebp
  4000fc:       75 e2                   jne    4000e0 <_start.loop>

 Performance counter stats for './testloop' (4 runs):

       4565.851575      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.08% )
                 4      context-switches          #    0.001 K/sec                    ( +-  5.88% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 3      page-faults               #    0.001 K/sec                  
    20,007,739,240      cycles                    #    4.382 GHz                      ( +-  0.00% )
     1,001,181,788      branches                  #  219.276 M/sec                    ( +-  0.00% )
    12,006,455,028      instructions              #    0.60  insn per cycle           ( +-  0.00% )
    13,009,415,501      uops_issued_any           # 2849.286 M/sec                    ( +-  0.00% )
    12,009,592,328      uops_executed_thread      # 2630.307 M/sec                    ( +-  0.00% )
    13,055,852,774      lsd_uops                  # 2859.456 M/sec                    ( +-  0.29% )

       4.565914158 seconds time elapsed                                          ( +-  0.08% )

Unexplained: it runs from the LSD, even though it makes AH dirty. (At least I think it does. TODO: try adding some instructions that do something with eax before the mov eax,ebx clears it.)

But with mov ah, bl, it runs in 5.0c per iteration (imul throughput bottleneck) on both HSW/SKL. (The commented-out store/reload works, too, but SKL has faster store-forwarding than HSW, and it's variable-latency...)

 #  mov ah, bl   version
 5,009,785,393      cycles                    #    4.289 GHz                      ( +-  0.08% )
 1,000,315,930      branches                  #  856.373 M/sec                    ( +-  0.00% )
11,001,728,338      instructions              #    2.20  insn per cycle           ( +-  0.00% )
12,003,003,708      uops_issued_any           # 10275.807 M/sec                   ( +-  0.00% )
11,002,974,066      uops_executed_thread      # 9419.678 M/sec                    ( +-  0.00% )
         1,806      lsd_uops                  #    0.002 M/sec                    ( +-  3.88% )

   1.168238322 seconds time elapsed                                          ( +-  0.33% )

Notice that it doesn't run from the LSD anymore.

2023 update

Some more tests, still on i7-6700k Skylake, this time with newer microcode (version 0xf0) that permanently disabled the LSD (because of AH-merging correctness issue in corner cases, apparently.)

.loop:
   times 6 mov ah, cl   ; or as low as times 3.  But times 7 makes the loop slower, 4c / iter
    mov ax, bx          ; trigger an AH merge
    dec ebp
    jnz .loop
.end:

With times 1 or times 2 mov ah, cl: 2 cycles / iter.
(4 or 5 uops issued and executed / iter)
With times 3 to times 6 mov ah, cl: 3 cycles / iter
(6 to 10 uops issued and executed)
With times 7 to times 10 mov ah, cl: 4 cycles / iter

The bottleneck with mov ah, cl is I think front-end throughput. An AH-merging uop needing to issue in a cycle by itself would explain this, e.g. an upper limit of 5 uops in 2 cycles, with 4 in one cycle, 1 in the other. And increasing by 1 cycle for each 4 more uops. That front-end bubble for the AH-merging uop should give the uop cache time to keep up.

This also proves mov ah, cl doesn't depend on the old RAX after merge: if there was an output dependency, it couldn't run faster than 1 instruction per cycle (not counting the dec/jne).

But xor ah,ah or mov ah,0 does create a loop-carried dependency chain, exactly like inc ah, confirming that mov ax, bx has a dependency on AH. It limits speed to n+3 cycles per iteration, where n is the length of the times n mov ah,0 chain. For example, times 7 takes 10 cycles, with 10 uops issued and executed. (From 10 instructions, including a macro-fused dec/jne, so there is an extra uop in there.)

It seems the latency cost of the AH-merging uop is 1 cycle, and the mov ax, bx itself costs another one cycle to merge a new value into the bottom of RAX. The extra 1 cycle is coming when reading AH after writing RAX. Adding a mov ah, al keeps the dependency on RAX, but without reading AH.

Harrison answered 13/8, 2017 at 12:5 Comment(26)

Do you have the firmware fix for the SKL150 erratum? There is some speculation it may change the LSD behavior in some or all cases, so it's worthwhile to note if the results above are pre or post firmware update. – Velodrome 13/8, 2017 at 19:58

That one has the fix, AFAIK. – Velodrome 13/8, 2017 at 20:39

This is great. Can you provide some more detail on how you're running the tests? I'm assuming this is using perf to some extent. – Paralyse 13/8, 2017 at 21:7

It would be good to introduce what you mean by "dirty ah". It just starts getting used, but the definition isn't clear. It seems to be something like "ah has been separately renamed, and not yet merged into the full reg". I also didn't understand setcc ah depends on the old ah, but still dirties it. Does "depends" here mean logically (I don't think it does?) or apparently falsely/performance-wise? – Velodrome 13/8, 2017 at 21:22

@BeeOnRope: Thanks for the editorial feedback. I got frustrated with how long it was taking to write, and posted before doing a re-read of the whole answer once I had the question in good shape. "Dirty AH" means a merge uop will be inserted if you inc eax. The setcc ah dependency thing is an output dependency on AH, but not necessarily on the whole RAX. So yes, it's a false dep preventing setcc ah from running at 2 per clock or from starting a new dep chain for something like inc ah. I'll re-edit the answer soon. – Harrison 13/8, 2017 at 22:8

@JonathonReinhart: That detail is hidden in the question, where I mention that I use the same method as in this answer: assemble a static binary that does enough iterations of the loop to hide process-startup and sys_exit() overhead. – Harrison 13/8, 2017 at 22:10

This is [x86] investigative journalism at its best. Thanks! – Prowess 14/8, 2017 at 1:48

@BeeOnRope: Just double-checked: Arch Linux's intel-ucode package doesn't automatically enable microcode updates when you install it; you have to edit your bootloader config file. I didn't have the fix while testing on SKL, but I didn't have two threads on one core. I don't remember noticing any results where AH / LSD behaved differently on HSW vs. SKL. When I have some time after next reboot, I'll re-test and see if there are any new cases where SKL won't use the LSD. – Harrison 14/8, 2017 at 4:40

Something might have changed dramatically in this area for ICL. See this ICL Instlat dump. It shows many of the instructions with r8 destination suddenly at 1c throughput from the usual 0.25c or whatever depending on the operation. I don't know what mix of high and low registers are used in these tests, but in any case it seems like there might have been a big change here. – Velodrome 3/8, 2019 at 2:0

@BeeOnRope: Oh wow, either InstLatx64 broke their tests and created a latency bottleneck, or ICL only runs stuff like mov al, imm8 and add al, bl on one port (vs. any ALU port on Haswell and later to merge into the low 8 of a full reg). That makes little sense; clang uses 8-bit operand size pretty carelessly so you'd expect this to be a measurable slowdown. But mov r16, r16 is still 0.25c throughput and it makes no sense to drop low-8 merging EUs but not low-16 from most ports. I'd really like to see the same version of Instlat run on a Skylake to see if it's wrong there. – Harrison 3/8, 2019 at 2:19

@PeterCordes - I doubt anything changed in all those test just before running it on ICL there: all the earlier dumps look normal in this respect. I'm thinking the tests may use say both al and ah for the T tests: this would have worked at full speed on contemporary chips as you describe above (ah renamed), but that something changed in ICL to limit this to one per cycle. I could be a latency effect, e.g., maybe they use 4 parallel chains and the latency went to 4 cycles which ends up showing up as T=1. – Velodrome 3/8, 2019 at 2:50

Check out XCHG r1_8, r2_8 for more weirdness. The latency shot up to 4c and T improved to 0.33 - what? The other widths are unchanged. – Velodrome 3/8, 2019 at 2:54

@BeeOnRope: ok that's possible. I'd certainly believe that AH/BH/CH/DH changed in ICL (like maybe not renaming them, or more penalty for using them), but I really hope InstLatx64 wasn't using them! We can rule out 4 cycle latency for add al, bl and mov al, imm8 - the latency test shows them as 1c. I'm still almost certain the numbers are bogus somehow (probably accurately measuring badly-designed tests) so I'm not prepared to believe that add r8, r8 really only has 1c throughput. – Harrison 3/8, 2019 at 2:54

@BeeOnRope: we can see front-end bottlenecks from LCP stalls in add ax, imm16 and some other non-mov imm16 instructions. That's another possible source of error. AH-merging can apparently cause a front-end delay on Haswell+ so maybe something like that is happening. – Harrison 3/8, 2019 at 2:59

@PeterCordes - yeah I don't mean that add al, bl would have 4 cycle latency, but some combination of ah and al use would create a 4-cycle dependency chain, which wasn't a chain at all before. E.g., if ah isn't renamed, suddenly independent ah and al use which were independent 1-cycle chains would combine together into a 2 cycle chain at a minimum. If there are some penalties for using ah, the chain could be longer. – Velodrome 3/8, 2019 at 4:29

... but yeah I don't think add r8, r8 or the other instructions will really turn out to be 1T. – Velodrome 3/8, 2019 at 4:30

About the "Write-only access to ah with mov ah, r8" part - was this eliminated, i.e., not uop? You mention "These are both instructions that wouldn't normally need an ALU uop (for the 32-bit version)." but does the 8-bit version take a uop or not? – Velodrome 10/10, 2019 at 1:50

@BeeOnRope: good question, just tested and it's not eliminated. Updated that bullet point. It needs a p0156 uop. A loop with 7 mov ah, bl and one dec/jnz runs at 2.001c / iter and basically saturates all 4 ALU ports, proving that it's really p0156 and no false dependency. – Harrison 10/10, 2019 at 2:7

@PeterCordes Judgement question. For generic x86_64 code (thinking GLIBC's generic x86_64 memcmp) do you think it makes sense to avoid partial registers at the cost of code size or since all SnB+ handle it well (w.o a stall) it worth the bytes? – Celt 8/4, 2022 at 7:6

@Noah: If it's inside an important loop and would create a performance disaster on Nehalem or earlier (stalling the front-end for ~3 cycles), I think we should still avoid it. There are still Nehalem and even Core2 systems around. Especially if the gain is just code-size in bytes, not even uops. In a function using AVX, that rules out pre-SnB, and SnB itself handles it ok. (Modern Pentium / Celeron pre IceLake don't have AVX1 or 2 so they and Tremont etc. are the only recent CPUs using non-AVX versions of functions. Bulldozer has AVX. Except in a VM that doesn't pass through AVX.) – Harrison 8/4, 2022 at 7:12

@PeterCordes makes sense. Thinking mostly in the context of replacing subl $0xff, %eax; jnz with incb %al; jnz (or 16-bit) or movzbl (%rdi), %ecx; orl %ecx, %esi with movb (%rdi), %sil on return values w.o false dependency. Guess the latter case actually saves an instruction. But memcmp can be called in a loop (i.e qsort) so some merit to protecting throughput in those cases. – Celt 8/4, 2022 at 17:19

@Noah: Ok yeah, should be safe in memcmp-avx2-movbe.S, no P6 CPUs support AVX2. So for example, subl $0xffff, %eax (5 bytes) can be replaced with incw %ax (3 bytes), following vpmovmskb %xmm2, %eax in the L(less_vec): block. Looks like replacing instructions that use $VEC_MASK would be harder, since you need a parameterized register name instead of a number. Would make the code less readable even with a macro like INC_VECSIZE(a) which expands to inc %al or inc %ax (The reg implies the suffix, but not vice versa, you can't incb %a – Harrison 8/4, 2022 at 22:28

@PeterCordes not so concerned with memcmp-avx2-movsb as that's in the page cross case which is relatively cold (although I realize L(zero) shouldn't aligned, ill fix that this weekend). I'm refactoring memcmp so that we can drop the sse4/ssse3 versions. memcmp is the x86_64 generic impl so older targets need to be kept in mind. Leaving one partial register stall in the n={2,3} that I can't find a good way to avoid without causing the case to split a cache-line. – Celt 8/4, 2022 at 22:37

@PeterCordes Want to leave the code bloated verison in memcmp-avx2-movsb. Once I fix alignment of L(zero) target will be ideal code layout from icache and DSB/LSD perspective. Especially in cold cases tend to optimize more for ensuring minimum number of Icache lines are brought in. – Celt 8/4, 2022 at 22:43

@Noah: Yeah, good plan, that makes sense. – Harrison 8/4, 2022 at 22:44

@PeterCordes oh wow, theres actually a full cache line to be saved in memcmp-avx2-movbe :) – Celt 8/4, 2022 at 23:25

Update: Possible evidence that IvyBridge still renames low16 / low8 registers separately from the full register, like Sandybridge but unlike Haswell and later.

InstLatX64 results from SnB and IvB show 0.33c throughput for movsx r16, r8 (as expected, movsx is never eliminated and there were only 3 ALUs before Haswell).

But apparently InstLat's movsx r16, r8 test bottlenecks Haswell / Broadwell / Skylake at 1c throughput (see also this bug report on the instlat github). Probably by writing the same architectural register, creating a chain of merges.

(The actual throughput for that instruction with separate destination registers is 0.25c on my Skylake. Tested with 7 movsx instructions writing to eax..edi and r10w/r11w, all reading from cl. And a dec ebp/jnz as the loop branch to make an even 8 uop loop.)

If I'm guessing right about what created that 1c throughput result on CPUs after IvB, it's doing something like running a block of movsx dx, al. And that can only run at more than 1 IPC on CPUs that rename dx separately from RDX instead of merging. So we can conclude that IvB actually does still rename low8 / low16 registers separately from full registers, and it wasn't until Haswell that they dropped that. (But something is fishy here: if this explanation was right, we should see the same 1c throughput on AMD which doesn't rename partial registers. But we don't, see below.)

Results with ~0.33c throughput for the movsx r16, r8 (and movzx r16, r8) tests:

Haswell results with a mysterious 0.58c throughput for movsx/zx r16, r8:

A Haswell result with the same 4.3.764.0 Jul 10 2017 build of AIDA64
Haswell-E with a 2014 build

Other earlier and later Haswell (and CrystalWell) / Broadwell / Skylake results are all 1.0c throughput for those two tests.

HSW with 4.1.570.0 Jun 5 2013, BDW with 4.3.15787.0 Oct 12 2018, BDW with 4.3.739.0 Mar 17 2017.

As I reported in the linked InstLat issue on github, the "latency" numbers for movzx r32, r8 ignore mov-elimination, presumably testing like movzx eax, al.

Even worse, the newer versions of InstLatX64 with separate-registers versions of the test, like MOVSX r1_32, r2_8, show latency numbers below 1 cycle, like 0.3c for that MOVSX on Skylake. This is total nonsense; I tested just to be sure.

The MOVSX r1_16, r2_8 test does show 1c latency, so apparently they're just measuring the latency of the output (false) dependency. (Which doesn't exist for 32-bit and wider outputs).

But that MOVSX r1_16, r2_8 test measured 1c latency on Sandybridge as well! So maybe my theory was wrong about what the movsx r16, r8 test is telling us.

On Ryzen (AIDA64 build 4.3.781.0 Feb 21 2018), which we know doesn't do any partial-register renaming at all, the results don't show the 1c throughput effect that we'd expect if the test was really writing the same 16-bit register repeatedly. I don't find it on any older AMD CPUs either, with older versions of InstLatX64, like K10 or Bulldozer-family.

## Instlat Zen tests of ... something?
  43 X86     :MOVSX r16, r8                L:   0.28ns=  1.0c  T:   0.11ns=  0.40c
  44 X86     :MOVSX r32, r8                L:   0.28ns=  1.0c  T:   0.07ns=  0.25c
  45 AMD64   :MOVSX r64, r8                L:   0.28ns=  1.0c  T:   0.12ns=  0.43c
  46 X86     :MOVSX r32, r16               L:   0.28ns=  1.0c  T:   0.12ns=  0.43c
  47 AMD64   :MOVSX r64, r16               L:   0.28ns=  1.0c  T:   0.13ns=  0.45c
  48 AMD64   :MOVSXD r64, r32              L:   0.28ns=  1.0c  T:   0.13ns=  0.45c

IDK why throughput isn't 0.25 for all of them; seems weird. This might be a version of the 0.58c Haswell throughput effect. MOVZX numbers are the same, with 0.25 throughput for the no-prefixes version that reads R8 and writes an R32. Maybe there's a bottleneck on fetch/decode for larger instructions? But movsx r32, r16 is the same size as movsx r32, r8.

The separate-reg tests show the same pattern as on Intel, though, with 1c latency only for the one that has to merge. MOVZX is the same.

## Instlat Zen separate-reg tests
2252 X86     :MOVSX r1_16, r2_8            L:   0.28ns=  1.0c  T:   0.08ns=  0.28c
2253 X86     :MOVSX r1_32, r2_8            L:   0.07ns=  0.3c  T:   0.07ns=  0.25c
2254 AMD64   :MOVSX r1_64, r2_8            L:   0.07ns=  0.3c  T:   0.07ns=  0.25c
2255 X86     :MOVSX r1_32, r2_16           L:   0.07ns=  0.3c  T:   0.07ns=  0.25c

Excavator results are also pretty similar to this, but of course lower throughput.

https://www.uops.info/table.html confirms that Zen+ has the expected 0.25c throughput (and 1c latency) for MOVSX_NOREX (R16, R8), same as Instlat found with their separate-reg tests.

Perhaps InstLat's throughput test for MOVSX r16, r8 (not MOVSX r1_16, r2_8) only uses 2 or 3 dep chains, which isn't enough for modern CPUs? Or perhaps breaks the dep chain occasionally so OoO exec can overlap some?

Harrison answered 3/5, 2019 at 3:56 Comment(2)

It looks like Zen 3's behavior has changed a bit here. Zen 2 and below seem to have 0.25 inv throughput for add r8, r8 but Zen 3 is 1.0. There are several other changes too. Apparently Zen 3 is much more hetrogeneous in its ALUs (i.e., most operations were available on all 4 ALUs in Zen 1,2 but many are available on fewer in Zen 3, so maybe the byte operations got demoted in that way?). There are also some suspect results like CMP r8, r8 showing as 0.02 inverse throughput (i.e., 50 of these can execute per cycle). – Velodrome 26/3, 2021 at 23:9

Link to 5900X results. – Velodrome 26/3, 2021 at 23:9

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

High-8 regs can be renamed separately from the rest of the register, and do need merging uops.

Extra AH/BH/CH/DH latency:

2023 update

Recommended topics

Hot tags