Can x86's MOV really be "free"? Why can't I reproduce this at all?

I keep seeing people claim that the MOV instruction can be free in x86, because of register renaming.

For the life of me, I can't verify this in a single test case. Every test case I try debunks it.

For example, here's the code I'm compiling with Visual C++:

#include <limits.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    unsigned int k, l, j;
    clock_t tstart = clock();
    for (k = 0, j = 0, l = 0; j < UINT_MAX; ++j)
    {
        ++k;
        k = j;     // <-- comment out this line to remove the MOV instruction
        l += j;
    }
    fprintf(stderr, "%d ms\n", (int)((clock() - tstart) * 1000 / CLOCKS_PER_SEC));
    fflush(stderr);
    return (int)(k + j + l);
}

This produces the following assembly code for the loop (feel free to produce this however you want; you obviously don't need Visual C++):

LOOP:
    add edi,esi
    mov ebx,esi
    inc esi
    cmp esi,FFFFFFFFh
    jc  LOOP

Now I run this program several times, and I observe a pretty consistent 2% difference when the MOV instruction is removed:

Without MOV      With MOV
  1303 ms         1358 ms
  1324 ms         1363 ms
  1310 ms         1345 ms
  1304 ms         1343 ms
  1309 ms         1334 ms
  1312 ms         1336 ms
  1320 ms         1311 ms
  1302 ms         1350 ms
  1319 ms         1339 ms
  1324 ms         1338 ms

So what gives? Why isn't the MOV "free"? Is this loop too complicated for x86?
Is there a single example out there that can demonstrate MOV being free like people claim?
If so, what is it? And if not, why does everyone keep claiming MOV is free?

Tweak answered 24/5, 2017 at 22:16 Comment(12)
The "freeness" is about latency, which you are not measuring here. Also 2% on that is significantly less than a cycle, so due to "weird effects" onlyGifford
@harold: Can you explain what you mean? Or preferably give some example code that does measure latency? I have no idea how to measure anything that demonstrates anything about MOV being free. Sure, I can see it's cheaper than a cycle, but cheap doesn't mean free. Remove 10 moves in your code and your program might go ~20% faster...Tweak
I'll give a test case, but free in this context never meant literally free, it meant "about as free as a nop" - though that's not entirely accurate eitherGifford
@harold: I don't know if that's what most people who say this really mean... e.g. the latest instance was where I was just discussing this with someone and he was saying that after translation to microcode (I presume he meant micro-ops), the MOV instruction in a tight loop can get entirely removed. That seems to contradict what you're saying?Tweak
Well what does "entirely removed" even mean. Clearly it can't be removed before decoding, because it isn't even known what it is yet. Unsurprisingly the renaming trick can, at best, remove the mov during renaming and then not even always. Just by being there, the mov can't be entirely free.Gifford
@harold: Not on the first decoding, but it could be removed on subsequent decodings with a bit of caching right? It's a tight loop after all; the CPU sees all the code, so it can just jump ahead.Tweak
I guess so. From what I've heard/guessed about how it works, a technique like that can't actually be applied, but it's pretty hard to know anything for sure.Gifford
You added 25% more instructions, yet it is only 2% slower. You can't explain that away with "seems there is no MOV elimination". A 2% difference requires another explanation, like the core getting too hot and throttling back.Moyers
@HansPassant: The core getting too hot? You really think something like that is the difference between the two -- the core is getting too hot? I'm not trying to explain it "away", though; I'm trying to prove a claim. Sure, the CPU is good at optimizing, but that's not the claim I see. The claim I hear is that MOVs can get eliminated, which is theoretically possible but which I cannot reproduce anywhere.Tweak
Register renaming effectively eliminates the MOV from the back-end, meaning it consists of 0 µops, does not consume an execution port, and has 0 latency. However, the instruction itself still has to be decoded, which isn't free. Furthermore, it takes up space in the code, which means space in the cache. So no, a MOV is never truly free, because there are costs in the front-end, but it is often effectively free in the context of a larger block of code that is doing some meaningful operation. A 2% difference in execution speed is clearly far less than the one full cycle per iteration that one would naively expect the MOV to cost.Metalanguage
@CodyGray: An eliminated MOV takes up space in the ROB until it retires (same as an xor-zeroing instruction or even a NOP), on Intel hardware (Without any branch mispredictions, uops_retired.retire_slots will almost exactly match uops_issued.any). My mental model is that they enter the ROB (fused-domain) in an already-executed ready-to-retire state, with zero unfused-domain uops issued into the RS (scheduler). Presumably there's something non-trivial about not having a uop to retire for an instruction, maybe something about updating RIP or just rolling back mis-speculation...Pallet
...Anyway, this means that such instructions take up space in the out-of-order window on Intel, limiting the CPU's ability to hide latency. IDK how it works on AMD. In most cases the front-end cost is what's relevant, though. (Intel could design a CPU that worked differently, and sort of macro-fused eliminated instructions into the preceding one when possible, since we already know it's possible for cmp/jcc to be represented by a single uop. Retirement isn't a bottleneck, and fusing at issue time (not decode) would be more complexity and not worth the small gain in effective ROB capacity.)Pallet

Register-copy is never free for the front-end, only eliminated from actually executing in the back-end (with zero latency) by the issue/rename stage on the following CPUs:

  • AMD Bulldozer family for XMM vector registers, not integer.
  • AMD Zen family for integer and XMM vector registers. (And YMM in Zen2 and later)
    (See Agner Fog's microarch guide for details on low/high halves of YMM in BD / Zen 1)
  • Intel Ivy Bridge and later for integer and vector registers (except MMX)
  • Intel Goldmont and later low-power CPUs: XMM and integer
    (including Alder Lake E-cores integer/XMM but not YMM)
  • Not Intel Ice Lake: a microcode update disabled mov-elimination for general-purpose integer regs as part of working around an erratum (ICL065). XMM/YMM/ZMM mov-elimination still works. Tiger Lake is also affected, but not Rocket Lake or Alder Lake P-cores.
    uops.info results for mov r32, r32 - note the latency=0 for CPUs where it works. And note they show latency=1 on Ice Lake and Tiger Lake, so they retested after microcode updates.

Your experiment

The throughput of the loop in the question doesn't depend on the latency of MOV, or (on Haswell) the benefit of not using an execution unit.

The loop is still only 4 uops for the front-end to issue into the out-of-order back-end. (mov still has to be tracked by the out-of-order back-end even if it doesn't need an execution unit, but cmp/jc macro-fuses into a single uop).

Intel CPUs since Core 2 have had an issue width of 4 uops per clock, so the mov doesn't stop it from executing at (close to) one iteration per clock on Haswell. It would also run at one per clock on Ivybridge (with mov-elimination), but not on Sandybridge (no mov-elimination). On SnB, it would be about one iteration per 1.333 cycles, bottlenecked on ALU throughput because the mov would always need one. (SnB/IvB have only three ALU ports, while Haswell has four.)

Note that special handling in the rename stage has been a thing for x87 FXCH (swap st0 with st1) for much longer than MOV. Agner Fog lists FXCH as 0 latency on PPro/PII/PIII (first-gen P6 core).


The loop in the question has two interlocking dependency chains (the add edi,esi depends on EDI and on the loop counter ESI), which makes it more sensitive to imperfect scheduling. A 2% slowdown vs. theoretical prediction because of seemingly-unrelated instructions isn't unusual, and small variations in order of instructions can make this kind of difference. To run at exactly 1c per iter, every cycle needs to run an INC and an ADD. Since all the INCs and ADDs are dependent on the previous iteration, out-of-order execution can't catch up by running two in a single cycle. Even worse, the ADD depends on the INC in the previous cycle, which is what I meant by "interlocking", so losing a cycle in the INC dep chain also stalls the ADD dep chain.
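To make the two chains concrete, here is the same loop body with the dependency structure marked up (the comments are my annotation, not compiler output):

LOOP:
    add edi,esi          ; chain 2: needs last iteration's EDI, plus the ESI written by last iteration's INC
    mov ebx,esi          ; forks off ESI; nothing ever reads EBX, so nothing waits for this result
    inc esi              ; chain 1: loop-carried through ESI, 1 cycle per iteration
    cmp esi,FFFFFFFFh    ; macro-fuses with JC into one uop; reads the freshly-incremented ESI
    jc  LOOP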

Also, predicted-taken branches can only run on port6, so any cycle where port6 doesn't execute a cmp/jc is a cycle of lost throughput. This happens every time an INC or ADD steals a cycle on port6 instead of running on ports 0, 1, or 5. IDK if this is the culprit, or if losing cycles in the INC/ADD dep chains themselves is the problem, or maybe some of both.

Adding the extra MOV doesn't add any execution-port pressure, assuming it's eliminated 100% of the time, but it does stop the front-end from running ahead of the back-end execution units. Only 3 of the 4 uops in the loop need an execution unit, and your Haswell CPU can run INC and ADD on any of its 4 ALU ports: 0, 1, 5, and 6. So the bottlenecks are:

  • the front-end max throughput of 4 uops per clock. (The loop without MOV is only 3 uops, so the front-end can run ahead).
  • taken-branch throughput of one per clock.
  • the dependency chain involving esi (INC latency of 1 per clock)
  • the dependency chain involving edi (ADD latency of 1 per clock, and also dependent on the INC from the previous iteration)

Without the MOV, the front-end can issue the loop's three uops at 4 per clock until the out-of-order back-end is full. (AFAICT, it "unrolls" tiny loops in the loop-buffer (Loop Stream Detector: LSD), so a loop with ABC uops can issue in an ABCA BCAB CABC ... pattern. The perf counter for lsd.cycles_4_uops confirms that it mostly issues in groups of 4 when it issues any uops.)

Intel CPUs assign uops to ports as they issue into the out-of-order back-end. The decision is based on counters that track how many uops for each port are already in the scheduler (aka Reservation Station, RS). When there are lots of uops in the RS waiting to execute, this works well and should usually avoid scheduling INC or ADD to port6. And I guess also avoids scheduling the INC and ADD such that time is lost from either of those dep chains. But if the RS is empty or nearly-empty, the counters won't stop an ADD or INC from stealing a cycle on port6.

I thought I was onto something here, but any sub-optimal scheduling should let the front-end catch up and keep the back-end full. I don't think we should expect the front-end to cause enough bubbles in the pipeline to explain a 2% drop below max throughput, since the tiny loop should run from the loop buffer at a very consistent 4 per clock throughput. Maybe there's something else going on.


A real example of the benefit of mov elimination.

I used lea to construct a loop that only has one mov per clock, creating a perfect demonstration where MOV-elimination succeeds 100%, or 0% of the time with mov same,same to demonstrate the latency bottleneck that produces.

Since the macro-fused dec/jnz is part of the dependency chain involving the loop counter, imperfect scheduling can't delay it. This is different from the case where cmp/jc "forks off" from the critical-path dependency chain every iteration.

_start:
    mov     ecx, 2000000000 ; each iteration decrements by 2, so this is 1G iters
align 16  ; really align 32 makes more sense in case the uop-cache comes into play, but alignment is actually irrelevant for loops that fit in the loop buffer.
.loop:
    mov eax, ecx
    lea ecx, [rax-1]    ; we vary these two instructions

    dec ecx             ; dec/jnz macro-fuses into one uop in the decoders, on Intel
    jnz .loop

.end:
    xor edi,edi    ; edi=0
    mov eax,231    ; __NR_exit_group from /usr/include/asm/unistd_64.h
    syscall        ; sys_exit_group(0)

On Intel SnB-family, LEA with one or two components in the addressing mode runs with 1c latency (See http://agner.org/optimize/, and other links in the tag wiki).

I built and ran this as a static binary on Linux, so user-space perf-counters for the whole process are measuring just the loop with negligible startup / shutdown overhead. (perf stat is really easy compared to putting perf-counter queries into the program itself)

$ yasm -felf64 -Worphan-labels -gdwarf2 mov-elimination.asm && ld -o mov-elimination mov-elimination.o &&
  objdump -Mintel -drwC mov-elimination &&
  taskset -c 1 ocperf.py stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,uops_issued.any,uops_executed.thread  -r2 ./mov-elimination

Disassembly of section .text:

00000000004000b0 <_start>:
  4000b0:       b9 00 94 35 77          mov    ecx,0x77359400
  4000b5:       66 66 2e 0f 1f 84 00 00 00 00 00        data16 nop WORD PTR cs:[rax+rax*1+0x0]

00000000004000c0 <_start.loop>:
  4000c0:       89 c8                   mov    eax,ecx
  4000c2:       8d 48 ff                lea    ecx,[rax-0x1]
  4000c5:       ff c9                   dec    ecx
  4000c7:       75 f7                   jne    4000c0 <_start.loop>

00000000004000c9 <_start.end>:
  4000c9:       31 ff                   xor    edi,edi
  4000cb:       b8 e7 00 00 00          mov    eax,0xe7
  4000d0:       0f 05                   syscall 

perf stat -etask-clock,context-switches,page-faults,cycles,instructions,branches,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_executed_thread/ -r2 ./mov-elimination

 Performance counter stats for './mov-elimination' (2 runs):

    513.242841      task-clock:u (msec)       #    1.000 CPUs utilized    ( +-  0.05% )
             0      context-switches:u        #    0.000 K/sec                  
             1      page-faults:u             #    0.002 K/sec                  
 2,000,111,934      cycles:u                  #    3.897 GHz              ( +-  0.00% )
 4,000,000,161      instructions:u            #    2.00  insn per cycle   ( +-  0.00% )
 1,000,000,157      branches:u                # 1948.396 M/sec            ( +-  0.00% )
 3,000,058,589      uops_issued_any:u         # 5845.300 M/sec            ( +-  0.00% )
 2,000,037,900      uops_executed_thread:u    # 3896.865 M/sec            ( +-  0.00% )

   0.513402352 seconds time elapsed                                          ( +-  0.05% )

As expected, the loop runs 1G times (branches ~= 1 billion). The "extra" 111k cycles beyond 2G is overhead that's present in the other tests, too, including the one with no mov. It's not from occasional failure of mov-elimination, but it does scale with the iteration count so it's not just startup overhead. It's probably from timer interrupts, since IIRC Linux perf doesn't mess around with perf-counters while handling interrupts, and just lets them keep counting. (perf virtualizes the hardware performance counters so you can get per-process counts even when a thread migrates across CPUs.) Also, timer interrupts on the sibling logical core that shares the same physical core will perturb things a bit.

The bottleneck is the loop-carried dependency chain involving the loop counter. 2G cycles for 1G iters is 2 clocks per iteration, or 1 clock per decrement. This confirms that the length of the dep chain is 2 cycles. This is only possible if mov has zero latency. (I know it doesn't prove that there isn't some other bottleneck. It really only proves that the latency is at most 2 cycles, if you don't believe my assertion that latency is the only bottleneck. There's a resource_stalls.any perf counter, but it doesn't have many options for breaking down which microarchitectural resource was exhausted.)

The loop has 3 fused-domain uops: mov, lea, and macro-fused dec/jnz. The 3G uops_issued.any count confirms that: It counts in the fused domain, which is all of the pipeline from decoders to retirement, except for the scheduler (RS) and execution units. (macro-fused instruction-pairs stay as single uop everywhere. It's only for micro-fusion of stores or ALU+load that 1 fused-domain uop in the ROB tracks the progress of two unfused-domain uops.)

2G uops_executed.thread (unfused-domain) tells us that all the mov uops were eliminated (i.e. handled by the issue/rename stage, and placed in the ROB in an already-executed state). They still take up issue/retire bandwidth, and space in the uop cache, and code-size. They take up space in the ROB, limiting out-of-order window size. A mov instruction is never free. There are many possible microarchitectural bottlenecks besides latency and execution ports, the most important often being the 4-wide issue rate of the front-end.

On Intel CPUs, being zero latency is often a bigger deal than not needing an execution unit, especially in Haswell and later where there are 4 ALU ports. (But only 3 of them can handle vector uops, so non-eliminated vector moves would be a bottleneck more easily, especially in code without many loads or stores taking front-end bandwidth (4 fused-domain uops per clock) away from ALU uops. Also, scheduling uops to execution units isn't perfect (more like oldest-ready first), so uops that aren't on the critical path can steal cycles from the critical path.)

If we put a nop or an xor edx,edx into the loop, those would also issue but not execute on Intel SnB-family CPUs.
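For example, here's a variant of the loop above that I didn't measure, just a sketch of what the counters should show: padding it with a zeroing idiom. The xor isn't part of any dependency chain, so the cycle count shouldn't change, while the issued and executed uop counts diverge further.

.loop:
    mov eax, ecx
    lea ecx, [rax-1]
    xor edx, edx        ; zeroing idiom: issued, but handled at rename like the eliminated mov
    dec ecx
    jnz .loop

With 4 fused-domain uops per iteration, the expectation would be ~4G uops_issued.any but still ~2G uops_executed.thread and ~2G cycles for 1G iterations.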

Zero-latency mov-elimination can be useful for zero-extending from 32 to 64 bits, and for 8 to 64. (movzx eax, bl is eliminated, movzx eax, bx isn't).


Without mov-elimination

All current CPUs that support mov-elimination don't support it for mov same,same, so pick different registers for zero-extending integers from 32 to 64-bit, or vmovdqa xmm,xmm to zero-extend to YMM in a rare case where that's necessary. (Unless you need the result in the register it's already in. Bouncing to a different reg and back is normally worse.) And on Intel, the same applies for movzx eax,al for example. (AMD Ryzen doesn't mov-eliminate movzx.) Agner Fog's instruction tables show mov as always being eliminated on Ryzen, but I guess he means that it can't fail between two different regs the way it can on Intel.
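To summarize the cases discussed above in code form (Intel SnB-family behaviour; the register choices are arbitrary, and pre-IvB Intel CPUs eliminate none of these):

    mov     ecx, eax    ; different regs: eligible for elimination (and zero-extends into RCX)
    mov     eax, eax    ; same,same: never eliminated
    movzx   eax, bl     ; 8 -> 32/64 from a different reg: eliminated on Intel, not on Ryzen
    movzx   eax, al     ; same underlying reg: not eliminated on Intel
    movzx   eax, bx     ; 16-bit source: not eliminated
    vmovdqa xmm1, xmm0  ; vector reg-reg copy: eliminated on IvB and later (and zero-extends into YMM1)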

We can use this limitation to create a micro-benchmark that defeats it on purpose.

.loop:
    mov ecx, ecx        ; CPUs can't eliminate mov same,same
    lea ecx, [rcx-1]

    dec ecx
    jnz .loop

 3,000,320,972      cycles:u                  #    3.898 GHz                      ( +-  0.00% )
 4,000,000,238      instructions:u            #    1.33  insn per cycle           ( +-  0.00% )
 1,000,000,234      branches:u                # 1299.225 M/sec                    ( +-  0.00% )
 3,000,084,446      uops_issued_any:u         # 3897.783 M/sec                    ( +-  0.00% )
 3,000,058,661      uops_executed_thread:u    # 3897.750 M/sec                    ( +-  0.00% )

This takes 3G cycles for 1G iterations, because the length of the dependency chain is now 3 cycles.

The fused-domain uop count didn't change, still 3G.

What did change is that now the unfused-domain uop count is the same as fused-domain. All the uops needed an execution unit; none of the mov instructions were eliminated, so they all added 1c latency to the loop-carried dep chain.

(When there are micro-fused uops, like add eax, [rsi], the uops_executed count can be higher than uops_issued. But we don't have that.)


Without the mov at all:

.loop:
    lea ecx, [rcx-1]

    dec ecx
    jnz .loop


 2,000,131,323      cycles:u                  #    3.896 GHz                      ( +-  0.00% )
 3,000,000,161      instructions:u            #    1.50  insn per cycle         
 1,000,000,157      branches:u                # 1947.876 M/sec                  
 2,000,055,428      uops_issued_any:u         # 3895.859 M/sec                    ( +-  0.00% )
 2,000,039,061      uops_executed_thread:u    # 3895.828 M/sec                    ( +-  0.00% )

Now we're back down to 2 cycle latency for the loop-carried dep chain.

Nothing is eliminated.


I tested on a 3.9GHz i7-6700k Skylake. I get identical results on a Haswell i5-4210U (to within 40k out of 1G counts) for all the perf events. That's about the same margin of error as re-running on the same system.

Note that if I run perf as root¹, and count cycles instead of cycles:u (user-space only), it measures the CPU frequency as exactly 3.900 GHz. (IDK why Linux only obeys the BIOS settings for max turbo right after reboot, but then drops to 3.9GHz if I leave it idle for a couple of minutes. Asus Z170 Pro Gaming mobo, Arch Linux with kernel 4.10.11-1-ARCH. Saw the same thing with Ubuntu. Writing balance_performance to each of /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference from /etc/rc.local fixes it, but writing balance_power makes it drop back to 3.9GHz again later.)

1: Update: as a better alternative to running sudo perf, I set sysctl kernel.perf_event_paranoid = 0 in /etc/sysctl.d/99-local.conf.


You should get the same results on AMD Ryzen, since it can eliminate integer mov. AMD Bulldozer-family can only eliminate xmm register copies. (According to Agner Fog, ymm register copies are an eliminated low-half and an ALU op for the high half.)

For example, AMD Bulldozer and Intel Ivybridge can sustain a throughput of 1 per clock for

.loop:
    movaps  xmm0, xmm1
    movaps  xmm2, xmm3
    movaps  xmm4, xmm5
    dec     ecx
    jnz     .loop

But Intel Sandybridge can't eliminate moves, so it would bottleneck on 4 ALU uops for 3 execution ports. If it were pxor xmm0,xmm0 instead of movaps, SnB could also sustain one iteration per clock. (But Bulldozer-family couldn't, because xor-zeroing still needs an execution unit on AMD, even though it is independent of the old value of the register. And Bulldozer-family only has 0.5c throughput for PXOR.)
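For reference, the xor-zeroing variant of that loop would look like this (a sketch, with the loop counter in ecx assumed; I didn't benchmark this one here):

.loop:
    pxor xmm0, xmm0     ; zeroing idiom: no input dependency, handled at rename on SnB-family
    pxor xmm2, xmm2
    pxor xmm4, xmm4
    dec  ecx
    jnz  .loop

On SnB only the macro-fused dec/jnz would need an execution port, so it can sustain one iteration per clock; on Bulldozer-family every pxor still needs an execution unit.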


Limitations of mov-elimination

Two dependent MOV instructions in a row expose a difference between Haswell and Skylake.

.loop:
  mov eax, ecx
  mov ecx, eax

  sub ecx, 2
  jnz .loop

Haswell: minor run-to-run variability (1.746 to 1.749 c / iter), but this is typical:

 1,749,102,925      cycles:u                  #    2.690 GHz                    
 4,000,000,212      instructions:u            #    2.29  insn per cycle         
 1,000,000,208      branches:u                # 1538.062 M/sec                  
 3,000,079,561      uops_issued_any:u         # 4614.308 M/sec                  
 1,746,698,502      uops_executed_core:u      # 2686.531 M/sec                  
   745,676,067      lsd_cycles_4_uops:u       # 1146.896 M/sec                  
  

Not all the MOV instructions are eliminated: about 0.75 of the 2 per iteration used an execution port. Every MOV that executes instead of being eliminated adds 1c of latency to the loop-carried dep chain, so it's not a coincidence that uops_executed and cycles are very similar. All the uops are part of a single dependency chain, so there's no parallelism possible. cycles is always about 5M higher than uops_executed regardless of run-to-run variation, so I guess there are just 5M cycles being used up somewhere else.

Skylake: more stable than HSW results, and more mov-elimination: only 0.6666 MOVs of every 2 needed an execution unit.

 1,666,716,605      cycles:u                  #    3.897 GHz
 4,000,000,136      instructions:u            #    2.40  insn per cycle
 1,000,000,132      branches:u                # 2338.050 M/sec
 3,000,059,008      uops_issued_any:u         # 7014.288 M/sec
 1,666,548,206      uops_executed_thread:u    # 3896.473 M/sec
   666,683,358      lsd_cycles_4_uops:u       # 1558.739 M/sec

On Haswell, lsd.cycles_4_uops accounted for all of the uops. (0.745 * 4 ~= 3). So in almost every cycle where any uops are issued, a full group of 4 is issued (from the loop-buffer. I probably should have looked at a different counter that doesn't care where they came from, like uops_issued.stall_cycles to count cycles where no uops issued).

But on SKL, 0.66666 * 4 = 2.66664 is less than 3, so in some cycles the front-end issued fewer than 4 uops. (Usually it stalls until there's room in the out-of-order back-end to issue a full group of 4, instead of issuing non-full groups).

It's weird, IDK what the exact microarchitectural limitation is. Since the loop is only 3 uops, each issue-group of 4 uops is more than a full iteration. So an issue group can contain up to 3 dependent MOVs. Perhaps Skylake is designed to break that up sometimes, to allow more mov-elimination?

update: actually this is normal for 3-uop loops on Skylake. uops_issued.stall_cycles shows that HSW and SKL issue a simple 3 uop loop with no mov-elimination the same way they issue this one. So better mov-elimination is a side-effect of splitting up issue groups for some other reason. (It's not a bottleneck because taken branches can't execute faster than 1 per clock regardless of how fast they issue). I still don't know why SKL is different, but I don't think it's anything to worry about.


In a less extreme case, SKL and HSW are the same, with both failing to eliminate 0.3333 of every 2 MOV instructions:

.loop:
  mov eax, ecx
  dec eax
  mov ecx, eax

  sub ecx, 1
  jnz .loop

 2,333,434,710      cycles:u                  #    3.897 GHz                    
 5,000,000,185      instructions:u            #    2.14  insn per cycle         
 1,000,000,181      branches:u                # 1669.905 M/sec                  
 4,000,061,152      uops_issued_any:u         # 6679.720 M/sec                  
 2,333,374,781      uops_executed_thread:u    # 3896.513 M/sec                  
 1,000,000,942      lsd_cycles_4_uops:u       # 1669.906 M/sec                  

All of the uops issue in groups of 4. Any contiguous group of 4 uops will contain exactly two MOV uops that are candidates for elimination. Since it clearly succeeds at eliminating both in some cycles, IDK why it can't always do that.


Intel's optimization manual says that overwriting the result of mov-elimination as early as possible frees up the microarchitectural resources so it can succeed more often, at least for movzx. See Example 3-23. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions.

So maybe it's tracked internally with a limited-size table of ref-counts? Something has to stop the physical register file entry from being freed when it's no longer needed as the value of the original architectural register, if it's still needed as the value of the mov destination. Freeing PRF entries as soon as possible is key, because PRF size can limit the out-of-order window to smaller than ROB size.

I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.

See also Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for more research + guesswork about how mov-elimination works, and whether it could work for xchg eax, ecx. (In practice xchg reg,reg is 3 ALU uops on Intel, but 2 eliminated uops on Ryzen. It's interesting to guess whether Intel could have implemented it more efficiently.)


BTW, as a workaround for an erratum on Haswell, Linux doesn't provide uops_executed.thread when hyperthreading is enabled, only uops_executed.core. The other core was definitely idle the whole time, not even timer interrupts, because I took it offline with echo 0 > /sys/devices/system/cpu/cpu3/online. Unfortunately this can't be done before the kernel's perf drivers (PAPI) decides that HT is enabled on boot, and my Dell laptop doesn't have a BIOS option to disable HT. So I can't get perf to use all 8 hardware PMU counters at once on that system, only 4. :/

Pallet answered 26/5, 2017 at 4:43 Comment(17)
+1 great answer! Some of it actually went over my head (e.g. I hadn't heard of "fused-domain" before) but I think I grasped what's going on. Thanks!Tweak
@Mehrdad: Since micro-fusion is a thing, "fused-domain" is a convenient term to refer to the whole pipeline (after the decoders) except for the execution units and the scheduler that feeds them. uops_issued.any is really just counting uops as they issue into the out-of-order core (the ROB and scheduler). Since none of them are micro-fused, I could have just said that. Branch mispredicts or other mis-speculation roll-backs will cause uops_retired to be less than issued/executed.Pallet
@Mehrdad: I made an update to try to explain the loop in your question better. I thought I was onto something, but it turned out not to really explain the 2% difference at all. I didn't try running it myself with perf counters, but hopefully the answer is slightly more beginner-friendly now. (I avoided the term "fused-domain" in the early part). I think it might be worthy of being marked accepted, even though I haven't really explained the 2% diff you observed with your loop. Is my explanation of the 50% difference in my artificial example clear enough?Pallet
Yeah I'm pretty sure I understand it. You're saying dec + jnz get fused into 1 operation, and so if the mov is eliminated, you have 2 operations actually executing for every 4 instructions, and each one takes a cycle, giving the 2.00 ins/cycle, and analogously for the 1.33 and 1.50 cases. The 2% is definitely curious, I agree. But it's a really good answer; I was going to accept it at some point, just wasn't in a hurry about it. Thanks for writing it.Tweak
@Mehrdad: yup. dec/jnz macro-fusion isn't essential, BTW. I only had to make a big deal out of it to explain the uop perf counters. If the jnz was a separate uop, the uops_issued and executed counts would be 1G higher, and the front-end would be maxed out at 4 per clock. e.g. lea ecx, [rax-1] dec ecx mov eax,ecx jnz .loop would decode to 4 uops, but only three of them would need execution units. But it would be susceptible to the 2% or whatever slowdown if dec ever steals port6 from jnz. (The LEA can only run on port 1 or 5, and the MOV doesn't need a port).Pallet
Just tested that loop without macro-fusion (with a mov between the dec and jnz). No weird 2% slowdown there, it runs exactly as fast as the 3-uop loop. But silly me, it doesn't need to max out the front-end because the bottleneck is still the 2 cycle latency of lea + dec.Pallet
«mov-elimination can be useful for zero-extending from 32 to 64 bits, and for 8 to 64.» your example is 8 to 32 bits.U
@JDługosz: movzx eax, bl is 8 to 64. The 32 -> 64 part is implicit from writing a 32-bit register (stackoverflow.com/questions/11177137/…). Writing movzx rax, bl would make the code larger (REX prefix) for no benefit.Pallet
FYI move elimination is disabled in Ice Lake with newer microcode. See erratum ICL065.Cb
@BeeOnRope: Oh, FFS Intel, test your CPUs better so we don't have to keep working around performance potholes introduced by mitigations. Especially since Intel's optimization advice for IvyBridge was to prefer overwriting the result of a mov right away to free up mov-elimination resources, making it more likely for the mov to be on the critical path without elimination. (And compilers seem to prefer doing more with the copy instead of the original after making a copy.)Pallet
@PeterCordes do you know of any special cases with rsp? ie movq rsp, r0; <some loop>... call 1f; 1: movq r0, rsp; .... to disable LSD in loops that don't need the stack. Is there anything to stop the restore in the loop from being eliminated? (feels like there must be something wrong with it...)Rancorous
@Noah: IIRC, needing a stack-sync uop from a stack op might interfere with the LSD. Or you're trying to break the LSD without a stack op like call in the loop?Pallet
@PeterCordes a stack op is required AFAIK (unbalanced push / pop or any call). Just trying to minimize the overhead (realistically using push or pop depending if I have p23 / p4 bottlenecks elsewhere and restoring rsp with movq). Just trying to evaluate the real cost i.e 1 BE + 1 FE uop or 2 BE uop.Rancorous
@Noah: Too bad Intel microcode isn't open-source; we know the LSD can be disabled by microcode, like in Skylake-family. (Of course, if you had multiple computers to choose from, you could just use an SKL with its LSD disabled by microcode, vs. one that didn't, one the assumption that they're otherwise microarchitecturally identical.)Pallet
Could even be possible on an older computer according to this paper. But IIRC there is now some element of encryption (maybe just a signature), so c'est la vie. Also found in the "wild": the AVX benchmark is slowed down 10-20% because of the branch misses from LSD.Rancorous
I imagine the answer is no but any chance you know if masked moves (aka vmovdqu8 ymm0, ymm1 {k1}{z}) can eliminate?Rancorous
@Noah: I'm sure they can't be eliminated, thanks to uops.info/'s testing of masked vs. unmasked mov intructions. (Note that the masking goes on the destination, which is the first operand in Intel syntax.)Pallet

Here are two small tests that I believe conclusively show evidence for mov-elimination:

__loop1:
    add edx, 1
    add edx, 1
    add ecx, 1
    jnc __loop1

versus

__loop2:
    mov eax, edx
    add eax, 1
    mov edx, eax
    add edx, 1
    add ecx, 1
    jnc __loop2

If mov added a cycle to a dependency chain, it would be expected that the second version takes about 4 cycles per iteration. On my Haswell, both take about 2 cycles per iteration, which cannot happen without mov-elimination.
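As a sanity check in the other direction (a sketch I haven't timed): turning the copies into mov same,same, which CPUs can't eliminate, should add a real cycle for each one and push the loop to about 4 cycles per iteration.

__loop3:
    mov edx, edx
    add edx, 1
    mov edx, edx
    add edx, 1
    add ecx, 1
    jnc __loop3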

Gifford answered 24/5, 2017 at 22:40 Comment(17)
Hm, how is this supposed to be measuring something different from my example? I'm also on Haswell (4700MQ) and I'm seeing the same 2% difference (2650 ms compared to 2700 ms)...Tweak
@Mehrdad because the movs are now in the dependency chain, so if they had a latency it would have to add up. In your test case, the mov is just sort of dangling at the end of chain, nothing is waiting for it to happen. It might be eliminated or not, there is no way to tell.Gifford
I mean it certainly can't be harder to eliminate it in my case, right? And regardless, I don't see any difference... so is it fair to say MOV is never eliminated entirely? (If so you might want to address the other parts of the question, given this current answer doesn't really add anything currently.)Tweak
@Mehrdad oh it's probably eliminated, but even without elimination you would barely notice that mov in the timing since nothing depends on it and there will be plenty of throughput to just execute it, dare I say, "for free" (close enough, right?). As for mov never being eliminated entirely, I don't think that's a fair conclusion to draw - it has 0 latency, what more would you want?Gifford
I mean, I still don't see it having 0 of anything. You might say it has 0.02 latency, or maybe 500x throughput, or something like that, but 0 latency? How do you look at those timings that consistently show a difference of 50 ms and see 0 of anything? I just don't get where you're pulling the zero from.Tweak
@Mehrdad the timings are different, yes. But latency can only ever (inb4 Netburst with its weird dual-pumped ALU) be an integer number of cycles, so mov either adds a cycle or it doesn't (in which case it must have been eliminated). That its mere presence has other (more subtle) effects, is really unrelated. You are absolutely right of course that those effects do exist.Gifford
I see. But I assume you only say latency is an integer number of cycles because you're assuming the CPU doesn't do the kind of optimization I just mentioned above, right? Like if it just decides to skip an instruction (it shouldn't be so hard to do this for a loop of NOPs at least, if nothing else) then the average latency of the NOP would be far less than 1 cycle but still positive, since it would run the first time and every time there is a context switch.Tweak
@Mehrdad the thing about latency is that it is the time it takes from when the instruction is issued to when a dependent instruction can be issued (a mildly fancy way to say "how long it takes", but the difference matters). NOP doesn't work like that, it depends on nothing and nothing depends on it, so it has no latency (you can measure the throughput still). It may help to look at this image. With any sort of conventional architecture (any recent x86 is known to be like this too) that will always be a (whole) number of cycles.Gifford
I see. What about mov eax, eax? Does it have a latency? If so, what if you pretend this is what I wrote instead of NOP in the previous comment? Wouldn't that still lower the average latency below 1 instruction?Tweak
@Mehrdad that's getting into odd cases a bit since it depends on how it is implemented, at least it is possible to try to measure it since it notionally reads something and writes something. Actually doing that (eg by adapting the code from my second test case) shows its latency to be 1 on Haswell (ie it is not eliminated). I can't think of a reason for that off the top of my head but that's how it isGifford
Right, but if hypothetically Haswell was smart enough to eliminate it on subsequent iterations of the loop, then wouldn't that imply its average latency is lower than 1 cycle? Like I'm asking more about the definition of latency here (and whether it can on average be less than 1 cycle) rather than about the particular implementation.Tweak
And if mov eax, eax really bothers you for this example, you can pretend I said xchg eax, ebx; xchg ebx, eax... basically any sequence of effective no-ops that a sufficiently-smart CPU might optimize out on subsequent runs.Tweak
@Mehrdad oh sorry yes, an average latency can be a non-integer. Under the hypothesis that what is happening is occasional failure to eliminate the mov, you might even say that the latency is on average some low but non-zero number. AFAIK it's just due to other effects but it's always worth a try. E: for example if the consistent small penalty for my second example changes significantly if "other harmless junk" is put in there instead of movs, that might indicate something interesting in that direction.Gifford
Ok thanks. Regarding your comment: I don't think this is an occasional failure to eliminate the mov, I think it's happening too frequently for that (2% of the time is a lot), so my conclusion is it's probably never eliminated completely. Though then again, 2% is too low for me to guess what might be actually taking 1/50th of a clock cycle on average...Tweak
Are you running this bare-metal? With or without caches enabled? Did you adjust the fetch alignment through at least 16 if not 32 bytes?Rondi
See also my comments on the discussion that inspired this question. I used perf counters to see that the total cycle count matched exactly with the number of mov instructions that weren't eliminated (by looking at uops_executed.thread). SKL can do a somewhat better job of handling two dependent renames per clock.Pallet
@harold: I came up with an even cleaner test, see my answer. It runs at the expected speeds to within measurement error (better than 1 part in 10000 for 1G iterations).Pallet
