Why would introducing useless MOV store instructions speed up a tight loop in x86_64 assembly?
Background:

While optimizing some Pascal code with embedded assembly language, I noticed an unnecessary MOV instruction, and removed it.

To my surprise, removing the unnecessary instruction caused my program to slow down.

I found that adding arbitrary, useless MOV instructions increased performance even further.

The effect is erratic, and changes based on execution order: the same junk instructions transposed up or down by a single line produce a slowdown.

I understand that the CPU does all kinds of optimizations and streamlining, but, this seems more like black magic.

The data:

A version of my code conditionally compiles three junk operations in the middle of a loop that runs 2**20==1048576 times. (The surrounding program just calculates SHA-256 hashes).

The results on my rather old machine (Intel(R) Core(TM)2 CPU 6400 @ 2.13 GHz):

avg time (ms) with -dJUNKOPS: 1822.84 ms
avg time (ms) without:        1836.44 ms

The programs were run 25 times in a loop, with the run order changing randomly each time.
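Commenters below ask about the spread behind these averages. Here is a sketch (Python) of the measurement scheme described above, extended to report the standard error; `run_once` is a placeholder that just returns the reported means, not real samples, so real use would replace it with code that launches and times each binary:

```python
# Sketch of the benchmark harness: both builds are run 25 times in a
# randomly shuffled order, and mean +/- standard error is reported.
# run_once() is a placeholder returning the reported means.
import random
import statistics

def run_once(build):
    # Placeholder timing: stand-in values, not real measurements.
    return 1822.84 if build == "junkops" else 1836.44

runs = ["junkops"] * 25 + ["plain"] * 25
random.shuffle(runs)  # randomize execution order between runs

samples = {"junkops": [], "plain": []}
for build in runs:
    samples[build].append(run_once(build))

for build, ts in samples.items():
    mean = statistics.mean(ts)
    sem = statistics.stdev(ts) / len(ts) ** 0.5  # standard error of the mean
    print(f"{build}: mean {mean:.2f} ms, std. error {sem:.2f} ms")
```

With real timings, a difference of ~14 ms is only convincing if it is several standard errors wide.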

Excerpt:

{$asmmode intel}
procedure example_junkop_in_sha256;
  var s1, t2 : uint32;
  begin
    // Here are parts of the SHA-256 algorithm, in Pascal:
    // s0 {r10d} := ror(a, 2) xor ror(a, 13) xor ror(a, 22)
    // s1 {r11d} := ror(e, 6) xor ror(e, 11) xor ror(e, 25)
    // Here is how I translated them (side by side to show symmetry):
  asm
    MOV r8d, a                 ; MOV r9d, e
    ROR r8d, 2                 ; ROR r9d, 6
    MOV r10d, r8d              ; MOV r11d, r9d
    ROR r8d, 11    {13 total}  ; ROR r9d, 5     {11 total}
    XOR r10d, r8d              ; XOR r11d, r9d
    ROR r8d, 9     {22 total}  ; ROR r9d, 14    {25 total}
    XOR r10d, r8d              ; XOR r11d, r9d

    // Here is the extraneous operation that I removed, causing a speedup
    // s1 is the uint32 variable declared at the start of the Pascal code.
    //
    // I had cleaned up the code, so I no longer needed this variable, and 
    // could just leave the value sitting in the r11d register until I needed
    // it again later.
    //
    // Since copying to RAM seemed like a waste, I removed the instruction, 
    // only to discover that the code ran slower without it.
    {$IFDEF JUNKOPS}
    MOV s1,  r11d
    {$ENDIF}

    // The next part of the code just moves on to another part of SHA-256,
    // maj { r12d } := (a and b) xor (a and c) xor (b and c)
    mov r8d,  a
    mov r9d,  b
    mov r13d, r9d // Set aside a copy of b
    and r9d,  r8d

    mov r12d, c
    and r8d, r12d  { a and c }
    xor r9d, r8d

    and r12d, r13d { c and b }
    xor r12d, r9d

    // Copying the calculated value to the same s1 variable is another speedup.
    // As far as I can tell, it doesn't actually matter what register is copied,
    // but moving this line up or down makes a huge difference.
    {$IFDEF JUNKOPS}
    MOV s1,  r9d // after mov r12d, c
    {$ENDIF}

    // And here is where the two calculated values above are actually used:
    // T2 {r12d} := S0 {r10d} + Maj {r12d};
    ADD r12d, r10d
    MOV T2, r12d

  end
end;

Try it yourself:

The code is online at GitHub if you want to try it out yourself.

My questions:

  • Why would uselessly copying a register's contents to RAM ever increase performance?
  • Why would the same useless instruction provide a speedup on some lines, and a slowdown on others?
  • Is this behavior something that could be exploited predictably by a compiler?
Edmonton answered 27/7, 2013 at 10:25 Comment(11)
There are all sorts of 'useless' instructions that can actually serve to break dependency chains, mark physical registers as retired, etc. Exploiting these operations requires some knowledge of the microarchitecture. Your question should provide a short sequence of instructions as a minimal example, rather than directing people to github.Curriculum
@BrettHale good point, thanks. I added a code excerpt with some commentary. Would copying a register's value to ram mark the register as retired, even if the value in it is used later?Edmonton
try to exchange the order of "mov r8d, a" and "mov r9d, b" without "mov s1, r11d". r8d is used in the instruction "xor r10d, r8d" and "mov r8d, something" can not be executed before the end of the "xor".Carolinecarolingian
Can you put the standard deviation on those averages? There's no actual indication in this post that there's a real difference.Havstad
Can you please try timing the instructions using the rdtscp instruction, and check the clock cycles for both versions?Sixth
Since both useless instructions write to the variable s1, you have a write-after-write dependency, and the instruction will get thrown out when the processor sees the second write to s1 in the queue. My guess was that it is putting the value in the queue so that it can be forwarded rather than fetched from the register file, but I don't think that's the case here.Echolocation
I believe there are 'magic' numbers in computing, like 4, 16, 128 and other exponents of 2. I think that these useless instructions may complete these numbers, so they are more suitable for computing operations. -- However, that's so low level.Spiritualty
@starwed: I believe you mean standard error, not standard deviation, though they're closely related. And yes, he should really, really be including some more details about the distribution, not just standard error, but min/max/median/quartiles too. Also, 25 runs just sounds a little low, to me, for such a small difference.Pascia
I had a similar question a few years ago: stackoverflow.com/questions/688325/…. Unfortunately the last time I tried to repeat those results I couldn't do it.Distichous
Can it also be due to memory alignment? I didn't do the math myself (lazy :P) but adding some dummy instructions can cause your code do be memory aligned...Headwards
Do any of real world JIT compilers exploit such techniques? (for example, mainstream JVMs, Microsoft .NET, maybe something else)Reproachful

The most likely cause of the speed improvement is that:

  • inserting a MOV shifts the subsequent instructions to different memory addresses
  • one of those moved instructions was an important conditional branch
  • that branch was being incorrectly predicted due to aliasing in the branch prediction table
  • moving the branch eliminated the alias and allowed the branch to be predicted correctly

Your Core2 doesn't keep a separate history record for each conditional jump. Instead it keeps a shared history of all conditional jumps. One disadvantage of global branch prediction is that the history is diluted by irrelevant information if the different conditional jumps are uncorrelated.

This little branch prediction tutorial shows how branch prediction buffers work. The cache buffer is indexed by the lower portion of the address of the branch instruction. This works well unless two important uncorrelated branches share the same lower bits. In that case, you end up with aliasing, which causes many mispredicted branches (stalling the instruction pipeline and slowing your program).

If you want to understand how branch mispredictions affect performance, take a look at this excellent answer: https://mcmap.net/q/14898/-why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array

Compilers typically don't have enough information to know which branches will alias and whether those aliases will be significant. However, that information can be determined at runtime with tools such as Cachegrind and VTune.
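To make the aliasing mechanism concrete, here is a toy model (Python; the table size, counter scheme, and branch addresses are invented for illustration and are far simpler than the real Core 2 predictor) of a table of 2-bit saturating counters indexed by the low bits of the branch address:

```python
# Toy branch predictor: 2-bit saturating counters indexed by the low
# bits of each branch's address (sizes and addresses are invented).
def mispredictions(table_bits, branches, outcomes):
    size = 1 << table_bits
    counters = [2] * size              # start each counter at "weakly taken"
    missed = 0
    for addr, taken in zip(branches, outcomes):
        idx = addr & (size - 1)        # index = low bits of the branch address
        if (counters[idx] >= 2) != taken:
            missed += 1
        # update the 2-bit saturating counter toward the actual outcome
        counters[idx] = min(3, counters[idx] + 1) if taken else max(0, counters[idx] - 1)
    return missed

n = 10_000
branches = [0x40, 0x140] * n           # two uncorrelated branches
outcomes = [True, False] * n           # A always taken, B never taken

aliased = mispredictions(4, branches, outcomes)   # 16 entries: 0x40 and 0x140 collide
separate = mispredictions(9, branches, outcomes)  # 512 entries: no collision
print(aliased, separate)               # → 10000 1
```

When the two branches share one entry, the always-taken branch keeps dragging the counter back toward "taken", so the never-taken branch mispredicts every single time; give each branch its own entry and the mispredictions all but vanish. Shifting instruction addresses by one inserted MOV can move a branch in or out of such a collision, which is exactly the all-or-nothing sensitivity the question describes.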

Ollieollis answered 28/7, 2013 at 8:52 Comment(5)
Hmm. This sounds promising. The only conditional branches in this sha256 implementation are the checks for the end of the FOR loops. At the time, I had tagged this revision as an oddity in git and continued optimizing. One of my next steps was to rewrite the pascal FOR loop myself in assembly, at which point these extra instructions no longer had a positive effect. Perhaps free pascal's generated code was harder for the processor to predict than the simple counter I replaced it with.Edmonton
@Edmonton That sounds like a good summary. The branch prediction table isn't very large, so one table entry might refer to more than one branch. This can render some predictions useless. The problem is easily fixed if one of the conflicting branches moves to another part of the table. Almost any little change can make this happen :-)Ollieollis
I think this is the most reasonable explanation of the specific behavior I observed, so I'm going to mark this as the answer. Thanks. :)Edmonton
There is an absolutely excellent discussion of a similar problem one of the contributers to Bochs ran into, you may want to add this to your answer: emulators.com/docs/nx25_nostradamus.htmUlphiah
Insn alignment matters for a lot more than just branch targets. Decode bottlenecks are a huge issue for Core2 and Nehalem: it often has a hard time keeping its execution units busy. Sandybridge's introduction of the uop cache increased frontend throughput a huge amount. Aligning branch targets is done because of this issue, but it affects all code.Betray

You may want to read http://research.google.com/pubs/pub37077.html

TL;DR: randomly inserting nop instructions in programs can easily increase performance by 5% or more, and no, compilers cannot easily exploit this. It's usually a combination of branch predictor and cache behaviour, but it can just as well be e.g. a reservation station stall (even when no dependency chains are being broken and there are no obvious resource over-subscriptions whatsoever).

Downstage answered 27/7, 2013 at 11:33 Comment(4)
Interesting. But is the processor (or FPC) smart enough to see that writing to ram is a NOP in this case?Edmonton
Assembler is not optimized.Preengage
Compilers could exploit it by doing incredibly expensive optimizations like repeatedly building and profiling and then varying the compiler output with a simulated annealing or genetic algorithm. I've read about some work in that area. But we're talking a minimum of 5-10 minutes of 100% CPU to compile, and the resulting optimizations would probably be CPU core model and even core or microcode revision specific.Lui
I wouldn't call it random NOP, they explain why NOPs can have a positive effect on the performance (tl;dr: https://mcmap.net/q/14899/-why-does-gcc-output-machine-code-have-nop-instructions) and the random insertion of NOP did result in performance degradation. What is interesting of the paper is that the removal of 'strategic' NOP by GCC had no effect on performance overall!Herrmann

I believe that in modern CPUs the assembly instructions, while being the last layer visible to a programmer for directing the CPU, are actually several layers removed from actual execution by the CPU.

Modern CPUs are RISC/CISC hybrids that translate CISC x86 instructions into internal instructions that are more RISC in behavior. Additionally there are out-of-order execution analyzers, branch predictors, Intel's "micro-ops fusion" that try to group instructions into larger batches of simultaneous work (kind of like the VLIW/Itanium titanic). There are even cache boundaries that could make the code run faster for god-knows-why if it's bigger (maybe the cache controller slots it more intelligently, or keeps it around longer).

CISC has always had an assembly-to-microcode translation layer, but the point is that with modern CPUs things are much much much more complicated. With all the extra transistor real estate in modern semiconductor fabrication plants, CPUs can probably apply several optimization approaches in parallel and then select the one at the end that provides the best speedup. The extra instructions may be biasing the CPU to use one optimization path that is better than others.

The effect of the extra instructions probably depends on the CPU model / generation / manufacturer, and isn't likely to be predictable. Optimizing assembly language this way would require execution against many CPU architecture generations, perhaps using CPU-specific execution paths, and would only be desirable for really really important code sections, although if you're doing assembly, you probably already know that.

Jobye answered 27/7, 2013 at 17:59 Comment(8)
Your answer is kind of confusing. In many places it seems like you are guessing, although most of what you say is correct.Directive
@Directive FWIW, I found the answer entirely comprehensible (even though the grammar wasn't perfect and the style awkward at times).Omniscience
Maybe I should clarify. What I find confusing is the lack of certaintyDirective
Mentally remove "I believe" from the beginning and there is no lack of certainty anymore. I'd submit an edit to that effect, but I don't have the technical knowledge to know if certainty is warranted.Spector
guessing that makes sense and with good argumentation is completely valid.Lophophore
No-one can really know for sure why the OP is observing this strange behavior, unless it was an engineer at Intel who had access to special diagnostic equipment. So all others can do is guess. That isn't @cowarldlydragon's fault.Demogorgon
Downvote; none of what you say explains the behaviour OP is seeing. Your answer is useless.Quark
Some of these guesses are wrong. "CPUs ... apply several optimization approaches in parallel..." nope. Things in the OOO core are pretty deterministic. "maybe the cache controller slots it more intelligently": nope, there's no way in which that guess matches reality. The only way in which answer this right is that it's saying "because modern OOO cores are complex, and there are interactions between nearby instructions". See the microarch pdf at agner.org/optimize to find out what those are, specifically. In this case, it's most likely an alignment -> decoder issue, since Core2...Betray

Preparing the cache

Move operations to memory can prepare the cache and make subsequent move operations faster. A CPU usually has two load units and one store unit. A load unit can read from memory into a register (one read per cycle); a store unit stores from a register to memory. There are also other units that perform operations between registers. All the units work in parallel, so on each cycle we may do several operations at once, but no more than two loads, one store, and several register operations. Usually that is up to 4 simple operations with plain registers, up to 3 simple operations with XMM/YMM registers, and 1-2 complex operations with any kind of registers. Your code has lots of operations with registers, so one dummy memory store operation is free (since there are more than 4 register operations anyway), but it prepares the memory cache for the subsequent store operation. To find out how memory stores work, please refer to the Intel 64 and IA-32 Architectures Optimization Reference Manual.

Breaking the false dependencies

Although this does not exactly apply to your case, 32-bit mov operations on a 64-bit processor (as in your code) are sometimes used to clear the upper bits (32-63) of a register and break dependency chains.

It is well known that under x86-64, using 32-bit operands clears the upper bits of the 64-bit register. Please read the relevant section - 3.4.1.1 - of the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1:

32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register

So the mov instructions, which may seem useless at first sight, clear the upper bits of the appropriate registers. What does this give us? It breaks dependency chains and allows the instructions to execute in parallel, out of order, via the out-of-order execution machinery implemented internally by CPUs since the Pentium Pro in 1995.

A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:

Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core micro-architecture, a number of instructions can help clear execution dependency when software uses these instruction to clear register content to zero. Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.

Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.

MOVZX and MOV with 32-bit operands are equivalent for x64 in this respect - both break dependency chains.

That's why your code executes faster. If there are no dependencies, the CPU can internally rename the registers, even though at first sight it may seem that the second instruction modifies a register used by the first instruction, so the two cannot execute in parallel. But due to register renaming they can.

Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.

I think you can now see that it is not so mysterious after all.
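Register renaming can be sketched with a toy register alias table (Python; purely illustrative, and nothing like how real hardware is built): every write to an architectural register is handed a fresh physical register, so reusing a register name, as the question's code does with r8d, creates no real dependency:

```python
# Toy register renamer: a register alias table (RAT) maps architectural
# names to physical register ids; each write allocates a fresh id.
import itertools

phys = itertools.count()   # unlimited supply of physical register ids
rat = {}                   # register alias table: arch name -> physical id

def rename(instr):
    op, dst, srcs = instr
    reads = [rat[s] for s in srcs]  # reads use the current mapping
    rat[dst] = next(phys)           # every write gets a fresh physical register
    return (op, rat[dst], reads)

# r8 is reused as a scratch register, as in the question's code:
program = [
    ("mov", "r8", []),       # r8 := a
    ("xor", "r10", ["r8"]),  # consumes the first r8
    ("mov", "r8", []),       # r8 := a again: only a false (name) dependency
    ("and", "r9", ["r8"]),   # consumes the second r8
]
renamed = [rename(i) for i in program]
for r in renamed:
    print(r)
```

After renaming, the second write to r8 lands in a different physical register than the first, so the xor that reads the old value and the mov that produces the new one can be in flight at the same time.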

Plunger answered 14/5, 2017 at 14:41 Comment(5)
This is all true, but has nothing to do with the code presented in the question.Religionism
@CodyGray - thank you for your feedback. I have edited the reply and added a chapter about the case - that mov to memory surrounded by register operations prepares the cache and it's free since the store unit is idle anyway. So the subsequent store operation will be faster.Plunger
there's no MOVZX for 32-bit operands, because all instructions with 32-bit destination zero the upper part of the full 64-bit registerSparker
@phuclv: I suspect this answer is talking about movzx with a 32-bit destination. Intel's phrasing is (use movzx) or (use 32-bit mov), and that's a valid way to parse Maxim's "(movzx) and (mov with 32-bit operands)". BTW, MOVZX missing 32 bit register to 64 bit register is a more specific Q&A.Betray
and a 1-2 complex operations with any kind of registers - The pipeline is only 4 uops wide in Intel from Core 2 to Skylake. And SIMD ALU ops compete for the same execution ports as simple and complex integer, so no, you can't do 4 add + 3 pand even momentarily on Alder Lake. And the OP's CPU is first-gen Core 2, so the front-end is often a major bottleneck, achieving 4 uops/clock is hard, with decode and register-read bottlenecks. Also only 1 load + 1 store per clock, but yes those can be free if the front-end bandwidth and ROB (reorder buffer) capacity isn't the limiting factor.Betray
