Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?

I'm doing micro-optimization on a performance-critical part of my code and came across this sequence of instructions (in AT&T syntax):

add %rax, %rbx
mov %rdx, %rax
mov %rbx, %rdx

I thought I finally had a use case for xchg which would allow me to shave an instruction and write:

add  %rbx, %rax
xchg %rax, %rdx

However, to my dismay I found from Agner Fog's instruction tables that xchg is a 3 micro-op instruction with a 2 cycle latency on Sandy Bridge, Ivy Bridge, Broadwell, Haswell and even Skylake. 3 whole micro-ops and 2 cycles of latency! The 3 micro-ops throws off my 4-1-1-1 cadence and the 2 cycle latency makes it worse than the original in the best case since the last 2 instructions in the original might execute in parallel.

Now... I get that the CPU might be breaking the instruction into micro-ops that are equivalent to:

mov %rax, %tmp
mov %rdx, %rax
mov %tmp, %rdx 

where tmp is an anonymous internal register and I suppose the last two micro-ops could be run in parallel so the latency is 2 cycles.

Given that register renaming occurs on these micro-architectures, though, it doesn't make sense to me that this is done this way. Why wouldn't the register renamer just swap the labels? In theory, this would have a latency of only 1 cycle (possibly 0?) and could be represented as a single micro-op so it would be much cheaper.

Spireme answered 19/8, 2017 at 0:5 Comment(19)
On Zen it's a 2-uop instruction with zero latency. Also note how fxch is faster than xchg on Intel, so it seems that exchange operations are not impossible to optimize. Perhaps Intel just didn't see the need to make this fast?Corkage
Yeah, I remember reading from Agner Fog's micro-architecture documents that fxch has been a pure register-renaming instruction since before the P4, which led me to believe they had done this for the general-purpose registers too, especially since register moves are also zero-latency ops on the newer processors. There is also an implication that there was specific pressure from users of the floating-point stack for fxch to be cheap.Spireme
xchg reg, reg is a rare type of instruction that has two general-purpose outputs. Off the top of my head, only imul/mul, div, pop, xadd, cmpxchg8/16b and some string operations do this. All of them except xchg and xadd are either naturally slow (div), or at least naturally produce their result in different data paths (pop) and/or with different latencies (mul). If nearly all instructions only need one result data path, it would be a waste to design a CPU that offers two low-latency data paths for a rare use of xchg.Denature
@Spireme Very interesting link, thank you!Corkage
@jeteon: fxch is hard to avoid because of the stack nature of x87. Unlike xchg, having fast fxch is important for performance in most pre-SSE floating-point code. xchg is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx / add rdx, rax instead of add rax, rdx / xchg rax, rdx.Enesco
"4-1-1-1 cadence"? Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1, 2-1-1, 3-1, or 4. (Skylake has an extra "simple" decoder, but >4 uops for one instruction still requires the microcode ROM.)Enesco
Also, does your code usually have to run from the decoders instead of the uop cache (DSB in Intel's terminology)? All the uops for an instruction have to go into the same "line" of the uop cache, but the patterns are different and multi-uop instructions have different effects. You don't usually have to optimize for decode for SnB-family, at least for anything that runs as part of a small to medium loop. (Although you can have code that won't go into the uop cache because of more than 3 lines per 32B of x86 code.)Enesco
@EOF: Good point. Every instruction that produces multiple GP register outputs decodes to multiple uops on Intel P6 / SnB-family, including the one-operand form of mul. (Flags don't count, and not touching flags doesn't enable mulx to produce 2 integer outputs with one uop. mulx is 2 uops on HSW/SKL).Enesco
@Denature - what do you mean by mul producing its results with different latencies? That the latency to the bottom-half result is different than the top half?Zennie
@Zennie You naturally get the lower half of the result with a shorter latency from a multiply unit. On ARMv7, you can actually see the different latencies, AFAIK. On x86 implementations I don't think you can see this in the latency itself, but since the lower result is produced faster in the multiply unit, it's possible that it takes a slower data path to have the same total latency as the upper result.Denature
Yeah, I was a bit confused about how the longer latency helps, since AFAIK the real solution is that two-output instructions just have at least two uops as Peter mentions above. So there is no real data path limitation because now you have two ops to spread the result writes across. Said another way, you should count outputs in the uop domain, not in the instruction domain to determine if multiple results are possible at the hardware level. @eofZennie
@PeterCordes, I'm literally searching for ~1% speed bumps so hand optimization has been working out on the main loop code. Unfortunately that's ~18kB of code so I'm not even trying to consider the uop cache anymore. Regarding the 4-1-1-1, I wasn't sure I was reading how that worked right but what you say makes sense and just changed so much of how I'm going about things.Spireme
@EOF, it makes sense that without special support at the RAT, the instruction would be broken up into uops and that explains the two-cycle latency. I suppose it's basically "emulating" support for an obscure instruction. I just thought that with the switch to using a PRF, it would be a matter of changing labels, but I suppose I've swallowed the marketing hype about just how "easy" that would be.Spireme
As an aside, with a mul %rbx do you really get %rdx and %rax all at once or does the ROB technically have access to the lower part of the result one cycle earlier than the higher part? Or is it like the "mul" uop goes into the multiplication unit and then the multiplication unit issues two uops straight into the ROB to write the result at the end?Spireme
Your question says AT&T syntax, but you left off the % on register names. Are your operands in destination-last order, or is this fully Intel-syntax with add dst, src?Enesco
BTW, "why" - because C compilers don't use xchg for anything except atomic lock synchronization in multi-thread, or maybe few other special cases. So there was no reason to make it optimized in modern x86. You don't need it, if you have mov and enough spare registers, and you need that reg allocation logic in compiler any way, exchange is just special case (something about how "everything looks as nail, once you have hammer in hand").Uncircumcision
@PeterCordes, it is add dst, src. I've since fixed it in an edit, thanks. I guess the somewhat subjective nature of the question makes this hard to determine but a lot of the above comments seem like they could just as well be answers on their own, with their own comments rather than notes directly tied to the question.Spireme
I'm working on an answer. TL:DR: nothing is as easy as you'd think based on the simple mental model of the uarch that we use when optimizing for it. Haswell even dropped single-uop fxch support; it's now 2 uops, so clearly there was some special hardware support required that was worth eliminating for power / complexity reasons as x87 becomes less and less relevant.Enesco
Btw, regarding the mul question above, if anyone is interested, I just came across this in the Intel Optimization Manual: "In Intel microarchitecture code name Sandy Bridge, the low 64-bit result of a 128-bit multiply is ready to use in 3 cycles and the high 64-bit result is ready one cycle after the low 64-bit result. This can speed up the calculation of integer multiply and division of large integers."Spireme

Supporting efficient xchg is non-trivial, and presumably not worth the extra complexity it would require in various parts of the CPU. A real CPU's microarchitecture is much more complicated than the mental model that you can use while optimizing software for it. For example, speculative execution makes everything more complicated, because it has to be able to roll back to the point where an exception occurred.

Making fxch efficient was important for x87 performance because the stack nature of x87 makes it (or alternatives like fld st(2)) hard to avoid. Compiler-generated FP code (for targets without SSE support) really does use fxch a significant amount. It seems that fast fxch was done because it was important, not because it's easy. Intel Haswell even dropped support for single-uop fxch. It's still zero-latency, but decodes to 2 uops on HSW and later (up from 1 in P5, and PPro through IvyBridge).

xchg is usually easy to avoid. In most cases, you can just unroll a loop so it's ok that the same value is now in a different register. e.g. Fibonacci with add rax, rdx / add rdx, rax instead of add rax, rdx / xchg rax, rdx. Compilers generally don't use xchg reg,reg, and usually hand-written asm doesn't either. (This chicken/egg problem is pretty similar to loop being slow (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?). loop would have been very useful for adc loops on Core2/Nehalem where an adc + dec/jnz loop causes partial-flag stalls.)
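To make the unrolling idea concrete, here is a minimal sketch (the loop counter in ecx and the labels are just illustrative, not from the question's code):

fib_xchg:
    add  rax, rdx      # rax = F(n-2) + F(n-1) = F(n)
    xchg rax, rdx      # restore which register holds the newest value: 3 uops on Intel
    dec  ecx
    jnz  fib_xchg

fib_unrolled:
    add  rax, rdx      # rax = next Fibonacci number
    add  rdx, rax      # rdx = the one after that; no xchg needed
    sub  ecx, 2        # two Fibonacci steps per loop iteration
    jg   fib_unrolled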

Since xchg would still be slow-ish on previous CPUs, compilers wouldn't start using it with -mtune=generic for several years even if a new design made it fast. Unlike fxch or mov-elimination, a design change to support fast xchg wouldn't help the CPU run most existing code faster, and would only enable performance gains over the current design in rare cases where it's actually a useful peephole optimization.


Integer registers are complicated by partial-register stuff, unlike x87

There are 4 operand sizes of xchg, 3 of which use the same opcode with REX or operand-size prefixes. (xchg r8,r8 is a separate opcode, so it's probably easier to make the decoders decode it differently from the others). The decoders already have to recognize xchg with a memory operand as special, because of the implicit lock prefix, but it's probably less decoder complexity (transistor-count + power) if the reg-reg forms all decode to the same number of uops for different operand sizes.

Making some r,r forms decode to a single uop would be even more complexity, because single-uop instructions have to be handled by the "simple" decoders as well as the complex decoder. So they would all need to be able to parse xchg and decide whether it was a single uop or multi-uop form.


AMD and Intel CPUs behave somewhat similarly from a programmer's perspective, but there are many signs that the internal implementation is vastly different. For example, Intel mov-elimination only works some of the time, limited by some kind of microarchitectural resources, but AMD CPUs that do mov-elimination do it 100% of the time (e.g. Bulldozer for the low lane of vector regs).

See Intel's optimization manual, Example 3-23. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions, where they discuss overwriting the zero-latency-movzx result right away to free up the internal resource sooner. (I tried the examples on Haswell and Skylake, and found that mov-elimination did in fact work significantly more of the time when doing that, but that it was actually slightly slower in total cycles, instead of faster. The example was intended to show the benefit on IvyBridge, which probably bottlenecks on its 3 ALU ports, but HSW/SKL only bottleneck on resource conflicts in the dep chains and don't seem to be bothered by needing an ALU port for more of the movzx instructions.)

I don't know exactly what needs tracking in a limited-size table(?) for mov-elimination. Probably it's related to needing to free register-file entries as soon as possible when they're no longer needed, because Physical Register File size limits rather than ROB size can be the bottleneck for the out-of-order window size. Swapping around indices might make this harder.

xor-zeroing is eliminated 100% of the time on Intel Sandybridge-family; it's assumed that this works by renaming to a physical zero register, and this register never needs to be freed.

If xchg used the same mechanism that mov-elimination does, it also could probably only work some of the time. It would need to decode to enough uops to work in cases where it isn't handled at rename. (Or else the issue/rename stage would have to insert extra uops when an xchg will take more than 1 uop, like it does when un-laminating micro-fused uops with indexed addressing modes that can't stay micro-fused in the ROB, or when inserting merging uops for flags or high-8 partial registers. But that's a significant complication that would only be worth doing if xchg was a common and important instruction.)

Note that xchg r32,r32 has to zero-extend both results to 64 bits, so it can't be a simple swap of RAT (Register Alias Table) entries. It would be more like truncating both registers in-place. And note that Intel CPUs never eliminate mov same,same. It does already need to support mov r32,r32 and movzx r32, r8 with no execution port, so presumably it has some bits that indicate that rax = al or something. (And yes, Intel HSW/SKL do that, not just Ivybridge, despite what Agner's microarch guide says.)

We know P6 and SnB had upper-zeroed bits like this, because xor eax,eax before setz al avoids a partial-register stall when reading eax. HSW/SKL never rename al separately in the first place, only ah. It may not be a coincidence that partial-register renaming (other than AH) seems to have been dropped in the same uarch that introduced mov-elimination (Ivybridge). Still, setting that bit for 2 registers at once would be a special case that required special support.
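For reference, the classic xor-zeroing idiom looks like this (a sketch; the cmp and the final add are just placeholder producer and consumer, not from the original text):

    xor   eax, eax     # dependency-breaking zeroing idiom; marks the full register as zeroed
    cmp   rdi, rsi     # placeholder flag producer
    setz  al           # writes only the low 8 bits of eax
    add   r8d, eax     # reading full eax afterwards: no partial-register stall (P6) or merging uop (SnB)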

xchg r64,r64 could maybe just swap the RAT entries, but decoding that differently from the r32 case is yet another complication. It might still need to trigger partial-register merging for both inputs, but add r64,r64 needs to do that, too.

Also note that an Intel uop (other than fxch) only ever produces one register result (plus flags). Not touching flags doesn't "free up" an output slot; for example, mulx r64,r64,r64 still takes 2 uops to produce 2 integer outputs on HSW/SKL, even though all the "work" is done in the multiply unit on port 1, same as with mul r64, which does produce a flag result.

Even if it is as simple as "swap the RAT entries", building a RAT that supports writing more than one entry per uop is a complication. What to do when renaming 4 xchg uops in a single issue group? It seems to me like it would make the logic significantly more complicated. Remember that this has to be built out of logic gates / transistors. Even if you say "handle that special case with a trap to microcode", you have to build the whole pipeline to support the possibility that that pipeline stage could take that kind of exception.

Single-uop fxch requires support for swapping RAT entries (or some other mechanism) in the FP RAT (fRAT), but it's a separate block of hardware from the integer RAT (iRAT). Leaving out that complication in the iRAT seems reasonable even if you have it in the fRAT (pre-Haswell).

Issue/rename complexity is definitely an issue for power consumption, though. Note that Skylake widened a lot of the front-end (legacy decode and uop cache fetch), and retirement, but kept the 4-wide issue/rename limit. SKL also added replicated execution units on more ports in the back-end, so issue bandwidth is a bottleneck even more of the time, especially in code with a mix of loads, stores, and ALU.

The RAT (or the integer register file, IDK) may even have limited read ports, since there seem to be some front-end bottlenecks in issuing/renaming many 3-input uops like add rax, [rcx+rdx]. I posted some microbenchmarks (this and the follow-up post) showing Skylake being faster than Haswell when reading lots of registers, e.g. with micro-fusion of indexed addressing modes. Or maybe the bottleneck there was really some other microarchitectural limit.
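The flavour of loop used in that kind of test looks roughly like this (a sketch under my own assumptions, not the exact code from those posts); each micro-fused add reads three registers per fused-domain uop, two of them "cold":

reg_read_loop:
    add  rax, [rcx + rdx]   # micro-fused load+add with an indexed addressing mode
    add  rbx, [rcx + rdx]   # 3 register reads per fused-domain uop (dst + base + index)
    add  rsi, [rcx + rdx]
    add  rdi, [rcx + rdx]   # rcx/rdx stay "cold": never written inside the loop
    dec  r9d
    jnz  reg_read_loop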


But how does 1-uop fxch work? IDK how it's done in Sandybridge / Ivybridge. In P6-family CPUs, an extra remapping table exists basically to support FXCH. That might only be needed because P6 uses a Retirement Register File with 1 entry per "logical" register, instead of a physical register file (PRF). As you say, you'd expect it to be simpler when even "cold" register values are just a pointer to a PRF entry. (Source: US patent 5,499,352, Floating point register alias table FXCH and retirement floating point register array, which describes Intel's P6 uarch.)

One main reason the rfRAT array 802 is included within the present invention fRAT logic is a direct result of the manner in which the present invention implements the FXCH instruction.

(Thanks Andy Glew (@krazyglew), I hadn't thought of looking up patents to find out about CPU internals.) It's pretty heavy going, but may provide some insight into the bookkeeping needed for speculative execution.

Interesting tidbit: the patent describes the integer side as well, and mentions that there are some "hidden" logical registers which are reserved for use by microcode. (Intel's 3-uop xchg almost certainly uses one of these as a temporary.)


We might be able to get some insight from looking at what AMD does.

Interestingly, AMD has 2-uop xchg r,r in K10, Bulldozer-family, Bobcat/Jaguar, and Ryzen. (But Jaguar xchg r8,r8 is 3 uops. Maybe to support the xchg ah,al corner case without a special uop for swapping the low 16 of a single reg).

Presumably both uops read the old values of the input architectural registers before the first one updates the RAT. IDK exactly how this works, since they aren't necessarily issued/renamed in the same cycle (but they are at least contiguous in the uop flow, so at worst the 2nd uop is the first uop in the next cycle). I have no idea if Haswell's 2-uop fxch works similarly, or if they're doing something else.

Ryzen is a new architecture designed after mov-elimination was "invented", so presumably they take advantage of it wherever possible. (Bulldozer-family renames vector moves (but only for the low 128b lane of YMM vectors); Ryzen is the first AMD architecture to do it for GP regs too.) xchg r32,r32 and r64,r64 are zero-latency (renamed), but still 2 uops each. (r8 and r16 need an execution unit, because they merge with the old value instead of zero-extending or copying the entire reg, but are still only 2 uops).

Ryzen's fxch is 1 uop. AMD (like Intel) probably isn't spending a lot of transistors on making x87 fast (e.g. fmul is only 1 per clock and on the same port as fadd), so presumably they were able to do this without a lot of extra support. Their micro-coded x87 instructions (like fyl2x) are faster than on recent Intel CPUs, so maybe Intel cares even less (at least about the microcoded x87 instruction).

Maybe AMD could have made xchg r64,r64 a single uop too, more easily than Intel. Maybe even xchg r32,r32 could be single uop, since like Intel it needs to support mov r32,r32 zero-extension with no execution port, so maybe it could just set whatever "upper 32 zeroed" bit exists to support that. Ryzen doesn't eliminate movzx r32, r8 at rename, so presumably there's only an upper32-zero bit, not bits for other widths.


What Intel might be able to do cheaply if they wanted to:

It's possible that Intel could support 2-uop xchg r,r the way Ryzen does (zero latency for the r32,r32 and r64,r64 forms, or 1c for the r8,r8 and r16,r16 forms) without too much extra complexity in critical parts of the core, like the issue/rename and retirement stages that manage the Register Alias Table (RAT). But maybe not, if they can't have 2 uops read the "old" value of a register when the first uop writes it.

Stuff like xchg ah,al is definitely an extra complication, since Intel CPUs don't rename partial registers separately anymore, except AH/BH/CH/DH.


xchg latency in practice on current hardware

Your guess about how it might work internally is good. It almost certainly uses one of the internal temporary registers (accessible only to microcode). Your guess about how they can reorder is too limited, though. In fact, one direction has 2c latency and the other direction has ~1c latency.

00000000004000e0 <_start.loop>:
  4000e0:       48 87 d1                xchg   rcx,rdx   # slow version
  4000e3:       48 83 c1 01             add    rcx,0x1
  4000e7:       48 83 c1 01             add    rcx,0x1
  4000eb:       48 87 ca                xchg   rdx,rcx
  4000ee:       48 83 c2 01             add    rdx,0x1
  4000f2:       48 83 c2 01             add    rdx,0x1
  4000f6:       ff cd                   dec    ebp
  4000f8:       7f e6                   jg     4000e0 <_start.loop>

This loop runs in ~8.06 cycles per iteration on Skylake. Reversing the xchg operands makes it run in ~6.23 cycles per iteration (measured with perf stat on Linux). uops issued/executed counters are equal, so no elimination happened. It looks like the dst <- src direction is the slow one, since putting the add uops on that dependency chain makes things slower than when they're on the dst -> src dependency chain.

If you ever want to use xchg reg,reg on the critical path (code-size reasons?), do it with the dst -> src direction on the critical path, because that's only about 1c latency.
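For reference, the ~6.2c version is literally the same loop with the operands of both xchg instructions reversed, so each add chain hangs off the cheap dst -> src direction (label renamed here for clarity):

fastloop:
    xchg rdx,rcx       # rcx is the src here: it gets rdx's old value via the ~1c direction
    add  rcx,0x1
    add  rcx,0x1
    xchg rcx,rdx       # rdx is the src here: it gets rcx's old value via the ~1c direction
    add  rdx,0x1
    add  rdx,0x1
    dec  ebp
    jg   fastloop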


Other side-topics from comments and the question

The 3 micro-ops throws off my 4-1-1-1 cadence

Sandybridge-family decoders are different from Core2/Nehalem. They can produce up to 4 uops total, not 7, so the patterns are 1-1-1-1, 2-1-1, 3-1, or 4.

Also beware that if the last uop is one that can macro-fuse, they will hang onto it until the next decode cycle in case the first instruction in the next block is a jcc. (This is a win when code runs multiple times from the uop cache for each time it's decoded. And that's still usually 3 uops per clock decode throughput.)

Skylake has an extra "simple" decoder so it can do 1-1-1-1-1 up to 4-1 I guess, but > 4 uops for one instruction still requires the microcode ROM. Skylake beefed up the uop cache, too, and can often bottleneck on the 4 fused-domain uops per clock issue/rename throughput limit if the back-end (or branch misses) aren't a bottleneck first.

I'm literally searching for ~1% speed bumps so hand optimization has been working out on the main loop code. Unfortunately that's ~18kB of code so I'm not even trying to consider the uop cache anymore.

That seems kinda crazy, unless you're mostly limiting yourself to asm-level optimization in shorter loops inside your main loop. Any inner loops within the main loop will still run from the uop cache, and that should probably be where you're spending most of your time optimizing. Compilers usually do a good-enough job that it's not practical for a human to do much over a large scale. Try to write your C or C++ in such a way that the compiler can do a good job with it, of course, but looking for tiny peephole optimizations like this over 18kB of code seems like going down the rabbit hole.

Use perf counters like idq.dsb_uops vs. uops_issued.any to see how many of your total uops came from the uop cache (DSB = Decoded Stream Buffer or something). Intel's optimization manual has some suggestions for other perf counters to look at for code that doesn't fit in the uop cache, such as DSB2MITE_SWITCHES.PENALTY_CYCLES. (MITE is the legacy-decode path). Search the pdf for DSB to find a few places it's mentioned.

Perf counters will help you find spots with potential problems, e.g. regions with higher than average uops_issued.stall_cycles could benefit from finding ways to expose more ILP if there are any, or from solving a front-end problem, or from reducing branch-mispredicts.


As discussed in comments, a single uop produces at most 1 register result

As an aside, with a mul %rbx, do you really get %rdx and %rax all at once or does the ROB technically have access to the lower part of the result one cycle earlier than the higher part? Or is it like the "mul" uop goes into the multiplication unit and then the multiplication unit issues two uops straight into the ROB to write the result at the end?

Terminology: the multiply result doesn't go into the ROB. It goes over the forwarding network to whatever other uops read it, and goes into the PRF.

The mul %rbx instruction decodes to 2 uops in the decoders. They don't even have to issue in the same cycle, let alone execute in the same cycle.

However, Agner Fog's instruction tables only list a single latency number. It turns out that 3 cycles is the latency from both inputs to RAX. The minimum latency for RDX is 4c, according to InstlatX64 testing on both Haswell and Skylake-X.
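A dependency-chain loop in the style of the xchg test above can separate the two numbers (a sketch, not the InstlatX64 code; rbx is assumed to hold some multiplier and ecx an iteration count):

    # chain through the low half: ~3c per iteration
mul_rax_chain:
    mul  rbx           # rdx:rax = rax * rbx; next iteration's rax input is this rax output
    dec  ecx
    jnz  mul_rax_chain

    # chain through the high half: ~4c per iteration
mul_rdx_chain:
    mul  rbx           # rdx:rax = rax * rbx
    mov  rax, rdx      # put the high half back on the critical path (mov may be eliminated)
    dec  ecx
    jnz  mul_rdx_chain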

From this, I conclude that the 2nd uop is dependent on the first, and exists to write the high half of the result to an architectural register. The port1 uop produces a full 128b multiply result.

I don't know where the high-half result lives until the p6 uop reads it. Perhaps there's some sort of internal queue between the multiply execution unit and hardware connected to port 6. By scheduling the p6 uop with a dependency on the low-half result, that might arrange for the p6 uops from multiple in-flight mul instructions to run in the correct order. But then instead of actually using that dummy low-half input, the uop would take the high half result from the queue output in an execution unit that's connected to port 6 and return that as the result. (This is pure guess work, but I think it's plausible as one possible internal implementation. See comments for some earlier ideas).

Interestingly, according to Agner Fog's instruction tables, on Haswell the two uops for mul r64 go to ports 1 and 6. mul r32 is 3 uops, and runs on p1 + p0156. Agner doesn't say whether that's really 2p1 + p0156 or p1 + 2p0156 like he does for some other insns. (However, he says that mulx r32,r32,r32 runs on p1 + 2p056 (note that p056 doesn't include p1).)

Even more strangely, he says that Skylake runs mulx r64,r64,r64 on p1 p5 but mul r64 on p1 p6. If that's accurate and not a typo (which is a possibility), it pretty much rules out the possibility that the extra uop is an upper-half multiplier.

Enesco answered 24/8, 2017 at 21:10 Comment(14)
What sort of limbo-like place is that high part of a product in until that other uop grabs it?Claudetta
@harold: that's the biggest hole in my reasoning. Maybe it can send it over the forwarding network somehow, and the scheduler has to arrange for the high-write uop to hit the other port within a reasonable time. So maybe the actual execution unit design bends the uops/ports rules a bit. Or maybe there's an internal high-half-result buffer that a multiply uop can write at the same time as writing a regular register. Just a buffer, not a renamed register, so the high-write uop has to run before another full-multiply uop can execute? ...Enesco
... That seems like it would make 1 mul per clock throughput hard to achieve, though. I think it's significant that mul/mulx r32 is 3 uops instead of 2, probably because it has to split the bottom 64 bits of the multiplier output into a low and high half. But I'm not sure what that tells us about mul r64. I'm leaning more towards the internal buffer theory; it seems unlikely that mul r64 just sends the high half over the forwarding network, or else the scheduler would have to know too much about the coupling between multiply uops.Enesco
Thanks for the very comprehensive answer. I'm relatively new to all this so I'm tripping over myself just trying to say things, but thanks for bearing through that to actually share what you know. Just to clarify: the path through a single loop is about 18kB, but ~50% of the profiling time is going into a ~700B function which has no loop structure. The function is indeed called multiple times by the hot loop. I've gotten some easy gains with vectorization, re-arranging high-level logic, etc but now I'm down to single-digit percent gains in this tough nut at the centre.Spireme
It's interesting to learn how the big two have approached this. I have much to learn from going through the links you provided still. I don't remember where I read this but I remember seeing that with PRF structures, each write instruction gets a new PRF register which only becomes the logical register at retirement. If this is the case, then the 2-uop xchg could be run out of order since the RAT would make sure they have the right value before the ROB. Still... I see what you mean about the RAT having a hard time with an "atomic" write and read of two registers all in one cycle.Spireme
@harold: Updated this answer with a plausible (I think) idea for high-half mul results. Maybe there's a queue, and a dummy input can get the high-half uops scheduled in order.Enesco
@jeteon: updated with testing results I left out earlier. xchg dst,src has only 1c latency for the dst->src direction, so that's the one with a single internal mov.Enesco
Those practical latency measurements are enlightening. There's certainly more to this than the simple 2c latency documented. I'm a bit confused about what you mean by "single internal mov". I'm guessing that if the 3 micro-ops are as I've guessed then the first two moves might be able to run in parallel (with register renaming, maybe even not at all with mov elimination) so maybe you'd observe just 1c latency with the one operand but a full 2c if you wanted the operand written to last since it depends on the internal "tmp" mov?Spireme
@jeteon: Keep in mind that the "documented" latency was calculated from running a long sequence of xchg %eax, %edx or something. (Agner Fog says he tests by repeating instructions). For example, Agner's shr %cl, %r32 numbers are semi-bogus, too. The 2c latency is from flag input to flag output. If you repeat shl %cl, %eax 100 times in a loop, you will measure 2c latency. But if you put it between add instructions or something that break the flag dep, you measure more like 1.2c average. See my experiment here: agner.org/optimize/blog/read.php?i=415#860Enesco
@jeteon: For xchg, you're on the right track looking at what can run in parallel, but getting hung up on one direction interacting with the other. The critical path in one direction is mov %rax, %tmp / mov %tmp, %rdx. The critical path in the other direction is mov %rdx, %rax. (But these are a special kind of mov uop that can't be eliminated, unfortunately. IDK why.) Anyway, using an internal tmp means there doesn't have to be any interaction between the two directions. They'll schedule to different ports and run oldest-ready-first as usual.Enesco
@jeteon: oh also, re: variable-count shifts: Intel's optimization manual says something about the CPU being able to cancel the flag-producing uops if it turns out that the flag results of a shift are unused. So that saves execution-port throughput, but not front-end. Point is, if you haven't thought of even looking for a certain effect, the test you construct might not show any evidence for it. I went looking into shifts because I wondered if it was really as bad as 2c when the critical path was the data, not flags, and wondered what the uops were doing.Enesco
@jeteon: forgot to mention: Note what happens when xchg has one input ready but the other not. The corresponding output will be ready in 1 or 2 cycles, even if the other input is still not ready. So a long chain of imul, then an xchg, then a long chain of imul on the other side, then another xchg, could still be executed efficiently, overlapping both imul dep chains instead of being serialized by depending on each other inside xchg. (I tried a mini version of this with those short add chains, so I think my prediction is right.)Enesco
@jeteon: A register-renaming implementation would still have the same property, though. Changing the mapping for not-yet-produced registers is something mov-elimination already does. I think a single-uop zero-latency xchg or fxch avoids cross-coupling the data dependencies, because it never has to actually execute (and thus doesn't have to wait for both inputs to be ready before producing either output).Enesco
Fantastic answer. I didn't even know mul/imul actually have different latencies for the upper and lower results; I was mostly thinking they could reasonably have them. As to why it has them, I'd still think that the upper half is probably not fully computed when the lower half is ready. Having a buffer to store the upper half for a cycle doesn't make sense. It offers no advantage and puts the calculation of the upper result on the critical path of the lower result unnecessarily.Denature
