What setup does REP do?

Quoting the Intel® 64 and IA-32 Architectures Optimization Reference Manual, §2.4.6 "REP String Enhancement":

The performance characteristics of using REP string can be attributed to two components: startup overhead and data transfer throughput.

[...]

For REP string of larger granularity data transfer, as ECX value increases, the startup overhead of REP String exhibit step-wise increase:

  • Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles,
  • Fast string (ECX >= 76: excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of REP string latency will vary if one of the 16-byte data transfer spans across cache line boundary:

    • Split-free: the latency consists of a startup cost of about 40 cycles and each 64 bytes of data adds 4 cycles,
    • Cache splits: the latency consists of a startup cost of about 35 cycles and each 64 bytes of data adds 6 cycles.
  • Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.

(emphasis mine)

There is no further mention of this startup cost. What is it? What does it do, and why does it always take more time?

Tori answered 24/11, 2015 at 19:21 Comment(7)
It's an implementation detail that's only visible in how the performance of the instruction changes depending on the value of ECX. The quoted text tells you how to calculate the visible difference. What more do you need to know?Mccaskill
@RossRidge We could let the first question go, but definitely not the second and the third.Tori
Why not? If I said it was due to magic fairies what difference would it make?Mccaskill
@RossRidge: It turns out there actually is a real and interesting answer. As you say, it makes zero difference to how you optimize code. However, it makes more sense (and thus is easier to understand and remember the details) once you know that most of it is from lack of microcode branch prediction.Illegitimacy
@PeterCordes It's not clear if your answer actually addresses the original poster's question. It's an interesting answer, but not something he would "definitely" need to know.Mccaskill
@RossRidge: Good point; fixed my answer to take an educated guess at what the OP actually asked, not just why it was sometimes faster.Illegitimacy
This question is obviously not primarily opinion based.Tori

Note that only rep movs and rep stos are fast. repe/ne cmps and scas on current CPUs only loop 1 element at a time. (https://agner.org/optimize/ has some perf numbers, like 2 cycles per RCX count for repe cmpsb). They still have some microcode startup overhead, though.


The rep movs microcode has several strategies to choose from. If the src and dest don't overlap closely, the microcoded loop can transfer in chunks of 64 bits or larger. (This is the so-called "fast strings" feature introduced with P6 and occasionally re-tuned for later CPUs that support wider loads/stores.) But if dest is only one byte from src, rep movs has to produce the exact same result you'd get from that many separate movs instructions.

So the microcode has to check for overlap, and probably for alignment (of src and dest separately, or relative alignment). It probably also chooses something based on small/medium/large counter values.

According to Andy Glew's comments on an answer to Why are complicated memcpy/memset superior?, conditional branches in microcode aren't subject to branch-prediction. So there's a significant penalty in startup cycles if the default not-taken path isn't the one actually taken, even for a loop that uses the same rep movs with the same alignment and size.

He supervised the initial rep string implementation in P6, so he should know. :)

REP MOVS uses a cache protocol feature that is not available to regular code. Basically like SSE streaming stores, but in a manner that is compatible with normal memory ordering rules, etc. // The "large overhead for choosing and setting up the right method" is mainly due to the lack of microcode branch prediction. I have long wished that I had implemented REP MOVS using a hardware state machine rather than microcode, which could have completely eliminated the overhead.

By the way, I have long said that one of the things that hardware can do better/faster than software is complex multiway branches.

Intel x86 have had "fast strings" since the Pentium Pro (P6) in 1996, which I supervised. The P6 fast strings took REP MOVSB and larger, and implemented them with 64 bit microcode loads and stores and a no-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in iVB.

The big weakness of doing fast strings in microcode was (a) microcode branch mispredictions, and (b) the microcode fell out of tune with every generation, getting slower and slower until somebody got around to fixing it. Just like a library memcpy falls out of tune. I suppose that it is possible that one of the missed opportunities was to use 128-bit loads and stores when they became available, and so on.

In retrospect, I should have written a self-tuning infrastructure, to get reasonably good microcode on every generation. But that would not have helped use new, wider, loads and stores, when they became available. // The Linux kernel seems to have such an autotuning infrastructure, that is run on boot. // Overall, however, I advocate hardware state machines that can smoothly transition between modes, without incurring branch mispredictions. // It is debatable whether good microcode branch prediction would obviate this.

Based on this, my best guess at a specific answer is: the fast-path through the microcode (as many branches as possible actually take the default not-taken path) is the 15-cycle startup case, for intermediate lengths.

Since Intel doesn't publish the full details, black-box measurements of cycle counts for various sizes and alignments are the best we can do. Fortunately, that's all we need to make good choices. Intel's manual, and http://agner.org/optimize/, have good info on how to use rep movs.
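
(As an illustration, such a black-box measurement could look roughly like the sketch below. This is not anyone's actual test code: it assumes x86-64 and GNU C inline asm, the size list is an arbitrary choice, and rdtsc counts reference cycles rather than core clock cycles, so the numbers are only comparable to each other.)

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc, _mm_lfence */

    static unsigned char src[4096], dst[4096];

    /* Time one rep movsb of `len` bytes, fenced so the timed region
     * isn't reordered around the copy. */
    static uint64_t time_rep_movsb(size_t len)
    {
        void *d = dst;
        const void *s = src;
        size_t n = len;

        _mm_lfence();
        uint64_t t0 = __rdtsc();
        _mm_lfence();
        asm volatile("rep movsb" : "+D"(d), "+S"(s), "+c"(n) : : "memory");
        _mm_lfence();
        uint64_t t1 = __rdtsc();
        return t1 - t0;
    }

    int main(void)
    {
        static const size_t sizes[] = { 1, 15, 16, 31, 76, 128, 512, 4096 };
        for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
            uint64_t best = UINT64_MAX;
            for (int r = 0; r < 1000; r++) {   /* keep the best (warm) run */
                uint64_t t = time_rep_movsb(sizes[i]);
                if (t < best)
                    best = t;
            }
            printf("%5zu bytes: ~%llu ref cycles (incl. timing overhead)\n",
                   sizes[i], (unsigned long long)best);
        }
        return 0;
    }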


Fun fact: without ERMSB (new in IvB): rep movsb is optimized for small-ish copies. It takes longer to start up than rep movsd or rep movsq for large (more than a couple hundred bytes, I think) copies, and even after that may not achieve the same throughput.

The optimal sequence for large aligned copies without ERMSB and without SSE/AVX (e.g. in kernel code) may be rep movsq and then clean-up with something like an unaligned mov that copies the last 8 bytes of the buffer, possibly overlapping with the last aligned chunk of what rep movsq did. (basically use glibc's small-copy memcpy strategy). But if the size might be smaller than 8 bytes, you need to branch unless it's safe to copy more bytes than needed. Or rep movsb is an option for cleanup if small code-size matters more than performance. (rep will copy 0 bytes if RCX = 0).
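
That strategy looks roughly like the following sketch (illustrative only, not glibc's actual code: x86-64, GNU C inline asm, and it assumes len >= 8 and non-overlapping buffers; the function name is made up for the example):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Bulk copy with rep movsq, then one unaligned 8-byte load/store that
     * covers the last 8 bytes of the buffer, possibly overlapping what
     * rep movsq already wrote.  Assumes len >= 8, no overlap, x86-64. */
    static void copy_movsq_with_tail(void *dst, const void *src, size_t len)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t qwords = len / 8;            /* whole 8-byte chunks */

        asm volatile("rep movsq"
                     : "+D"(d), "+S"(s), "+c"(qwords)
                     :
                     : "memory");

        /* Tail: last 8 bytes of the buffer, unaligned and overlapping. */
        uint64_t tail;
        memcpy(&tail, (const unsigned char *)src + len - 8, 8);
        memcpy((unsigned char *)dst + len - 8, &tail, 8);
    }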

A SIMD vector loop is often at least slightly faster than rep movsb even on CPUs with Enhanced Rep Move/Stos B. Especially if alignment isn't guaranteed. (Enhanced REP MOVSB for memcpy, and see also Intel's optimization manual. Links in the x86 tag wiki)


Further details: I think there's some discussion somewhere on SO about testing how rep movsb affects out-of-order exec of surrounding instructions, how soon uops from later instructions can get into the pipeline. I think we found some info in an Intel patent that shed some light on the mechanism.

Microcode can use a kind of predicated load and store uop that lets it issue a bunch of uops without initially knowing the value of RCX. If it turns out RCX was a small value, some of those uops choose not to do anything.

I've done some testing of rep movsb on Skylake. It seems consistent with that initial-burst mechanism: below a certain threshold of size like 96 bytes or something, IIRC performance was nearly constant for any size. (With small aligned buffers hot in L1d cache). I had rep movs in a loop with an independent imul dependency chain, testing that it can overlap execution.

But then there was a significant dropoff beyond that size, presumably when the microcode sequencer finds out that it needs to emit more copy uops. So I think when the rep movsb microcoded-uop reaches the front of the IDQ, it gets the microcode sequencer to emit enough load + store uops for some fixed size, and a check to see if that was sufficient or if more are needed.

This is all from memory, I didn't re-test while updating this answer. If this doesn't match reality for anyone else, let me know and I'll check again.
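
(The kind of test described above can be sketched roughly as follows. This is a reconstruction from the description, not the original test code: it assumes x86-64, GNU C inline asm, rdtsc as a rough proxy for cycles, and the buffer size and chain length are arbitrary.)

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc */

    #define ITERS  (1u << 20)
    #define COPYSZ 64u       /* small copy, hot in L1d */

    static unsigned char srcbuf[256], dstbuf[256];

    int main(void)
    {
        uint64_t x = 1;

        uint64_t t0 = __rdtsc();
        for (unsigned i = 0; i < ITERS; i++) {
            void *d = dstbuf;
            const void *s = srcbuf;
            size_t n = COPYSZ;
            /* rep movsb with no data dependency on the imul chain below */
            asm volatile("rep movsb" : "+D"(d), "+S"(s), "+c"(n) : : "memory");

            /* independent loop-carried imul chain, ~3 cycles per imul.
             * If out-of-order exec can overlap rep movsb with this chain,
             * time per iteration stays close to the chain latency alone. */
            asm("imul $3, %0, %0\n\t"
                "imul $3, %0, %0\n\t"
                "imul $3, %0, %0\n\t"
                "imul $3, %0, %0\n\t"
                "imul $3, %0, %0\n\t"
                "imul $3, %0, %0\n\t"
                "imul $3, %0, %0\n\t"
                "imul $3, %0, %0"
                : "+r"(x));
        }
        uint64_t t1 = __rdtsc();

        printf("~%.1f ref cycles per iteration (x=%llu)\n",
               (double)(t1 - t0) / ITERS, (unsigned long long)x);
        return 0;
    }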

Illegitimacy answered 24/11, 2015 at 23:34 Comment(10)
@krazyglew: I reposted your fast-string comments from that other thread here, as an answer. If you want the nerd points from this, feel free to copy the formatting and post it as your own answer.Illegitimacy
Nice answer and nice quote. I was lastly wondering why it would then incur such a cost. I thought that, given possibly overlapping and unaligned data, to get into one of the ideal states the CPU must do movsb-s until such conditions are met. Of course this is inherently slow, also considering the lack of branch prediction. Anyways, I believe at a certain point it stops increasing for larger and larger values of %*cxTori
@black: yeah, for large counts the startup time is still only in the range of 40 cycles, so the total time is dominated by the copying time. However, misaligned pointers can slow down the whole copy (i.e. a multiplier onto the time, rather than just adding startup overhead). IDK about closely-overlapping. That might fall back to byte-at-a-time operation that's slow for the whole copy. It's rare enough for correct code to want to do a copy that overlaps so tightly that it degenerates into a stos (maybe with some weird width), so they prob. just let that be slow instead of having a branch.Illegitimacy
@black I think you need to read the document you quoted more carefully. The startup cost is fixed at 15, 20, 35, or 40 depending on the value of ECX and whether the transfer splits cache lines. It doesn't increase further for larger and larger values of ECX. Also the text you quoted specifically doesn't apply to REP MOVSB. That case is handled in the previous paragraph that you didn't quote.Mccaskill
I have to say I'm pleased to see that this question still raises interests even (nearly) 5 years after posting it. Thanks, Peter.Tori
@PeterCordes did this 96-byte burst outperform 3 movdqa with a ymm register? (doubtful, just curious if you remember)Pyrology
@Noah: No. The bottleneck was the loop-carried latency of the imul dependency chain, which was intentionally long enough to hide rep movsb startup if OoO exec was able to see past a rep movsb. That's what I was testing, that the front-end could start issuing uops from after the rep movsb again without waiting for its latency. If I removed that, then of course rep movsb startup would be much worse than 3x ymm load + store, total of 6 uops for the front-end.Illegitimacy
You know what, I think the no-RFO cache protocol he's talking about is called ItoM. There isn't an RFO because it's assumed that the whole line will be filled, therefore it only needs an I to E transition, and doesn't need to read in the current state of the line from another core/LLC/DRAM. It only needs permission to write the line, because it's all going to be overwritten. A typical S->E RFO also doesn't need to read the line, so I think by no-RFO he means it's not called an RFO, despite having the same functionality as an S->E RFO, just in the case of I->E, where it is different to an RFOUnderlie
where it is different to a regular I->E RFO that would read in the line, because in this protocol, the intention is to write the full cache line (as it's a big memcpy). ItoM signifies that only permission of the transition of the state is needed, and the 'M' instead of 'E' might imply that it is written to immediately, i.e. the state immediately becomes M, as opposed to it being a prefetch RFOUnderlie
@LewisKelsey: Yes, it still needs ownership, but saves DRAM bandwidth (and possibly row-open commands) on bringing in the data. Or with data hot in L3 or another private cache, saves write-back from the other core, and probably saves a cycle on the ring bus (32 bytes wide in modern CPUs so a full line would take 2 cycles instead of maybe 1 for just the ownership message). I forget how wide the links between sockets are in various systems, but probably narrower than 256 bits, so even more possible benefit there.Illegitimacy

The quote that you have given only applies to the Nehalem microarchitecture (Intel Core i5, i7 and Xeon processors released in 2009 and 2010), and Intel is explicit about it.

Before Nehalem, REP MOVSB was even slower. Intel is silent on what happened in subsequent microarchitectures, but then, with the Ivy Bridge microarchitecture (processors released in 2012 and 2013), Intel introduced Enhanced REP MOVSB (we still need to check the corresponding CPUID bit), which allows us to copy memory fast.

The cheapest versions of later processors - the Kaby Lake "Celeron" and "Pentium" released in 2017 - don't have AVX, which could otherwise have been used for fast memory copy, but they still have Enhanced REP MOVSB. That's why REP MOVSB is very beneficial on processors released since 2013.

Surprisingly, Nehalem processors had a quite fast REP MOVSD/MOVSQ implementation (but not REP MOVSW/MOVSB) for very large blocks - just 4 cycles to copy each subsequent 64 bytes of data (if the data is aligned to cache-line boundaries) after we've paid a startup cost of 40 cycles - which is excellent when we copy 256 bytes or more, and we don't need to use XMM registers!

Thus, on Nehalem microarchitecture, REP MOVSB/MOVSW is almost useless, but REP MOVSD/MOVSQ is excellent when we need to copy more than 256 bytes of data and the data is aligned to cache line boundaries.

On previous Intel microarchitectures (before 2008) the startup costs are even higher. Intel x86 processors have had "fast strings" since the Pentium Pro (P6) in 1996. The P6 fast strings took REP MOVSB and larger, and implemented them with 64 bit microcode loads and stores and a non-RFO (Read For Ownership) cache protocol. They did not violate memory ordering, unlike ERMSB in Ivy Bridge.

The Ice Lake microarchitecture, launched in September 2019, introduced Fast Short REP MOV (FSRM). This feature can be tested via a CPUID bit. It was intended to make strings of 128 bytes and less also quick, but, in fact, strings below 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented under 64-bit, not under 32-bit. At least on my i7-1065G7 CPU, rep movsb is only quick for small strings under 64-bit; under 32-bit, strings have to be at least 4 KB for rep movsb to start outperforming other methods.
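
(For reference, both features are reported by CPUID leaf 7, subleaf 0: ERMSB as EBX bit 9 (ERMS) and Fast Short REP MOV as EDX bit 4 (FSRM). A small check using GCC/Clang's <cpuid.h> could look like the sketch below; the bit positions are as documented by Intel, but verify against the current SDM.)

    #include <stdio.h>
    #include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;

        /* CPUID leaf 7, subleaf 0: structured extended feature flags */
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 7 not supported");
            return 1;
        }

        int erms = (ebx >> 9) & 1;   /* Enhanced REP MOVSB/STOSB */
        int fsrm = (edx >> 4) & 1;   /* Fast Short REP MOV */

        printf("ERMS: %d, FSRM: %d\n", erms, fsrm);
        return 0;
    }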

Here are test results for REP MOVS* when the source and destination were in the L1 cache, using blocks large enough not to be seriously affected by startup costs, but not so large as to exceed the L1 cache size (B/c = bytes per cycle). Source: http://users.atw.hu/instlatx64/

Yonah (2006-2008)

    REP MOVSB 10.91 B/c
    REP MOVSW 10.85 B/c
    REP MOVSD 11.05 B/c

Nehalem (2009-2010)

    REP MOVSB 25.32 B/c
    REP MOVSW 19.72 B/c
    REP MOVSD 27.56 B/c
    REP MOVSQ 27.54 B/c

Westmere (2010-2011)

    REP MOVSB 21.14 B/c
    REP MOVSW 19.11 B/c
    REP MOVSD 24.27 B/c

Ivy Bridge (2012-2013) - with Enhanced REP MOVSB

    REP MOVSB 28.72 B/c
    REP MOVSW 19.40 B/c
    REP MOVSD 27.96 B/c
    REP MOVSQ 27.89 B/c

SkyLake (2015-2016) - with Enhanced REP MOVSB

    REP MOVSB 57.59 B/c
    REP MOVSW 58.20 B/c
    REP MOVSD 58.10 B/c
    REP MOVSQ 57.59 B/c

Kaby Lake (2016-2017) - with Enhanced REP MOVSB

    REP MOVSB 58.00 B/c
    REP MOVSW 57.69 B/c
    REP MOVSD 58.00 B/c
    REP MOVSQ 57.89 B/c

As you see, the implementation of REP MOVS differs significantly from one microarchitecture to another.

According to Intel, on Nehalem, REP MOVSB startup costs for strings larger than 9 bytes are 50 cycles, while for REP MOVSW/MOVSD/MOVSQ they are 35 to 40 cycles - so REP MOVSB has the larger startup cost; still, tests have shown that overall performance is worst for REP MOVSW, not REP MOVSB, on Nehalem and Westmere.

On Ivy Bridge, SkyLake and Kaby Lake, the results are the opposite for these instructions: REP MOVSB is faster than REP MOVSW/MOVSD/MOVSQ, albeit just slightly. On Ivy Bridge REP MOVSW is still a laggard, but on SkyLake and Kaby Lake REP MOVSW isn't worse than REP MOVSD/MOVSQ.

Please note that I have presented test results for both SkyLake and Kaby Lake, taken from the instlatx64 site, just for the sake of confirmation - these architectures have the same cycle-per-instruction data.

Conclusion: you may use REP MOVSD/MOVSQ for very large memory blocks, since it produces sufficient results on all Intel microarchitectures from Yonah to Kaby Lake. On Yonah and earlier, SSE copy may produce better results than REP MOVSD, but, for the sake of universality, REP MOVSD is preferred. Besides that, REP MOVS* may internally use ways of working with the cache that are not available to normal instructions.

As for REP MOVSB for very small strings (fewer than 9 bytes, or fewer than 4 bytes) - I would not even recommend it. On Kaby Lake, a single MOVSB even without REP takes 4 cycles; on Yonah it takes 5 cycles. Depending on context, you can do better with just normal MOVs.

The startup cost does not increase with size, as you have written. What increases is the latency of the overall instruction to complete the whole sequence of bytes - which is quite obvious: the more bytes you need to copy, the more cycles it takes, i.e. the overall latency, not just the startup cost. Intel did not disclose the startup cost for small strings; it only specified it for strings of 76 bytes and more, for Nehalem. For example, take this data about Nehalem:

  • The latency of REP MOVSB is 9 cycles if ECX < 4. So it takes exactly 9 cycles to copy any string of 1, 2 or 3 bytes. This is not that bad - for example, if you need to copy a tail and you don't want to use overlapping stores. Just 9 cycles to determine the size (between 1 and 3) and actually copy the data - it is hard to achieve this with normal instructions and all the branching they need - and for a 3-byte copy, if you didn't copy previous data, you will have to use 2 loads and 2 stores (word+byte), and since we have at most one store unit, we won't do much better with normal MOV instructions (see the sketch after this list).
  • Intel is silent on what latency REP MOVSB has if ECX is between 4 and 9.
  • Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles to copy the whole string - not just a startup cost of 20 cycles. So it takes about 20 cycles to copy a whole string of <= 12 bytes; thus we get a higher per-byte throughput than with REP MOVSB at ECX < 4.
  • ECX >= 76 with REP MOVSD/MOVSQ - yes, here we DO have a startup cost of 40 cycles, but this is more than reasonable, since we then copy each 64 bytes of data in just 4 cycles. I'm not an Intel engineer authorized to reply WHY there is a startup cost, but I suppose it is because for these strings, REP MOVS* uses (according to Andy Glew's comments on an answer to Why are complicated memcpy/memset superior?, quoted in Peter Cordes' answer) a cache protocol feature that is not available to regular code. That quote offers an explanation: "The large overhead for choosing and setting up the right method is mainly due to the lack of microcode branch prediction". There has also been an interesting note that Pentium Pro (P6) in 1996 implemented REP MOVS* with 64-bit microcode loads and stores and a no-RFO cache protocol - they did not violate memory ordering, unlike ERMSB in Ivy Bridge.
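
(To make the comparison in the first bullet concrete, the "normal instructions and all the branching" version of a 1..3-byte tail copy could look like the sketch below - an illustration only, not code from any library.)

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Copy n bytes, n in {1, 2, 3}: one branch, then a word plus a byte
     * (overlapping when n == 2).  Compare this against a single
     * rep movsb, which Intel quotes at about 9 cycles for ECX < 4. */
    static void copy_tail_1_to_3(unsigned char *dst, const unsigned char *src,
                                 size_t n)
    {
        if (n == 1) {
            dst[0] = src[0];
        } else {                        /* n == 2 or n == 3 */
            uint16_t head;
            memcpy(&head, src, 2);      /* first 2 bytes as one word */
            memcpy(dst, &head, 2);
            dst[n - 1] = src[n - 1];    /* last byte, overlaps if n == 2 */
        }
    }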
Caning answered 15/7, 2017 at 21:53 Comment(0)

This patent shows that the decoder is able to determine whether the last move to rcx was immediate or whether it was modified in a manner such that the value in rcx is unknown at the decoder. It does this by setting a bit upon decoding an immediate mov to rcx - the patent calls this a 'fast string bit' - and storing the immediate value in a register. The bit is cleared when it decodes an instruction that modifies rcx in an unknown manner.

If the bit is set, then it jumps to a position in a separate microcode routine which might have a size of 12 repetitions - it jumps to repetition 7 if rcx = 5, i.e. the immediate value in the register it keeps is 5. This is a fast implementation that doesn't contain microbranches.

If it is not set then, in line with the SGX paper which talks about a 'microcode assist' for larger arrays, it may emit a uop that traps to the slow looping microcode routine at retire, when the value of rcx is known - although this is more of a 'trap' uop that always traps rather than a uop that may result in an 'assist' being required. Alternatively, as the patent suggests ('otherwise, the instruction translator 206 transfers control to the looping REP MOVS microinstruction sequence'), the MSROM could instead execute the slow routine inline and immediately, and just continue issuing repetitions and looping until the branch mispredicts and is finally corrected to not taken, at which point the microcode ends.

I would assume that the micro-branch in the main body of the regular (looping) MSROM procedure would be statically predicted taken by the uop itself (in the opcode), since this is a loop that's going to execute multiple times and mispredict once. This fast method would therefore only eliminate the branch misprediction at the end of the sequence as well as the micro-branch instruction per iteration, which reduces the number of uops. The main bulk of misprediction happens in the setup Peter mentions, which appears to be the setup of P6 'fast strings' (apparently unrelated to the term 'fast string' in the patent, which came after P6), or indeed ERMSB, which I think only happens in the slow (looping) routine mentioned by the patent. In the slow routine, if ecx >= 76, then it can be enhanced and goes through an initial setup process, but seemingly ecx needs to be above a certain size for it to actually be faster with the overhead of the startup process of 'fast strings' or ERMSB. This would entail the value of ecx being known, which is likely just a regular ecx comparison and jump that may mispredict. Apparently this slow routine enhancement also uses a different cache protocol, as discussed.

The microbranch misprediction is costly because it has to flush the whole pipeline, refetch the rep movs instruction and then resume decoding at the mispredicted micro-ip, returning to the MSROM procedure after it may have already finished decoding and other uops were being decoded behind it. The BOB can likely be used with microbranch mispredictions too, where it would be more beneficial than with a macrobranch misprediction. The RAT snapshot is likely associated with the ROB entry of every branch instruction.

Underlie answered 7/2, 2021 at 4:15 Comment(0)

Just from the description it sounds to me that there is an optimal transfer size of 16 bytes, so if you are transferring 79 bytes that is 4*16 + 15. Not knowing more about alignment, that could mean there is a cost for the 15 bytes either up front or at the end (or split across both), and that the four 16-byte transfers are faster than the fractions of 16. Kind of like high gear in your car vs. shifting up through the gears to reach high gear.

Look at an optimized memcpy in glibc or gcc or other places. They transfer up to a few individual bytes, then maybe do 16-bit transfers until they reach an optimally aligned address (32-bit, 64-bit, or 128-bit aligned), then do multi-word transfers for the bulk of the copy, then downshift - maybe one 32-bit transfer, maybe one 16-bit, maybe 1 byte - to cover the lack of alignment at the back end.

Sounds like rep does the same kind of thing: inefficient single transfers to get to an optimized alignment, then large transfers until near the end, then maybe some small individual transfers to cover the last fraction.
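
As a concrete illustration of that shape (a simplified sketch, not glibc's or the microcode's actual implementation; it only distinguishes byte and 8-byte "gears"):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Byte copies until dst is 8-byte aligned, 8-byte chunks for the bulk,
     * then byte copies for whatever is left at the end. */
    static void gear_copy(unsigned char *dst, const unsigned char *src, size_t n)
    {
        /* up-shift: single bytes until dst is 8-byte aligned */
        while (n && ((uintptr_t)dst & 7)) {
            *dst++ = *src++;
            n--;
        }

        /* high gear: bulk of the copy in 8-byte chunks */
        while (n >= 8) {
            uint64_t chunk;
            memcpy(&chunk, src, 8);   /* fixed-size memcpy typically compiles to one load/store */
            memcpy(dst, &chunk, 8);
            dst += 8;
            src += 8;
            n -= 8;
        }

        /* down-shift: remaining tail bytes */
        while (n--)
            *dst++ = *src++;
    }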

Rundle answered 24/11, 2015 at 20:13 Comment(0)
