Enhanced REP MOVSB for memcpy

I would like to use enhanced REP MOVSB (ERMSB) to get a high bandwidth for a custom memcpy.

ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and STOSB operation (ERMSB)" in the Intel optimization manual if you don't know what ERMSB is.

The only way I know to do this directly is with inline assembly. I got the following function from https://groups.google.com/forum/#!topic/gnu.gcc.help/-Bmlm_EG_fE

static inline void *__movsb(void *d, const void *s, size_t n) {
  asm volatile ("rep movsb"
                : "=D" (d),
                  "=S" (s),
                  "=c" (n)
                : "0" (d),
                  "1" (s),
                  "2" (n)
                : "memory");
  return d;
}

When I use this, however, the bandwidth is much lower than with memcpy. __movsb gets 15 GB/s and memcpy gets 26 GB/s on my i7-6700HQ (Skylake) system, Ubuntu 16.10, DDR4@2400 MHz dual channel 32 GB, GCC 6.2.

Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?

Here is the code I used to test this.

//gcc -O3 -march=native -fopenmp foo.c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stddef.h>
#include <omp.h>
#include <x86intrin.h>

static inline void *__movsb(void *d, const void *s, size_t n) {
  asm volatile ("rep movsb"
                : "=D" (d),
                  "=S" (s),
                  "=c" (n)
                : "0" (d),
                  "1" (s),
                  "2" (n)
                : "memory");
  return d;
}

int main(void) {
  int n = 1<<30;

  //char *a = malloc(n), *b = malloc(n);

  char *a = _mm_malloc(n,4096), *b = _mm_malloc(n,4096);
  memset(a,2,n), memset(b,1,n);

  __movsb(b,a,n);
  printf("%d\n", memcmp(b,a,n));

  double dtime;
  
  dtime = -omp_get_wtime();
  for(int i=0; i<10; i++) __movsb(b,a,n);
  dtime += omp_get_wtime();
  printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);

  dtime = -omp_get_wtime();
  for(int i=0; i<10; i++) memcpy(b,a,n);
  dtime += omp_get_wtime();
  printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);  
}
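To make the bandwidth figure explicit: the test above counts both the bytes read and the bytes written, hence the factor 2.0 in the printf. Here is a minimal sketch of the two possible conventions. The helper names copy_gbps and memory_gbps are made up for illustration and are not part of the test program.

```c
#include <stddef.h>

/* Bytes copied per second, in GB/s (1 GB = 1e9 bytes). */
static double copy_gbps(size_t n, int iters, double seconds) {
    return 1e-9 * (double)n * iters / seconds;
}

/* Bytes read plus bytes written per second, in GB/s.
   This is the convention used by the printf calls in the test above. */
static double memory_gbps(size_t n, int iters, double seconds) {
    return 2.0 * copy_gbps(n, iters, seconds);
}
```

With this convention, a reported 26 GB/s corresponds to about 13 GB actually copied per second.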

The reason I am interested in rep movsb is based off these comments

Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs a RFO into LLC, rep movsb does not... rep movsb is significantly faster than movntdqa when streaming to memory on Ivybridge and Haswell (but be aware that pre-Ivybridge it is slow!)

What's missing/sub-optimal in this memcpy implementation?


Here are my results on the same system from tinymembench.

 C copy backwards                                     :   7910.6 MB/s (1.4%)
 C copy backwards (32 byte blocks)                    :   7696.6 MB/s (0.9%)
 C copy backwards (64 byte blocks)                    :   7679.5 MB/s (0.7%)
 C copy                                               :   8811.0 MB/s (1.2%)
 C copy prefetched (32 bytes step)                    :   9328.4 MB/s (0.5%)
 C copy prefetched (64 bytes step)                    :   9355.1 MB/s (0.6%)
 C 2-pass copy                                        :   6474.3 MB/s (1.3%)
 C 2-pass copy prefetched (32 bytes step)             :   7072.9 MB/s (1.2%)
 C 2-pass copy prefetched (64 bytes step)             :   7065.2 MB/s (0.8%)
 C fill                                               :  14426.0 MB/s (1.5%)
 C fill (shuffle within 16 byte blocks)               :  14198.0 MB/s (1.1%)
 C fill (shuffle within 32 byte blocks)               :  14422.0 MB/s (1.7%)
 C fill (shuffle within 64 byte blocks)               :  14178.3 MB/s (1.0%)
 ---
 standard memcpy                                      :  12784.4 MB/s (1.9%)
 standard memset                                      :  30630.3 MB/s (1.1%)
 ---
 MOVSB copy                                           :   8712.0 MB/s (2.0%)
 MOVSD copy                                           :   8712.7 MB/s (1.9%)
 SSE2 copy                                            :   8952.2 MB/s (0.7%)
 SSE2 nontemporal copy                                :  12538.2 MB/s (0.8%)
 SSE2 copy prefetched (32 bytes step)                 :   9553.6 MB/s (0.8%)
 SSE2 copy prefetched (64 bytes step)                 :   9458.5 MB/s (0.5%)
 SSE2 nontemporal copy prefetched (32 bytes step)     :  13103.2 MB/s (0.7%)
 SSE2 nontemporal copy prefetched (64 bytes step)     :  13179.1 MB/s (0.9%)
 SSE2 2-pass copy                                     :   7250.6 MB/s (0.7%)
 SSE2 2-pass copy prefetched (32 bytes step)          :   7437.8 MB/s (0.6%)
 SSE2 2-pass copy prefetched (64 bytes step)          :   7498.2 MB/s (0.9%)
 SSE2 2-pass nontemporal copy                         :   3776.6 MB/s (1.4%)
 SSE2 fill                                            :  14701.3 MB/s (1.6%)
 SSE2 nontemporal fill                                :  34188.3 MB/s (0.8%)

Note that on my system SSE2 copy prefetched is also faster than MOVSB copy.


In my original tests I did not disable turbo. I disabled turbo and tested again and it does not appear to make much of a difference. However, changing the power management does make a big difference.

When I do

sudo cpufreq-set -r -g performance

I sometimes see over 20 GB/s with rep movsb.

with

sudo cpufreq-set -r -g powersave

the best I see is about 17 GB/s. But memcpy does not seem to be sensitive to the power management.


I checked the frequency (using turbostat) with and without SpeedStep enabled, with performance and with powersave for idle, a 1 core load and a 4 core load. I ran Intel's MKL dense matrix multiplication to create a load and set the number of threads using OMP_SET_NUM_THREADS. Here is a table of the results (numbers in GHz).

              SpeedStep     idle      1 core    4 core
powersave     OFF           0.8       2.6       2.6
performance   OFF           2.6       2.6       2.6
powersave     ON            0.8       3.5       3.1
performance   ON            3.5       3.5       3.1

This shows that with powersave even with SpeedStep disabled the CPU still clocks down to the idle frequency of 0.8 GHz. It's only with performance without SpeedStep that the CPU runs at a constant frequency.

I used e.g sudo cpufreq-set -r performance (because cpufreq-set was giving strange results) to change the power settings. This turns turbo back on so I had to disable turbo after.

Pedropedrotti answered 11/4, 2017 at 10:22 Comment(51)
"What can I do to improve it?" ... basically nothing. The memcpy implementation in current version of compiler is very likely as close to the optimal solution, as you can get with any generic function. If you have some special case like always moving exactly 15 bytes/etc, then maybe a custom asm solution may beat the gcc compiler, but if your C source is vocal enough about what is happening (giving compiler good hints about alignment, length, etc), the compiler will very likely produce optimal machine code even for those specialized cases. You can try to improve the compiler output first.Bagatelle
@Ped7g, I don't expect it to be better than memcpy. I expect it to be about as good as memcpy. I used gdb to step through memcpy and I see that it enters a mainloop with rep movsb. So that appears to be what memcpy uses anyway (in some cases).Pedropedrotti
Change the order of the tests. What results do you get then?Boehmenist
@Art, I get the same result (26GB/s for memcpy and 15 GB/s for __movsb).Pedropedrotti
Fair enough. When in doubt, suspect the benchmark. But that doesn't seem to be the problem here. For what it's worth it's faster on an Ivy Bridge machine I had accessible (both when run first and second).Boehmenist
@Art, that's interesting! I wonder why that is on your IVB system. Yeah, benchmarking is a pain. I recently answered a question which I had to edit several times due to benchmarking problems that I did not expect.Pedropedrotti
@Boehmenist maybe enhanced rep movsb is not so enhanced on Skylake (my system)? Still I don't understand why you had to change the order.Pedropedrotti
@Zboson The order didn't matter for me either. The "Change the order" comment was before I found a machine with the right CPU. It's 50% faster too, which is quite significant. On the other hand, on another machine with a newer CPU the performance is reversed.Boehmenist
@Art, what function was 50% faster and on what machine?Pedropedrotti
@Zboson the movsb function was 50% faster on an Ivy Bridge machine.Boehmenist
@Zboson: No, I haven't heard of it I'm afraid. Is the term defined in the Intel instruction manual?Yingyingkow
@KerrekSB, yes, it's in section "3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB)".Pedropedrotti
Interesting. Did you check with cpuid that the feature is available on your CPU?Yingyingkow
The optimization manual suggests that ERMSB is better at providing small code size and at throughput than traditional REP-MOV/STO, but "implementing memcpy using ERMSB might not reach the same level of throughput as using 256-bit or 128-bit AVX alternatives, depending on length and alignment factors." The way I understand this is that it's enhanced for situations where you might previously already have used rep instructions, but it does not aim to compete with modern high-throughput alternatives.Yingyingkow
@KerrekSB, no. I assume that processors since Ivy Bridge have it. less /proc/cpuinfo | grep erms shows erms.Pedropedrotti
@Zboson: Yeah, same thing, that's good enough.Yingyingkow
@KerrekSB, yeah, I read that statement but was confused by it. I am basing everything off of the comment " rep movsb is significantly faster than movntdqa when streaming to memory on Ivybridge and Haswell (but be aware that pre-Ivybridge it is slow!)" (see the update at the end of my question).Pedropedrotti
How about stepping through the machine code in a debugger and checking whether your memcpy actually uses movntdqa? It seems plausible that it would use SSE or AVX instructions instead. I have a feeling that ERMSB is meant to be better than some things, not better than everything.Yingyingkow
I have used gdb to study memcpy. For a size defined at run time it used non temporal stores and some prefetching. For the same size (1GB) defined at compile time it used rep movsb. I only looked at it once so it's possible I misinterpreted something. My own implementation using movntdqa does about as well as memcpy.Pedropedrotti
@Zboson: You need a better microbenchmark/timing test. Your source and destination are both AVX vector aligned, and that will affect how the compiler will implement the memcpy(). I've done similar tests using pregenerated pseudo-random source-target-length tuples, with different alignment situations tested separately, to better mimic real world use cases. But, my real-world code usually memcpys cold data, and timing cache-cold behaviour is hard. Perhaps consider timing some real-world memcpy()/memmove()-heavy task?Inductive
@NominalAnimal, that's an interesting point. You mean that ERMSB is useful in less ideal situations e.g. where destination and source are not aligned. I thought alignment was critical to ERMSB? In any case, if you can demonstrate where ERMSB is useful then that would be a good answer. Show me a better microbenchmark/timing test.Pedropedrotti
@NominalAnimal Intel's Optimization Manual, Table 3-4 claims that when both source and destination are at least 16B-aligned and the transfer size is 128-4096 bytes, ERMSB meets or exceeds Intel's own AVX-based memcpy(). Although no one knows what this memcpy() is, you can plausibly assume Intel would know how to get >50% of maximum bandwidth on their own chip.Unhandy
@IwillnotexistIdonotexist, you don't need to compare to memcpy. You could compare ERMSB to a SSE/AVX solution or better to a solution with non-temporal stores. That's what I would do in this case: use non-temporal stores. But this comment and the comment that followed said even in the 1GB case ERMSB should win. Shouldn't the non-temporal stores prevent the prefetchers from reading the destination? I thought that was the point in using them.Pedropedrotti
@Zboson My glibc's memcpy() uses AVX NT stores. And both NT stores and ERMSB behave in a write-combining fashion, and thus should not require RFO's. Nevertheless, my benchmarks on my own machine show that my memcpy() and my ERMSB both cap out at 2/3rds of total bandwidth, like your memcpy() (but not your ERMSB) did. Therefore, there is clearly an extra bus transaction somewhere, and it stinks a lot like an RFO.Unhandy
@IwillnotexistIdonotexist your 2/3 observation is very interesting. I think I can get better than 2/3 using two threads. I'm not sure why my ERMSB on Skylake performs worse than your ERMSB on Haswell.Pedropedrotti
@IwillnotexistIdonotexist, did you see any benefit for more than 2 threads? I have not looked at performance counters before. That's a weakness I need to fix. What tools do you use for this? Agner Fog has a tool for this but it was a bit complicated. I should look into that again. What about perf? If you answer the question please share the details.Pedropedrotti
@Zboson At 2 threads it's about 21GB/s, 4+ it saturates at 23GB/s. I examine performance counters using some homebrew software I wrote: libpfc. It's nasty, far more limited than ocperf.py, only known to work on my own machine, only works properly for benching single-threaded code, but because I can easily (re)program the counters and access the timings from within the program, and I can tightly sandwich the code to be benchmarked, it suits my needs. Some day I'll have the time to fix its myriad issues.Unhandy
In case anyone cares here is a simpler inline assembly solution static void __movsb(void* dst, const void* src, size_t size) { __asm__ __volatile__("rep movsb" : "+D"(dst), "+S"(src), "+c"(size) : : "memory"); } which I found here hero.handmade.network/forums/code-discussion/t/…Pedropedrotti
It seems like a reasonable answer could be that it was faster, or at least as fast in IvB (per some results referenced here), but that the associated micro-code doesn't necessarily get love in each generation and so it becomes slower than the explicit code, which always uses the core functionality of the CPU that is guaranteed to be in tune, play nice with prefetching, etc. For example see Andy Glew's comment here:Kimono
The big weakness of doing fast strings in microcode was ... and (b) the microcode fell out of tune with every generation, getting slower and slower until somebody got around to fixing it. - Andy GlewKimono
It is also interesting to note that fast string performance is actually very relevant in, for example, Linux kernel methods like read() and write() which copy data into user-space: the kernel can't (doesn't) use any SIMD registers or SIMD code, so for a fast memcpy it either has to use 64-bit load/stores, or, more recently, it will use rep movsb or rep movsd if they are detected to be fast on the architecture. So they get a lot of the benefit of large moves without explicitly needing to use xmm or ymm regs.Kimono
Out of curiosity, are you calculating your bandwidth figures as 2 times the size of the memcpy length or as 1 times? I.e., is your figure a "memory bandwidth" figure or a "memcpy bandwidth" figure? Of course it doesn't change the relative performance between the techniques, but it helps me compare with my system.Kimono
@Kimono I am using 2 times the size of the memcpy length i.e. the memory bandwidth. Since you have the same processor as me did you test my code in my question on it? If so did you get the same result? You have to compile with -mavx due to this bug. Try the exact compiler options I used gcc -O3 -march=native -fopenmp foo.c.Pedropedrotti
@Zboson - interesting - then your numbers look consistent with my box for the NT memcpy (about 13 GB copied, aka 26 GB/s BW), but not for the rep movsb where I see more than 20 GB/s BW, but you report only 15. I will try your code. BTW, I assume you disabled turbo for your tests (which is why you report 2.6 GHz?). I did, although I should have mentioned it explicitly in my answer.Kimono
@Zboson - I get roughly 19.5 GB/s and 23.5 GB/s for rep movsb and memcpy respectively with your code. Very oddly inconsistent with your results, since we have the same CPU. There are all sorts of interesting stuff like "memory efficient turbo" that can play heavily here - let me play a bit. That's with turbo off. With turbo on I get roughly 20 vs 25. Turbo seems to help the memcpy version more than the rep movs version.Kimono
I get results closer to yours if I change to the powersave governor: about 17.5 GB/s vs 23.5 GB/s. I.e., the rep movsb perf drops but the memcpy doesn't. Indeed, repeated measurements show that with the powersave governor, my CPU only runs at about 2.3 GHz for the movs benchmark, but at 2.6 GHz for the memcpy one. So a significant part of the delta in your case is probably explained by power management. Basically power-efficient turbo (hereafter, PET) uses a heuristic to determine if the code is "memory stall bound" and ramps down the CPU since a high frequency is "pointless".Kimono
So rep movs gets unfavorable treatment (performance wise, perhaps it saves power, however!) from PET heuristic, perhaps because the heuristic sees it has a long stall on one instruction, while the highly unrolled AVX version is still executing lots of instructions. I have seen this before while testing some algorithm across a range of parameter values: at some value there is a much larger than expected drop in performance: but what happens is that suddenly the PET threshold was reached and the CPU ramped down (which still hurts performance).Kimono
@Kimono I did not disable Turbo in my tests. Why would that matter for memory bandwidth bound operations? Anyway, I just disabled it (I verified that it was disabled as well by running a custom frequency measuring tool) and it does not appear to make much of a difference. But changing the power management does make a difference. With performance rep movsb goes as high as 20 GB/s but with powersave it gets max 17 GB/s. I added this info to the end of my question.Pedropedrotti
@Kimono how does PET matter when turbo is disabled? I went into the Bios and disabled SpeedStep. Shouldn't that run the CPU at a flat frequency? Why would powersave or performance matter in this case if the CPU is running at a constant frequency?Pedropedrotti
Well PET is probably a misnomer since apparently it doesn't just affect frequencies above nominal, but rather the whole DVFS range. That makes sense - it isn't like nominal freq is particularly special: if it makes sense to reduce to 2.6 GHz, it may also make sense to reduce to 2.3 or 1.0 or whatever. Turning off SpeedStep will probably work, but it's easy to verify, just run grep MHz /proc/cpuinfo a few times and observe the values, or fire up turbostat. I ran the benchmark like perf stat ./a.out to make my observation: it tells you the effective GHz for the process.Kimono
If your CPU is locked, powersave and performance perhaps shouldn't matter (there is still the un-discussed matter of uncore frequencies, which are independent, but no off-the-shelf tool reports them, as far as I know). Furthermore there may be other power saving aspects not directly related to frequency that is controlled by that setting (e.g., the aggressiveness of moving to higher C-states?).Kimono
About turbo, it can make a significant difference for memory related things, since it affects the uncore performance and so impacts latency since much of the latency of a memory access is uncore work, which is sped up by turbo (but this is also complex due to the interaction between the power-saving heuristics, and the fact that the uncore and core frequencies are partly independent). Since our chips seem to hit the true DRAM BW limit (i.e., not a concurrency-occupancy limit per the discussion in "latency bound platforms" below), it may not apply and I don't see much effect on my CPU. @ZbosonKimono
@Boehmenist do you think you could run tinybenchmark (see BeeOnRope's answer) on your Ivy Bridge system and add the results to the end of my question?Pedropedrotti
@BeeOnRope. I checked the frequency. With powersave the CPU still idles at 0.8 GHz even with SpeedStep disabled. It's only with performance that the CPU is locked at 2.6 GHz with SpeedStep disabled. See the update at the end of my question.Pedropedrotti
@Zboson - right, I recall something similar: the intel_pstate driver will still use P-states to control frequency even if SS is off in the BIOS. You can also use intel_pstate=disable as a boot parameter to disable it completely, allowing you to use the default power management, including the "user" governor that sets the frequency at whatever you want (no turbo freqs tho). Interesting trivia: without intel_pstate, my chip would never run above 3.4 GHz (i.e., the last 100 MHz of turbo were inaccessible). With intel_pstate, no problem.Kimono
BTW, there is a whole interesting rathole to descend with this power saving stuff: e.g., running two benchmarks side-by-side can result in more than 2x total throughput (i.e., "superlinear scaling" with more threads, which is really weird) because one benchmark keeps core or uncore frequency high which helps the other one, but perhaps it deserves a whole separate question. I think powersaving is one part of somewhat poorer rep movsb performance, but not the whole story (even at equal MHz it's slower).Kimono
BTW, I measured the power use of rep movsb (in powersave at the lower freq) versus memcpy, but the power (i.e., watts) was only slightly less, and total energy consumed was higher (since it runs longer). So there is no power-saving benefit...Kimono
@Kimono Clearly there should be a chatroom RFB-x86 (Request For Benchmarks - x86) for the sole purpose of reverse-engineering the factors driving x86 processor performance.Unhandy
@Zboson - I am using commands like sudo cpupower -c 0,1,2,3 frequency-set -g performance - based on my understanding cpupower is the most up-to-date and maintained of the commands for power management (it can also do things like adjust the "perf bias" on recent Intel chips). Using that command, switching to performance doesn't seem to affect turbo. I use this script to enable/disable turbo, although it seems perhaps /sys/devices/system/cpu/intel_pstate/no_turbo is simpler if you are using intel_pstate.Kimono
@Kimono re: 'Basically power-efficient turbo (hereafter, PET) uses a heuristic to determine if the code is "memory stall bound" and ramps down the CPU since a high frequency is "pointless"'. Any further reading on this? Not finding any of the keywords in Optimization Manual.Retsina
@Retsina - it was something I discovered here on SO while answering a question about why rep movsb was slower than explicit copy/store instructions: this effect explained some of the gap. I'm not aware of any discussion of it outside SO: you could search for that question and link it if you find it. I wasn't able to find it but didn't spend much time on it and the SO search returns suspiciously few results.Kimono

This is a topic pretty near to my heart and recent investigations, so I'll look at it from a few angles: history, some technical notes (mostly academic), test results on my box, and finally an attempt to answer your actual question of when and where rep movsb might make sense.

Partly, this is a call to share results - if you can run Tinymembench and share the results along with details of your CPU and RAM configuration it would be great. Especially if you have a 4-channel setup, an Ivy Bridge box, a server box, etc.

History and Official Advice

The performance history of the fast string copy instructions has been a bit of a stair-step affair - i.e., periods of stagnant performance alternating with big upgrades that brought them into line with, or even made them faster than, competing approaches. For example, there was a jump in performance in Nehalem (mostly targeting startup overheads) and again in Ivy Bridge (mostly targeting total throughput for large copies). You can find decade-old insight on the difficulties of implementing the rep movs instructions from an Intel engineer in this thread.

For example, in guides preceding the introduction of Ivy Bridge, the typical advice is to avoid them or use them very carefully [1].

The current (well, June 2016) guide has a variety of confusing and somewhat inconsistent advice, such as [2]:

The specific variant of the implementation is chosen at execution time based on data layout, alignment and the counter (ECX) value. For example, MOVSB/STOSB with the REP prefix should be used with counter value less than or equal to three for best performance.

So for copies of 3 or fewer bytes? You don't need a rep prefix for that in the first place, since with a claimed startup latency of ~9 cycles you are almost certainly better off with a simple DWORD or QWORD mov with a bit of bit-twiddling to mask off the unused bytes (or perhaps with two explicit moves, one byte and one word, if you know the size is exactly three).
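As a rough illustration of how little code such a tiny copy needs without any rep prefix, here is a sketch that uses three plain byte moves whose indices overlap for the shorter sizes. This is a different trick from the masked DWORD move mentioned above, and copy_1_to_3 is a name invented for this example.

```c
#include <stddef.h>

/* Copies n bytes for 1 <= n <= 3 using three byte moves and no loop.
   n = 1 writes index 0 three times; n = 2 writes 0, 1, 1; n = 3 writes 0, 1, 2.
   The regions must not overlap, as with memcpy. */
static void copy_1_to_3(unsigned char *dst, const unsigned char *src, size_t n) {
    dst[0]      = src[0];
    dst[n >> 1] = src[n >> 1];
    dst[n - 1]  = src[n - 1];
}
```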

They go on to say:

String MOVE/STORE instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of double words plus single byte moves with a count value less than or equal to 3.

This certainly seems wrong on current hardware with ERMSB, where rep movsb is at least as fast as, or faster than, the rep movsd or rep movsq variants for large copies.

In general, that section (3.7.5) of the current guide contains a mix of reasonable and badly obsolete advice. This is common throughout the Intel manuals, since they are updated in an incremental fashion for each architecture (and purport to cover nearly two decades worth of architectures even in the current manual), and old sections are often not updated to replace or make conditional advice that doesn't apply to the current architecture.
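For reference, the decomposition the guide describes (double words first, then at most 3 trailing bytes) could be sketched as below. This assumes GCC or Clang on x86 with the direction flag clear, as the usual ABIs guarantee; `rep movsl` is the AT&T spelling of `rep movsd`, and copy_dwords_then_bytes is a hypothetical name. It is shown only to make the quoted advice concrete, not as a recommendation.

```c
#include <stddef.h>

/* Copies n bytes as n/4 double words followed by n%4 single bytes. */
static void copy_dwords_then_bytes(void *dst, const void *src, size_t n) {
    size_t dwords = n >> 2;   /* number of 4-byte moves */
    size_t tail   = n & 3;    /* 0..3 remaining bytes   */

    /* rep movsl advances dst/src by 4 * dwords and leaves the count at 0. */
    __asm__ __volatile__("rep movsl"
                         : "+D"(dst), "+S"(src), "+c"(dwords)
                         :
                         : "memory");
    /* rep movsb then copies the tail from where the first copy stopped. */
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(tail)
                         :
                         : "memory");
}
```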

They then go on to cover ERMSB explicitly in section 3.7.6.

I won't go over the remaining advice exhaustively, but I'll summarize the good parts in the "why use it" below.

Another important claim from the guide is that on Haswell, rep movsb has been enhanced to use 256-bit operations internally.

Technical Considerations

This is just a quick summary of the underlying advantages and disadvantages that the rep instructions have from an implementation standpoint.

Advantages for rep movs

  1. When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:
  • Avoiding the RFO request when it knows the entire cache line will be overwritten.
  • Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
  2. Apparently, there is no guarantee of ordering among the stores within [3] a single rep movs, which can help simplify coherency traffic and simplify other aspects of the block move, versus simple mov instructions, which have to obey rather strict memory ordering [4].

  3. In principle, the rep movs instruction could take advantage of various architectural tricks that aren't exposed in the ISA. For example, architectures may have wider internal data paths than the ISA exposes [5], and rep movs could use those internally.

Disadvantages

  1. rep movsb must implement a specific semantic which may be stronger than the underlying software requirement. In particular, memcpy forbids overlapping regions, and so may ignore that possibility, but rep movsb allows them and must produce the expected result. On current implementations this mostly affects startup overhead, but probably not large-block throughput. Similarly, rep movsb must support byte-granular copies even if you are actually using it to copy large blocks which are a multiple of some large power of 2.

  2. The software may have information about alignment, copy size and possible aliasing that cannot be communicated to the hardware if using rep movsb. Compilers can often determine the alignment of memory blocks [6] and so can avoid much of the startup work that rep movs must do on every invocation.
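Both disadvantages can be illustrated with a small sketch, assuming GCC or Clang on x86. The first helper relies on the byte-by-byte forward semantics that rep movsb has to honour for overlapping regions (something memcpy is free to ignore); the second passes alignment knowledge to the compiler through `__builtin_assume_aligned`, which has no equivalent when calling rep movsb directly. The names rep_movsb, smear_first_byte and copy_aligned32 are invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Thin wrapper around rep movsb (GCC/Clang inline asm, x86 only). */
static void rep_movsb(void *dst, const void *src, size_t n) {
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
}

/* Disadvantage 1: overlapping regions are legal for rep movsb.
   Copying forward from buf to buf + 1 must behave as if done one byte
   at a time, so every byte ends up equal to buf[0]. */
static void smear_first_byte(char *buf, size_t len) {
    if (len > 1)
        rep_movsb(buf + 1, buf, len - 1);
}

/* Disadvantage 2: software can tell the compiler what it knows.
   The caller promises that both pointers are 32-byte aligned, which the
   compiler may use when it expands or dispatches the memcpy. */
static void copy_aligned32(void *dst, const void *src, size_t n) {
    void *d = __builtin_assume_aligned(dst, 32);
    const void *s = __builtin_assume_aligned(src, 32);
    memcpy(d, s, n);
}
```

Whether the compiler actually emits different code for copy_aligned32 depends on the compiler version and flags, so treat it as a hint rather than a guarantee.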

Test Results

Here are test results for many different copy methods from tinymembench on my i7-6700HQ at 2.6 GHz (too bad I have the identical CPU so we aren't getting a new data point...):

 C copy backwards                                     :   8284.8 MB/s (0.3%)
 C copy backwards (32 byte blocks)                    :   8273.9 MB/s (0.4%)
 C copy backwards (64 byte blocks)                    :   8321.9 MB/s (0.8%)
 C copy                                               :   8863.1 MB/s (0.3%)
 C copy prefetched (32 bytes step)                    :   8900.8 MB/s (0.3%)
 C copy prefetched (64 bytes step)                    :   8817.5 MB/s (0.5%)
 C 2-pass copy                                        :   6492.3 MB/s (0.3%)
 C 2-pass copy prefetched (32 bytes step)             :   6516.0 MB/s (2.4%)
 C 2-pass copy prefetched (64 bytes step)             :   6520.5 MB/s (1.2%)
 ---
 standard memcpy                                      :  12169.8 MB/s (3.4%)
 standard memset                                      :  23479.9 MB/s (4.2%)
 ---
 MOVSB copy                                           :  10197.7 MB/s (1.6%)
 MOVSD copy                                           :  10177.6 MB/s (1.6%)
 SSE2 copy                                            :   8973.3 MB/s (2.5%)
 SSE2 nontemporal copy                                :  12924.0 MB/s (1.7%)
 SSE2 copy prefetched (32 bytes step)                 :   9014.2 MB/s (2.7%)
 SSE2 copy prefetched (64 bytes step)                 :   8964.5 MB/s (2.3%)
 SSE2 nontemporal copy prefetched (32 bytes step)     :  11777.2 MB/s (5.6%)
 SSE2 nontemporal copy prefetched (64 bytes step)     :  11826.8 MB/s (3.2%)
 SSE2 2-pass copy                                     :   7529.5 MB/s (1.8%)
 SSE2 2-pass copy prefetched (32 bytes step)          :   7122.5 MB/s (1.0%)
 SSE2 2-pass copy prefetched (64 bytes step)          :   7214.9 MB/s (1.4%)
 SSE2 2-pass nontemporal copy                         :   4987.0 MB/s

Some key takeaways:

  • The rep movs methods are faster than all the other methods which aren't "non-temporal" [7], and considerably faster than the "C" approaches which copy 8 bytes at a time.
  • The "non-temporal" methods are faster than the rep movs ones by up to about 26% - but that's a much smaller delta than the one you reported (26 GB/s vs 15 GB/s = ~73%).
  • If you are not using non-temporal stores, using 8-byte copies from C is pretty much just as good as 128-bit wide SSE load/stores. That's because a good copy loop can generate enough memory pressure to saturate the bandwidth (e.g., 2.6 GHz * 1 store/cycle * 8 bytes = 20.8 GB/s for stores).
  • There are no explicit 256-bit algorithms in tinymembench (except probably the "standard" memcpy) but it probably doesn't matter due to the above note.
  • The increased throughput of the non-temporal store approaches over the temporal ones is about 1.45x, which is very close to the 1.5x you would expect if NT eliminates 1 out of 3 transfers (i.e., 1 read, 1 write for NT vs 2 reads, 1 write). The rep movs approaches lie in the middle.
  • The combination of fairly low memory latency and modest 2-channel bandwidth means this particular chip happens to be able to saturate its memory bandwidth from a single-thread, which changes the behavior dramatically.
  • rep movsd seems to use the same magic as rep movsb on this chip. That's interesting because ERMSB only explicitly targets movsb and earlier tests on earlier archs with ERMSB show movsb performing much faster than movsd. This is mostly academic since movsb is more general than movsd anyway.
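The 1.5x expectation in the list above comes from counting DRAM transfers per copied byte. A small sketch of that arithmetic, using figures from the table above as inputs (the function names are made up for this example):

```c
/* DRAM traffic per copied byte: one read of the source and one write of
   the destination, plus one extra read of the destination line (the RFO)
   when ordinary, temporal stores are used. */
static double dram_bytes_per_copied_byte(int uses_temporal_stores) {
    return uses_temporal_stores ? 3.0 : 2.0;
}

/* Speedup to expect from NT stores if DRAM traffic is the only limit. */
static double expected_nt_speedup(void) {
    return dram_bytes_per_copied_byte(1) / dram_bytes_per_copied_byte(0);
}

/* Speedup actually observed, from two tinymembench throughput figures. */
static double measured_speedup(double nt_mbps, double temporal_mbps) {
    return nt_mbps / temporal_mbps;
}
```

For example, measured_speedup(12924.0, 8973.3) compares "SSE2 nontemporal copy" against "SSE2 copy" from the table and comes out at roughly 1.44, against an expected 1.5.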

Haswell

Looking at the Haswell results kindly provided by iwillnotexist in the comments, we see the same general trends (most relevant results extracted):

 C copy                                               :   6777.8 MB/s (0.4%)
 standard memcpy                                      :  10487.3 MB/s (0.5%)
 MOVSB copy                                           :   9393.9 MB/s (0.2%)
 MOVSD copy                                           :   9155.0 MB/s (1.6%)
 SSE2 copy                                            :   6780.5 MB/s (0.4%)
 SSE2 nontemporal copy                                :  10688.2 MB/s (0.3%)

The rep movsb approach is still slower than the non-temporal memcpy, but only by about 14% here (compared to ~26% in the Skylake test). The advantage of the NT techniques above their temporal cousins is now ~57%, even a bit more than the theoretical benefit of the bandwidth reduction.

When should you use rep movs?

Finally a stab at your actual question: when or why should you use it? It draws on the above and introduces a few new ideas. Unfortunately there is no simple answer: you'll have to trade off various factors, including some which you probably can't even know exactly, such as future developments.

Note that the alternative to rep movsb may be the optimized libc memcpy (including copies inlined by the compiler), or it may be a hand-rolled memcpy version. Some of the benefits below apply only in comparison to one or the other of these alternatives (e.g., "simplicity" helps against a hand-rolled version, but not against built-in memcpy), but some apply to both.

Restrictions on available instructions

In some environments there is a restriction on certain instructions or using certain registers. For example, in the Linux kernel, use of SSE/AVX or FP registers is generally disallowed. Therefore most of the optimized memcpy variants cannot be used as they rely on SSE or AVX registers, and a plain 64-bit mov-based copy is used on x86. For these platforms, using rep movsb allows most of the performance of an optimized memcpy without breaking the restriction on SIMD code.

A more general example might be code that has to target many generations of hardware, and which doesn't use hardware-specific dispatching (e.g., using cpuid). Here you might be forced to use only older instruction sets, which rules out any AVX, etc. rep movsb might be a good approach here since it allows "hidden" access to wider loads and stores without using new instructions. If you target pre-ERMSB hardware you'd have to see if rep movsb performance is acceptable there, though...

Future Proofing

A nice aspect of rep movsb is that it can, in theory, take advantage of architectural improvements on future architectures, without source changes, in a way that explicit moves cannot. For example, when 256-bit data paths were introduced, rep movsb was able to take advantage of them (as claimed by Intel) without any changes needed to the software. Software using 128-bit moves (which was optimal prior to Haswell) would have to be modified and recompiled.

So it is both a software maintenance benefit (no need to change source) and a benefit for existing binaries (no need to deploy new binaries to take advantage of the improvement).

How important this is depends on your maintenance model (e.g., how often new binaries are deployed in practice) and on a judgement, very difficult to make, of how fast these instructions are likely to be in the future. At least Intel is kind of guiding uses in this direction though, by committing to at least reasonable performance in the future (15.3.3.6):

REP MOVSB and REP STOSB will continue to perform reasonably well on future processors.

Overlapping with subsequent work

This benefit won't show up in a plain memcpy benchmark of course, which by definition doesn't have subsequent work to overlap, so the magnitude of the benefit would have to be carefully measured in a real-world scenario. Taking maximum advantage might require re-organization of the code surrounding the memcpy.

This benefit is pointed out by Intel in their optimization manual (section 11.16.3.4) and in their words:

When the count is known to be at least a thousand byte or more, using enhanced REP MOVSB/STOSB can provide another advantage to amortize the cost of the non-consuming code. The heuristic can be understood using a value of Cnt = 4096 and memset() as example:

• A 256-bit SIMD implementation of memset() will need to issue/execute retire 128 instances of 32- byte store operation with VMOVDQA, before the non-consuming instruction sequences can make their way to retirement.

• An instance of enhanced REP STOSB with ECX= 4096 is decoded as a long micro-op flow provided by hardware, but retires as one instruction. There are many store_data operation that must complete before the result of memset() can be consumed. Because the completion of store data operation is de-coupled from program-order retirement, a substantial part of the non-consuming code stream can process through the issue/execute and retirement, essentially cost-free if the non-consuming sequence does not compete for store buffer resources.

So Intel is saying that once the uops for rep movsb have issued, but while lots of stores are still in flight and the rep movsb as a whole hasn't retired yet, uops from following instructions can make more progress through the out-of-order machinery than they could if that code came after a copy loop.

The uops from an explicit load and store loop all have to actually retire separately in program order. That has to happen to make room in the ROB for following uops.

There doesn't seem to be much detailed information about how very long microcoded instructions like rep movsb work, exactly. We don't know exactly how micro-code branches request a different stream of uops from the microcode sequencer, or how the uops retire. If the individual uops don't have to retire separately, perhaps the whole instruction only takes up one slot in the ROB?

When the front-end that feeds the OoO machinery sees a rep movsb instruction in the uop cache, it activates the Microcode Sequencer ROM (MS-ROM) to send microcode uops into the queue that feeds the issue/rename stage. It's probably not possible for any other uops to mix in with that and issue/execute8 while rep movsb is still issuing, but subsequent instructions can be fetched/decoded and issue right after the last rep movsb uop does, while some of the copy hasn't executed yet. This is only useful if at least some of your subsequent code doesn't depend on the result of the memcpy (which isn't unusual).

Now, the size of this benefit is limited: at most you can execute N instructions (uops actually) beyond the slow rep movsb instruction, at which point you'll stall, where N is the ROB size. With current ROB sizes of ~200 (192 on Haswell, 224 on Skylake), that's a maximum benefit of ~200 cycles of free work for subsequent code with an IPC of 1. In 200 cycles you can copy somewhere around 800 bytes at 10 GB/s, so for copies of that size you may get free work close to the cost of the copy (in a way making the copy free).

As copy sizes get much larger, however, the relative importance of this diminishes rapidly (e.g., if you are copying 80 KB instead, the free work is only 1% of the copy cost). Still, it is quite interesting for modest-sized copies.
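
To make that bound concrete, here is a small sketch of the arithmetic. The function names are hypothetical, and the 2.5 GHz clock used in the usage note is my assumption (the text above only gives the ROB size, IPC and copy rate); it is picked so that 200 cycles corresponds to roughly the 800 bytes quoted.

```c
/* Bytes the copy moves while the core works through `rob_entries` uops of
 * following code at `ipc` uops per cycle.
 * ghz is cycles per ns, copy_gbs is bytes per ns. */
double overlap_bytes(double rob_entries, double ipc, double ghz, double copy_gbs) {
    double cycles = rob_entries / ipc;   /* cycles of "free" work */
    double ns = cycles / ghz;            /* time those cycles take */
    return ns * copy_gbs;                /* bytes copied in that time */
}

/* Fraction of a copy of `copy_bytes` that can be hidden behind the free
 * work, capped at 1.0 (the copy is at best entirely free). */
double overlap_fraction(double copy_bytes, double rob_entries, double ipc,
                        double ghz, double copy_gbs) {
    double free_bytes = overlap_bytes(rob_entries, ipc, ghz, copy_gbs);
    return free_bytes >= copy_bytes ? 1.0 : free_bytes / copy_bytes;
}
```

With a 200-entry ROB, an IPC of 1, the assumed 2.5 GHz clock and 10 GB/s, overlap_bytes comes out at about 800 bytes, and overlap_fraction for an 80 KB (80000 byte) copy is about 0.01.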

Copy loops don't totally block subsequent instructions from executing, either. Intel does not go into detail on the size of the benefit, or on what kind of copies or surrounding code there is most benefit. (Hot or cold destination or source, high ILP or low ILP high-latency code after).

Code Size

The executed code size (a few bytes) is microscopic compared to a typical optimized memcpy routine. If performance is at all limited by i-cache (including uop cache) misses, the reduced code size might be of benefit.

Again, we can bound the magnitude of this benefit based on the size of the copy. I won't actually work it out numerically, but the intuition is that reducing the dynamic code size by B bytes can save at most C * B cache-misses, for some constant C. Every call to memcpy incurs the cache miss cost (or benefit) once, but the advantage of higher throughput scales with the number of bytes copied. So for large transfers, higher throughput will dominate the cache effects.

Again, this is not something that will show up in a plain benchmark, where the entire loop will no doubt fit in the uop cache. You'll need a real-world, in-place test to evaluate this effect.

Architecture Specific Optimization

You reported that on your hardware, rep movsb was considerably slower than the platform memcpy. However, even here there are reports of the opposite result on earlier hardware (like Ivy Bridge).

That's entirely plausible, since it seems that the string move operations get love periodically - but not every generation, so it may well be faster or at least tied (at which point it may win based on other advantages) on the architectures where it has been brought up to date, only to fall behind in subsequent hardware.

Quoting Andy Glew, who should know a thing or two about this after implementing these on the P6:

the big weakness of doing fast strings in microcode was [...] the microcode fell out of tune with every generation, getting slower and slower until somebody got around to fixing it. Just like a library mem copy falls out of tune. I suppose that it is possible that one of the missed opportunities was to use 128-bit loads and stores when they became available, and so on.

In that case, it can be seen as just another "platform specific" optimization to apply in the typical every-trick-in-the-book memcpy routines you find in standard libraries and JIT compilers: but only for use on architectures where it is better. For JIT or AOT-compiled stuff this is easy. Statically compiled binaries do require platform-specific dispatch, but that often already exists (sometimes implemented at link time), or the mtune argument can be used to make a static decision.
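
As a rough illustration of what such a runtime dispatch could look like with GCC or clang, here is a minimal sketch. The names cpu_has_erms, copy_rep_movsb and copy_dispatch are hypothetical, not from any library. It assumes the ERMS feature flag in CPUID leaf 7 (sub-leaf 0, EBX bit 9), and it makes no attempt to pick size thresholds, which a real memcpy would need.

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Returns 1 if CPUID.(EAX=7,ECX=0):EBX bit 9 (ERMS) is set. */
static int cpu_has_erms(void) {
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_max(0, NULL) < 7)
        return 0;                        /* leaf 7 not available */
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (int)((ebx >> 9) & 1u);
}

/* memcpy-like wrapper around rep movsb. Relies on the direction flag
 * being clear, as the ABI guarantees on function entry. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n) {
    void *ret = dst;                     /* rep movsb advances RDI/RSI */
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
    return ret;
}
#else
/* Non-x86 builds: never take the rep movsb path. */
static int cpu_has_erms(void) { return 0; }
static void *copy_rep_movsb(void *dst, const void *src, size_t n) {
    return memcpy(dst, src, n);
}
#endif

/* Picks rep movsb only where ERMS is advertised, else the libc memcpy.
 * The cached flag is a benign race if first called from several threads. */
void *copy_dispatch(void *dst, const void *src, size_t n) {
    static int use_erms = -1;
    if (use_erms < 0)
        use_erms = cpu_has_erms();
    return use_erms ? copy_rep_movsb(dst, src, n) : memcpy(dst, src, n);
}
```

Whether ERMS being advertised actually means rep movsb wins is exactly the per-architecture question discussed above, so the flag check here is a stand-in for whatever table of "good" microarchitectures you end up with.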

Simplicity

Even on Skylake, where it seems like it has fallen behind the absolute fastest non-temporal techniques, it is still faster than most approaches and is very simple. This means less time in validation, fewer mystery bugs, less time tuning and updating a monster memcpy implementation (or, conversely, less dependency on the whims of the standard library implementors if you rely on that).

Latency Bound Platforms

Memory throughput bound algorithms9 can actually be operating in two main overall regimes: DRAM bandwidth bound or concurrency/latency bound.

The first mode is the one that you are probably familiar with: the DRAM subsystem has a certain theoretical bandwidth that you can calculate pretty easily based on the number of channels, data rate/width and frequency. For example, my DDR4-2133 system with 2 channels has a max bandwidth of 2.133 * 8 * 2 = 34.1 GB/s, same as reported on ARK.
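
That calculation is simple enough to write down as a one-liner. The function name is hypothetical; the 8 bytes per transfer corresponds to a standard 64-bit DDR channel.

```c
/* Theoretical peak DRAM bandwidth in GB/s.
 *   mega_transfers: data rate in MT/s (2133 for DDR4-2133)
 *   bus_bytes:      bytes per transfer per channel (8 for a 64-bit channel)
 *   channels:       number of populated memory channels */
double dram_peak_gbs(double mega_transfers, double bus_bytes, int channels) {
    return mega_transfers * 1e6 * bus_bytes * channels / 1e9;
}
```

dram_peak_gbs(2133, 8, 2) reproduces the ~34.1 GB/s figure; plugging in 2400 gives the corresponding number for the DDR4-2400 system in the question.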

You won't sustain more than that rate from DRAM (and usually somewhat less due to various inefficiencies) added across all cores on the socket (i.e., it is a global limit for single-socket systems).

The other limit is imposed by how many concurrent requests a core can actually issue to the memory subsystem. Imagine if a core could only have 1 request in progress at once, for a 64-byte cache line - when the request completed, you could issue another. Assume also very fast 50ns memory latency. Then despite the large 34.1 GB/s DRAM bandwidth, you'd actually only get 64 bytes / 50 ns = 1.28 GB/s, or less than 4% of the max bandwidth.

In practice, cores can issue more than one request at a time, but not an unlimited number. It is usually understood that there are only 10 line fill buffers per core between the L1 and the rest of the memory hierarchy, and perhaps 16 or so fill buffers between L2 and DRAM. Prefetching competes for the same resources, but at least helps reduce the effective latency. For more details look at any of the great posts Dr. Bandwidth has written on the topic, mostly on the Intel forums.
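
The concurrency limit is just Little's law applied to cache lines in flight. A minimal sketch (hypothetical function name; the 50 ns latency is the deliberately optimistic figure from the example above, and real loaded latencies are usually higher):

```c
/* Bandwidth ceiling in GB/s when a core can keep `outstanding_lines`
 * requests of `line_bytes` bytes in flight and each takes `latency_ns`
 * to complete. Bytes per ns is numerically the same as GB/s. */
double concurrency_bound_gbs(int outstanding_lines, double line_bytes,
                             double latency_ns) {
    return outstanding_lines * line_bytes / latency_ns;
}
```

One outstanding 64-byte line at 50 ns gives the 1.28 GB/s from the thought experiment; ten line fill buffers at the same latency give 12.8 GB/s. Prefetching into L2 complicates this simple picture, so treat it as a rough ceiling rather than a prediction.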

Still, most recent CPUs are limited by this factor, not the RAM bandwidth. Typically they achieve 12 - 20 GB/s per core, while the RAM bandwidth may be 50+ GB/s (on a 4 channel system). Only some recent gen 2-channel "client" cores, which seem to have a better uncore, perhaps more line buffers can hit the DRAM limit on a single core, and our Skylake chips seem to be one of them.

Now of course, there is a reason Intel designs systems with 50 GB/s DRAM bandwidth, while only being able to sustain < 20 GB/s per core due to concurrency limits: the former limit is socket-wide and the latter is per core. So each core on an 8 core system can push 20 GB/s worth of requests, at which point they will be DRAM limited again.

Why am I going on and on about this? Because the best memcpy implementation often depends on which regime you are operating in. Once you are DRAM BW limited (as our chips apparently are, but most aren't on a single core), using non-temporal writes becomes very important since it saves the read-for-ownership that normally wastes 1/3 of your bandwidth. You see that exactly in the test results above: the memcpy implementations that don't use NT stores lose 1/3 of their bandwidth.
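
For reference, an explicit non-temporal copy looks roughly like the sketch below, using SSE2 intrinsics. This is a simplified illustration (copy_nt_sse2 is a hypothetical name), not the tinymembench or glibc code: it assumes a 16-byte aligned destination, falls back to memcpy otherwise, and does no prefetching or unrolling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>

/* Copies n bytes using 16-byte non-temporal stores, which write around the
 * cache and so avoid the read-for-ownership of the destination lines. */
void *copy_nt_sse2(void *dst, const void *src, size_t n) {
    if (((uintptr_t)dst & 15) != 0)
        return memcpy(dst, src, n);      /* movntdq needs an aligned dst */

    char *d = dst;
    const char *s = src;
    size_t vec_bytes = n & ~(size_t)15;  /* largest multiple of 16 <= n */

    for (size_t i = 0; i < vec_bytes; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_stream_si128((__m128i *)(d + i), v);
    }
    _mm_sfence();                        /* order NT stores before later stores */

    memcpy(d + vec_bytes, s + vec_bytes, n - vec_bytes);  /* 0..15 tail bytes */
    return dst;
}
#else
/* No SSE2 available: plain memcpy. */
void *copy_nt_sse2(void *dst, const void *src, size_t n) {
    return memcpy(dst, src, n);
}
#endif
```

A routine like this only makes sense for buffers much larger than the cache; for small copies the destination is usually wanted in cache and the NT path is a loss.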

If you are concurrency limited, however, the situation equalizes and sometimes reverses. You have DRAM bandwidth to spare, so NT stores don't help, and they can even hurt: they may increase latency, because the handoff time for the line buffer may be longer than in a scenario where prefetch brings the RFO line into LLC (or even L2) and the store then completes in LLC for an effectively lower latency. Finally, server uncores tend to have much slower NT stores than client ones (and high bandwidth), which accentuates this effect.

So on other platforms you might find that NT stores are less useful (at least when you care about single-threaded performance) and perhaps rep movsb wins there (if it gets the best of both worlds).

Really, this last item is a call for more testing. I know that NT stores lose their apparent advantage for single-threaded tests on most archs (including current server archs), but I don't know how rep movsb will perform relatively...

References

Other good sources of info not integrated in the above.

comp.arch investigation of rep movsb versus alternatives. Lots of good notes about branch prediction, and an implementation of the approach I've often suggested for small blocks: using overlapping first and/or last read/writes rather than trying to write only exactly the required number of bytes (for example, implementing all copies from 9 to 16 bytes as two 8-byte copies which might overlap in up to 7 bytes).
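
The overlapping trick for small blocks can be sketched like this. It is my own minimal illustration of the idea (copy_9_to_16 is a hypothetical name), not the code from the comp.arch thread. The fixed-size memcpy calls are there to express unaligned 8-byte loads and stores portably; compilers typically turn each into a single mov.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies n bytes, for 8 <= n <= 16, as two 8-byte moves: one covering the
 * first 8 bytes and one covering the last 8. For n < 16 the two moves
 * overlap (by 7 bytes when n == 9), which is harmless because they write
 * the same data. No branch on the exact size is needed. */
void copy_9_to_16(void *dst, const void *src, size_t n) {
    uint64_t head, tail;
    memcpy(&head, src, 8);                          /* first 8 bytes */
    memcpy(&tail, (const char *)src + n - 8, 8);    /* last 8 bytes */
    memcpy(dst, &head, 8);
    memcpy((char *)dst + n - 8, &tail, 8);
}
```

Both loads happen before either store, so the routine also behaves correctly when source and destination overlap. The same pattern extends to 17..32 bytes with 16-byte vectors, and so on.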


1 Presumably the intention is to restrict it to cases where, for example, code-size is very important.

2 See Section 3.7.5: REP Prefix and Data Movement.

3 It is key to note this applies only for the various stores within the single instruction itself: once complete, the block of stores still appear ordered with respect to prior and subsequent stores. So code can see stores from the rep movs out of order with respect to each other but not with respect to prior or subsequent stores (and it's the latter guarantee you usually need). It will only be a problem if you use the end of the copy destination as a synchronization flag, instead of a separate store.

4 Note that non-temporal discrete stores also avoid most of the ordering requirements, although in practice rep movs has even more freedom since there are still some ordering constraints on WC/NT stores.

5 This was common in the latter part of the 32-bit era, where many chips had 64-bit data paths (e.g., to support FPUs which had support for the 64-bit double type). Today, "neutered" chips such as the Pentium or Celeron brands have AVX disabled, but presumably rep movs microcode can still use 256b loads/stores.

6 E.g., due to language alignment rules, alignment attributes or operators, aliasing rules or other information determined at compile time. In the case of alignment, even if the exact alignment can't be determined, they may at least be able to hoist alignment checks out of loops or otherwise eliminate redundant checks.

7 I'm making the assumption that "standard" memcpy is choosing a non-temporal approach, which is highly likely for this size of buffer.

8 That isn't necessarily obvious, since it could be the case that the uop stream that is generated by the rep movsb simply monopolizes dispatch and then it would look very much like the explicit mov case. It seems that it doesn't work like that however - uops from subsequent instructions can mingle with uops from the microcoded rep movsb.

9 I.e., those which can issue a large number of independent memory requests and hence saturate the available DRAM-to-core bandwidth, of which memcpy would be a poster child (and as opposed to purely latency bound loads such as pointer chasing).

Kimono answered 23/4, 2017 at 18:13 Comment(61)
ROB == reorder buffer? You might want to define that acronym on first use, unless you already did at the top and I missed it.Bitterling
tinymembench benchmark on my Haswell i7-4700MQ with DDR3 at 1600 MHz.Unhandy
@CodyGray - indeed, yes. I reworked that sentence and included a link to a definition of ROB.Kimono
@IwillnotexistIdonotexist your results have been added above.Kimono
@BeeOnRope: Here's my results; the file contains the system and compiler info. It has ERMS support, but the results indicate it's not that competitive on this system; explains my difficulties in finding a winning test for it. Also.. would you mind adding a comment to your answer that tinymembench only does 64-bit aligned copies and fills? Although perfectly applicable to the question posed here, it is strictly a subset of the typical use cases in real-world applications.Inductive
@NominalAnimal - thanks! I'll investigate the TMB alignment (since I think it is even more aligned than that) and add that. I was thinking that one place that rep movsb could win is on small, unaligned moves of random sizes, e.g., between 1 and 15 bytes. There the 9 cycle fixed latency is looking pretty good compared to a typical memcpy implementation that would have an unpredictable branch costing ~20 cycles per mispredict in addition to the actual memcpy logic. I haven't tested it though, so I didn't add it yet to reasons to use rep movsb list.Kimono
@NominalAnimal - your file seems to be the results from two runs of TMB concatenated. Are they from the same hardware/compiler/etc combination, or is there some difference I should note?Kimono
No -- same hardware, compiler, etc. (According to my commandline history, I had for some reason just run ./tinymembench >> tinymembench-...txt twice.) Here's a third run, still on the same hardware etc. I have (all three times) explicitly set the performance governor so the CPU frequency stays between 2750 and 2850 MHz, but as you can see, there is quite a bit of fluctuation. (I kept activity low, but not completely idle, during the runs.)Inductive
Thanks! Yeah, I have noticed that memory tests are much more susceptible to variation than CPU-exec bound stuff. It makes sense: memory is inherently a shared resource, so any other activity eats into your bandwidth and competes for resources like page walkers, distracts the prefetchers and so on. For a pure CPU test, unless you are running lots of stuff, you are likely to have a core to yourself the whole time, and I can often get variation of < 0.1% without much work. @NominalAnimalKimono
Do we have a benchmark on an Ivy Bridge system yet?Pedropedrotti
@BeeOnRope, I was writing programs in Assembly language on 8086 and 80286 processors, mostly “Terminate and Stay Resident” programs. But since that I was mostly using high-level languages – Delphi and C/C++. I was just writing assembler subroutines occasionally, and didn’t follow the trends in the development of the processors. Now I have found out that since Pentium Pro the rules have changed dramatically since 80286 times. I’m now trying to catch up with the recent developments.Tucker
@Kimono - I’m now writing a memory Move() function for Delphi 64-bit since it doesn’t have an optimized version, just a slow one implemented in Pascal. My first versions are slow because of the startup costs (various comparisons). What would you recommend to read about modern branching?Tucker
@Kimono - I was experimenting with various ways to copy data - rep movsb vs "vmovdqa ymm". My preliminary feeling is that when the data is in the cache, vmovdqa is almost twice faster than movsb, but when the data is not in cache - it's the opposite - movsb is twice faster. There were just plain vmovdqa without any temporal or prefetch things. Although, it was hard to do real-life timing, all my tests were kind of "artificial". What would you recommend to read on cache?Tucker
@MaximMasiutin - the discussion of branch prediction is probably worth a whole separate question on SO, but the short answer is that the exact techniques for the most recent chips haven't been disclosed but you are probably looking at something very similar to TAGE on Intel and perceptrons on AMD. More generally I just recommend fully reading guides 1, 2 and 3 from Agner.Kimono
The precise behavior usually doesn't matter though: just assume that unless your sequence of branches follows some simple(ish) repeating pattern, that the predictor will simply predict the direction it sees most often, and hence you'll pay a ~20 cycle penalty every time the branch goes the "other" way. You can easily examine the actual performance of every branch in your application with perf stat and perf record -e branch-misses:pp on Linux (and whatever the equivalent is on Windows).Kimono
@MaximMasiutin - about memory, read Drepper's What every programmer should know about memory and Agner's stuff that I mentioned above. For the whole rep movsb thing, you can start with this question. Basically rep movsb is using something like NT stores internally, so you should use NT stores in your explicit copy for an apples-to-apples comparison. You can also just look at the source for various library memcpy impls which are generally good.Kimono
@Kimono - I saw Microsoft Visual Studio memcpy implementation and it is basically good, but it doesn't use 256-bit registers.Tucker
That's odd, certainly the libc memcpy implementations for clang and gcc do. See also this answer which implies that MSVC will vectorize memcpy under the right conditions.Kimono
@Kimono - I will also analyze the AsmLib at agner.org/optimize/#asmlib - he has a nice implementation for 64-bit that even supports zmm registers.Tucker
Keep in mind that it is GPL licensed, so you can't use this in your proprietary code without also releasing your source. @MaximMasiutinKimono
@Kimono - thank you very much for the warning. I will just see what kind of ideas I may get. By the way, I'm very interested in different things that can be done in the CPU. We are using AES-NI in our software, and recently we have implemented CRC32 instruction. I am also eager to implement SHA-1 and SHA-256 recently added by Intel.Tucker
@Kimono - In around 2006, Intel engineers came up with the idea of enlarging the CRC-32 look-up table in order to process 4 or 8 bytes at once. Their code is available on SourceForge at sourceforge.net/projects/slicing-by-8 , known as Slicing-by-8. Don't you know whether Slicing-by-8 etc. (Slicing-by-4, Slicing-by-16) is encumbered by any patent? There is such a big number of active patents on CRC and CRC32, and they are written in such a language that it is very hard to understand. Do you have any idea on Slicing?Tucker
@MaximMasiutin - no, I have no idea about the patent situation surrounding CRC calculations.Kimono
For the no-CPU-dispatching keep-it-simple case, I think rep movsq/d is a much better recommendation than rep movsb. Especially given your claim that real CPUs with ERMSB have fast rep movsd too. I've read that on CPUs before ERMSB, rep movsb is optimized for small copies only. If that means it never ramps up to full bandwidth, that makes it very bad. Or for x86-64, an SSE2 copy loop is not terrible; solid performance except in L1D.Epencephalon
Oh, I think I misunderstood that. You imply earlier in this answer that IvB's rep movsd isn't ERMSB-optimized, and only Haswell-onward have fast rep movsd too.Epencephalon
re: overlap with other work. I think microcoded instructions monopolize the issue bandwidth until they're done issuing. But maybe not; the startup overhead with microcoded branches that can't use branch-prediction + speculative execution may somehow allow the front-end to fill the gaps with other instructions? Hmm, we could test by putting a very small copy in a loop with other work. Anyway, earlier stages of the front-end can keep going, though, and can have following instructions fetched/decoded, even if there were I-cache or ITLB misses while rep movsb uops were issuing.Epencephalon
I just tested on SKL-S with a 17-byte rep movsb in a 1G-iteration loop. Adding a block of 16 inc r and mov r,r instructions (with lots of ILP) delayed the loop significantly, from 27G cycles to 30G cycles. Even though the uops-issued throughput was only 88G per 30G cycles ~= 2.9 per clock. (With just the setup+rep movsb in the loop, 72G uops_issued.any in 27.023G core clock cycles ~= 2.66 uops per clock.) Similar for a 4096B copy, extra ALU uops are not free.Epencephalon
You wrote that rep movsb allows them and must produce the expected result, however, according to Intel Optimization Manual, overlapping regions are prohibitively slow with rep movsbTucker
@peter are you using issue in the Intel sense or in the "everyone else" sense? AFAIK Intel reverses the definitions of "issue" and "dispatch" from the generally used ones. If I'm not mistaken the usually used ones are "ops are dispatched from the front end into the scheduler and subsequently issued from the scheduler into the EUs", and Intel is the opposite.Kimono
@BeeOnRope: Intel terminology: the issue/rename stage takes uops (or microcode-pointers) from the IDQ and adds uops to the ROB + RS.Epencephalon
@MaximMasiutin: overlapping src/dst for rep movs produces the expected result in terms of correctness, not performance :/Epencephalon
@PeterCordes - yes, I seem to have been inconsistent about movsd versus movsb, in some places claiming they have the same performance on erms platforms, but above I'm saying that earlier tests on earlier archs with ERMSB show movsb performing much faster than movsd. That's specific enough that I must have seen the data, but I can't find it in this thread. It may have come from one of these two big threads on RWT, or perhaps from the examples in the Intel manual.Kimono
For example, the Intel manual has Figure 3-4. Memcpy Performance Comparison for Lengths up to 2KB which shows that rep movsd (plus a trailing movsb for the last three bytes) on Ivy Bridge scales considerably worse than movsb up to 256 bytes, at which point the slope appears to be the same. There are some Ivy Bridge results here, which show rep movsd about 3% slower than rep movsb, but maybe that's within measurement error and not big even if not.Kimono
@PeterCordes - I don't necessarily agree with rep movsd being a "much better" option than rep movsb. For one thing, rep movsd doesn't implement memcpy since it works in chunks of 4, so you have to have some additional head or tail handling code to make it equivalent to movsb. For sizes > 3 maybe the best you can do is a final possibly-overlapping 4 byte store. For sizes <= 3 you probably need a branch (possible mispredict). What ends up being best likely depends on how common you expect erms hardware to be: there are plenty of domains where erms is pretty much ubiquitous even today.Kimono
I was thinking of a use-case where copy sizes are medium to large so they hide the overhead of a potentially unaligned or overlapping first4 / last4, and if we want to avoid CPU detection, not being terrible on any platform is probably more important than the last 20% of performance on recent ones. (Specifically, this SO question: multiple-of-16 block-copies in 16-bit code.) I wasn't thinking about code that's guaranteed to run on IVB or HSW or later, but that's an interesting case.Epencephalon
Well it doesn't have to be a guarantee - but someone implementing a custom memcpy for proprietary use may very well know that everything before IVB is very unimportant. In any case, scanning instlatx64.atw.hu I don't find really any semi-modern chips even before IVB where rep movsb is much worse. That only covers the L1, but usually the trend is that larger copies minimize the differences further. I did find this old Athlon where rep movsb was more than 4x worse (newer AMD chips seem better).Kimono
@PeterCordes - about the "overlapping work" claim, I pretty much got that idea from Intel's manual - see the quote I just added to the answer. It seems like they are saying that the rep movsb only takes one entry in the ROB so lots of subsequent code can fit in there and be processed in parallel. That is, at the point that the final store is issued, and you switch to issuing subsequent instructions, there should be room in the ROB, but an explicit loop may not have room? It seems like your 4096B test should/could have captured that.Kimono
It is worth noting, though, that the change from 27G cycles -> 30G cycles for 16G more instructions implies an IPC of 16 / 3 = 5.33 for those instructions, so there is definitely some overlapping going on (but it isn't in any way clear it's better than the overlapping you'd get from an explicit copy loop, as Intel claims).Kimono
I just re-read what Intel says. That's an interesting point; I had only been thinking about issue, not retirement. Maybe it matters most if the stores miss in cache, preventing early vmovaps stores from retiring once the store-buffer fills. And good point about marginal IPC > 4. High-IPC low-latency filler might be a bad test-case for overlap. A long loop-carried dependency chain of single-uop SQRTSS instructions would benefit a lot more overlap. Hmm, maybe with a serializing instruction in the middle of some SQRTSS to stop rep movsb from issuing while the RS was full of sqrt.Epencephalon
@PeterCordes - I started a chat to discuss a kind-of-sidetrack confusion about exactly where the MSROM issues to.Kimono
I have updated my answer and demonstrated that, surprisingly, REP MOVSD/MOVSQ is the universal solution that works excellent on all processors - no ERMSB is ever required to copy large blocks of memory fast.Tucker
Neat update with the comp.arch thread. I posted a reply (groups.google.com/d/msg/comp.arch/ULvFgEM_ZSY/wd_AES-uBAAJ) with some contributions. Dunno if you're following that thread to get notifications of new posts yourself.Epencephalon
Just came across an interesting paper: Main Memory and Cache Performance of Intel Sandy Bridge and AMD Bulldozer which looks at a 2P SnB for cache lines in various states. (Other socket, Exclusive, Shared, etc.)Epencephalon
@PeterCordes - thanks for the paper link. I had read it before but it was good to skim it again. What stuck out to me was how much getting a line "from L3" varies if it is the L1 or L2 of another core. It can be close to 3 times slower (40 vs 15 ns) if the transfer has to wait for the line to be invalidated in the L1 of another core (same socket). That's potentially relevant for concurrent algorithms and makes the idea of "cache control" like this or even something like this.Kimono
Follow up question about the remark on AVX and Linux Kernel unix.stackexchange.com/q/475956/3285Premature
@Kimono So this is not clear why most of memcpy implementations use ERMS instead of SSE2/AVX Non-Temporal stores which has higher throughput? Ubuntu 18.04, Kaby Lake __memmove_avx_unaligned_erms is used.Anselmi
@Anselmi - it isn't clear to me, you'd have to look at the glibc history to see if there were some benchmarks or whatever, but the difference is not too large for large sizes. I'm actually in the middle of writing this up: and there some other benefits of movsb, e.g., that it can detect whether the current CPU mode allows you to efficiently use 256-bit or 512-bit instructions (i.e., the power license allows it) and then use them. Writing it explicitly you can't easily do that: you have to make a static choice. I don't think that's why glibc did it though :).Kimono
@Kimono That was my bad understanding of the __memmove_avx_unaligned_erms implementation. It actually does use AVX Non-Temporal stores depending on the data size. The __x86_shared_non_temporal_threshold is 6 MiB on my machine. The more strange thing that I found is that in such case they use software prefetcht0 of the 4 consecutive cache lines. Doesn't HW prefetch work?Anselmi
@Kimono regarding "Avoiding the RFO request when it knows the entire cache line will be overwritten." Are there other conditions for this aswell? I tested with both temporal and non-temporal stores with zmm registers (so 64 byte store) and didn't see any reduction in RFO requests as compared with using ymm or xmm registers.Retsina
@Retsina I did the same test with temporal stores with the same result (still saw an RFO for every 64-byte store). About non-temporal stores, however, these should never do an RFO, assuming you write a full line, regardless of vector size? So NT stores are another case where the RFO is avoided, although the protocol seems a bit different than REP MOVSB (in particular, the lines are cleared from the cache). I believe both of these still do need a request to get the line in an exclusive state, but just don't need the data along with that permission: an RFO-ND (no data).Kimono
@Kimono if you're interested, I made a question about being unable to show that rep movsb optimizes out RFO requests here. tl;dr: for a given set of lines touched I see the same number of RFO requests with rep movsb as with a temporal-store memcpy. The only reason for fewer RFO requests is better prefetching on the store address (so fewer lines an RFO is needed for).Retsina
@Kimono do you have any idea what perf counters to look at for RFO-ND? Also with non-temporal I strictly less RFO (so don't think its doing RFO-ND, think its just bypassing RFO all together).Retsina
@Retsina - I didn't follow "I strictly less RFO".Kimono
@Kimono ahh sorry. I mean with non-temporal stores I measure basically a 99-100% reduction in RFO requests.Retsina
@Retsina - I think let's continue the discussion on that other question.Kimono
@Kimono I have done multiple tests on Skylake and Kaby lake of rep movsb vs rep movsq, and rep movsq was faster. I have not done tinymembench then, used my own tests. Now I have access only to a Skylake-SP and run tinymembench that confirmed my findings. I will post the results. Therefore, Intel is somehow correct that it is better to use rep movsq or rep movsd rather than rep movsb.Tucker
Travis, you were cited by a recent publication from Google Research. See [28].Retsina
@Kimono something related to rep movsb performance I've noticed (on Tigerlake at least) is that because it implements memmove, not memcpy, if it's doing a forward copy (rsi > rdi) having rdi 64-byte aligned improves performance. Likewise, if it's doing a backward copy (rsi < rdi) then having rdi + rcx 64-byte aligned improves performance. I observe no benefit to aligning rdi if it's doing a backward copy and no benefit to aligning rdi + rcx if it's doing a forward copy. Not 100% related to the post but thought it was interesting enough to comment.Retsina
@Retsina - neat, makes sense.Kimono
@Kimono I misspoke. rep movsb does not implement memmove. it does something totally different for backward copyRetsina
@Retsina - oh yeah, right, I even knew that. There's sort of 3 valid approaches to copies: memcpy ("no overlap, or else"), memmove ("as if the entire region was copied to an intermediate buffer first") and "defined copy algorithm", where the sequence of reads and writes is specifically defined and it must behave as if it's done like that. std::copy and rep movs* are like that, both of which copy forward an element at a time (subject to "as if", as always). memcpy can be implemented without modification by either of the other two, but that's it.Kimono
T
21

Enhanced REP MOVSB (Ivy Bridge and later)

The Ivy Bridge microarchitecture (processors released in 2012 and 2013) introduced Enhanced REP MOVSB (ERMSB). We still need to check the corresponding CPUID bit before relying on it. ERMSB was intended to allow us to copy memory fast with rep movsb.
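Checking the feature bit can be sketched as follows. This is a minimal sketch, assuming GCC or Clang and their `<cpuid.h>` helper `__get_cpuid_count`; the function names `has_erms` and `has_fsrm` are made up for illustration. ERMS is reported in CPUID leaf 7, sub-leaf 0, EBX bit 9, and FSRM in EDX bit 4 of the same leaf.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang helper header, assumed to be available */
#endif

/* Reads CPUID leaf 7, sub-leaf 0. Returns 1 on success, 0 if the leaf
   is not supported or this is not an x86 build. */
static int cpuid_leaf7(unsigned int *ebx, unsigned int *edx) {
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax = 0, ecx = 0;
  return __get_cpuid_count(7, 0, &eax, ebx, &ecx, edx) ? 1 : 0;
#else
  (void)ebx; (void)edx;
  return 0;
#endif
}

/* ERMS: CPUID.(EAX=7,ECX=0):EBX bit 9 */
static int has_erms(void) {
  unsigned int ebx = 0, edx = 0;
  return cpuid_leaf7(&ebx, &edx) ? (int)((ebx >> 9) & 1u) : 0;
}

/* FSRM: CPUID.(EAX=7,ECX=0):EDX bit 4 */
static int has_fsrm(void) {
  unsigned int ebx = 0, edx = 0;
  return cpuid_leaf7(&ebx, &edx) ? (int)((edx >> 4) & 1u) : 0;
}
```

A caller would typically run these checks once at startup and cache the result, rather than executing CPUID on every copy.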

The cheapest versions of later processors - Kaby Lake Celeron and Pentium, released in 2017 - don't have AVX, which could otherwise have been used for fast memory copy, but they still have Enhanced REP MOVSB. And some of Intel's mobile and low-power architectures released in 2018 and onwards, which were not based on SkyLake, copy about twice as many bytes per CPU cycle with REP MOVSB as previous generations of microarchitectures.

Before the Ice Lake microarchitecture introduced Fast Short REP MOV (FSRM), Enhanced REP MOVSB (ERMSB) was only faster than an AVX copy or a general-purpose register copy if the block size was at least 256 bytes. For blocks below 64 bytes, it was much slower, because ERMSB has a high internal startup cost of about 35 cycles. The FSRM feature is intended to make blocks below 128 bytes quick as well.

See the Intel Manual on Optimization, section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf (applies to processors which did not yet have FSRM):

  • startup cost is 35 cycles;
  • both the source and destination addresses have to be aligned to a 16-Byte boundary;
  • the source region should not overlap with the destination region;
  • the length has to be a multiple of 64 to produce higher performance;
  • the direction has to be forward (CLD).

As I said earlier, REP MOVSB (on processors before FSRM) begins to outperform other methods when the length is at least 256 bytes, but to see the clear benefit over AVX copy, the length has to be more than 2048 bytes. Also, it should be noted that merely using AVX (256-bit registers) or AVX-512 (512-bit registers) for memory copy may sometimes have dire consequences like AVX/SSE transition penalties or reduced turbo frequency. So the REP MOVSB is a safer way to copy memory than AVX.
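The size thresholds above suggest dispatching on length. The following is only a sketch under stated assumptions: the 256-byte cutoff is taken from the text and would need tuning per CPU, `copy_rep_movsb` and `copy_dispatch` are hypothetical names, and the inline assembly uses GCC/Clang extended asm syntax. On non-x86 builds the sketch simply falls back to memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Assumed crossover point from the text above; measure on your own CPU. */
#define REP_MOVSB_MIN 256

/* Forward copy with rep movsb. The System V and Windows ABIs guarantee
   the direction flag is clear on function entry, so no CLD is issued. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n) {
  void *ret = dst;  /* rep movsb advances the pointers, so save dst */
#if defined(__x86_64__) || defined(__i386__)
  __asm__ volatile("rep movsb"
                   : "+D"(dst), "+S"(src), "+c"(n)
                   :
                   : "memory");
#else
  memcpy(dst, src, n);  /* not an x86 build: plain fallback */
#endif
  return ret;
}

/* Small blocks go to the library memcpy, larger ones to rep movsb. */
static void *copy_dispatch(void *dst, const void *src, size_t n) {
  if (n >= REP_MOVSB_MIN)
    return copy_rep_movsb(dst, src, n);
  return memcpy(dst, src, n);
}
```

Unlike the `__movsb` wrapper in the question, this sketch returns the original destination pointer, which matches what memcpy returns.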

On the effect of alignment on REP MOVSB vs. AVX copy, the Intel Manual gives the following information:

  • if the source buffer is not aligned, the impact on ERMSB implementation versus 128-bit AVX is similar;
  • if the destination buffer is not aligned, the effect on ERMSB implementation can be 25% degradation, while 128-bit AVX implementation of memory copy may degrade only 5%, relative to 16-byte aligned scenario.

I have made tests on Intel Core i5-6600, under 64-bit, and I have compared REP MOVSB memcpy() with a simple MOV RAX, [SRC]; MOV [DST], RAX implementation when the data fits L1 cache:

REP MOVSB memory copy

 - 1622400000 data blocks of  32 bytes took 17.9337 seconds to copy;  2760.8205 MB/s
 - 1622400000 data blocks of  64 bytes took 17.8364 seconds to copy;  5551.7463 MB/s
 - 811200000 data blocks of  128 bytes took 10.8098 seconds to copy;  9160.5659 MB/s
 - 405600000 data blocks of  256 bytes took  5.8616 seconds to copy; 16893.5527 MB/s
 - 202800000 data blocks of  512 bytes took  3.9315 seconds to copy; 25187.2976 MB/s
 - 101400000 data blocks of 1024 bytes took  2.1648 seconds to copy; 45743.4214 MB/s
 - 50700000 data blocks of  2048 bytes took  1.5301 seconds to copy; 64717.0642 MB/s
 - 25350000 data blocks of  4096 bytes took  1.3346 seconds to copy; 74198.4030 MB/s
 - 12675000 data blocks of  8192 bytes took  1.1069 seconds to copy; 89456.2119 MB/s
 - 6337500 data blocks of  16384 bytes took  1.1120 seconds to copy; 89053.2094 MB/s

MOV RAX... memory copy

 - 1622400000 data blocks of  32 bytes took  7.3536 seconds to copy;  6733.0256 MB/s
 - 1622400000 data blocks of  64 bytes took 10.7727 seconds to copy;  9192.1090 MB/s
 - 811200000 data blocks of  128 bytes took  8.9408 seconds to copy; 11075.4480 MB/s
 - 405600000 data blocks of  256 bytes took  8.4956 seconds to copy; 11655.8805 MB/s
 - 202800000 data blocks of  512 bytes took  9.1032 seconds to copy; 10877.8248 MB/s
 - 101400000 data blocks of 1024 bytes took  8.2539 seconds to copy; 11997.1185 MB/s
 - 50700000 data blocks of  2048 bytes took  7.7909 seconds to copy; 12710.1252 MB/s
 - 25350000 data blocks of  4096 bytes took  7.5992 seconds to copy; 13030.7062 MB/s
 - 12675000 data blocks of  8192 bytes took  7.4679 seconds to copy; 13259.9384 MB/s

So, even on 128-byte blocks, REP MOVSB (on processors before FSRM) is slower than just a simple MOV RAX copy in a loop (not unrolled). The ERMSB implementation begins to outperform the MOV RAX loop only starting from 256-byte blocks.
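For reference, the kind of MOV RAX loop being compared can be approximated in C as below. This is a sketch rather than the code used for the measurements above (which was not posted); `copy_qword_loop` is a hypothetical name. An optimizing compiler may vectorize this loop or replace it with a memcpy call, so the generated assembly should be inspected before benchmarking.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies 8 bytes per iteration through a 64-bit temporary, then
   finishes any remaining bytes one at a time. */
static void *copy_qword_loop(void *dst, const void *src, size_t n) {
  unsigned char *d = dst;
  const unsigned char *s = src;
  while (n >= 8) {
    uint64_t tmp;
    memcpy(&tmp, s, 8);  /* typically one 64-bit load  (MOV RAX, [SRC]) */
    memcpy(d, &tmp, 8);  /* typically one 64-bit store (MOV [DST], RAX) */
    d += 8;
    s += 8;
    n -= 8;
  }
  while (n > 0) {        /* tail of fewer than 8 bytes */
    *d++ = *s++;
    n--;
  }
  return dst;
}
```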

Fast Short REP MOV (FSRM)

The Ice Lake microarchitecture, launched in September 2019, introduced Fast Short REP MOV (FSRM). This feature can be tested by a CPUID bit. It was intended to make strings of 128 bytes and less quick as well, but, in fact, strings below 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented in 64-bit mode, not in 32-bit mode. At least on my i7-1065G7 CPU, rep movsb is only quick for small strings in 64-bit mode; in 32-bit mode, strings have to be at least 4KB for rep movsb to start outperforming other methods.

Normal (not enhanced) REP MOVS on Nehalem (2009-2013)

Surprisingly, previous architectures (Nehalem and later, up to, but not including, Ivy Bridge), which didn't yet have Enhanced REP MOVSB, had a relatively fast REP MOVSD/MOVSQ (but not REP MOVSB/MOVSW) implementation for large blocks, as long as they were not large enough to exceed the L1 cache.

The Intel Optimization Manual (2.5.6 REP String Enhancement) gives the following information related to the Nehalem microarchitecture - Intel Core i5, i7 and Xeon processors released in 2009 and 2010 - and later microarchitectures, including Sandy Bridge manufactured up to 2013.

REP MOVSB

The latency for MOVSB is 9 cycles if ECX < 4. Otherwise, REP MOVSB with ECX > 9 has a 50-cycle startup cost.

  • tiny string (ECX < 4): the latency of REP MOVSB is 9 cycles;
  • small string (ECX is between 4 and 9): no official information in the Intel manual, probably more than 9 cycles but less than 50 cycles;
  • long string (ECX > 9): 50-cycle startup cost.

MOVSW/MOVSD/MOVSQ

Quote from the Intel Optimization Manual (2.5.6 REP String Enhancement):

  • Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
  • Fast string (ECX >= 76: excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of REP string latency will vary if one of the 16-byte data transfer spans across cache line boundary:
      - Split-free: the latency consists of a startup cost of about 40 cycles, and every 64 bytes of data adds 4 cycles.
      - Cache splits: the latency consists of a startup cost of about 35 cycles, and every 64 bytes of data adds 6 cycles.
  • Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.
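The quoted figures can be turned into a rough cycle estimate. The function below is only a back-of-the-envelope model of the three cases listed above for REP MOVSW/MOVSD/MOVSQ on Nehalem-era processors; the name `rep_movs_cycles_est` and the rounding of partial 64-byte chunks are assumptions, not something the manual specifies.

```c
#include <stddef.h>

/* count       : value of ECX/RCX (number of elements to move)
   elem        : element size in bytes (2, 4 or 8)
   cache_split : non-zero if 16-byte transfers cross cache line boundaries
   Returns the estimated latency in cycles per the quoted Intel figures. */
static unsigned long rep_movs_cycles_est(size_t count, size_t elem,
                                         int cache_split) {
  if (count <= 12)
    return 20;                            /* short string */
  if (count >= 76) {                      /* fast string */
    size_t chunks = (count * elem + 63) / 64;  /* 64-byte chunks, rounded up */
    if (cache_split)
      return (unsigned long)(35 + 6 * chunks);
    return (unsigned long)(40 + 4 * chunks);
  }
  return (unsigned long)(15 + count);     /* intermediate: 1 cycle/iteration */
}
```

For example, REP MOVSQ with a count of 512 (4096 bytes) in the split-free case gives 40 + 4 × 64 = 296 cycles under this model.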

Therefore, according to Intel, for very large memory blocks, REP MOVSW is as fast as REP MOVSD/MOVSQ. Anyway, my tests have shown that only REP MOVSD/MOVSQ are fast, while REP MOVSW is even slower than REP MOVSB on Nehalem and Westmere.

According to the information provided by Intel in the manual, on previous Intel microarchitectures (before 2008) the startup costs are even higher.

Conclusion: if you just need to copy data that fits L1 cache, just 4 cycles to copy 64 bytes of data is excellent, and you don't need to use XMM registers!

REP MOVSD/MOVSQ is the universal solution that works excellently on all Intel processors (no ERMSB required) if the data fits L1 cache

Here are the tests of REP MOVS* when the source and destination were in the L1 cache, with blocks large enough not to be seriously affected by startup costs, but not so large as to exceed the L1 cache size. Source: http://users.atw.hu/instlatx64/

Yonah (2006-2008)

    REP MOVSB 10.91 B/c
    REP MOVSW 10.85 B/c
    REP MOVSD 11.05 B/c

Nehalem (2009-2010)

    REP MOVSB 25.32 B/c
    REP MOVSW 19.72 B/c
    REP MOVSD 27.56 B/c
    REP MOVSQ 27.54 B/c

Westmere (2010-2011)

    REP MOVSB 21.14 B/c
    REP MOVSW 19.11 B/c
    REP MOVSD 24.27 B/c

Ivy Bridge (2012-2013) - with Enhanced REP MOVSB (all subsequent CPUs also have Enhanced REP MOVSB)

    REP MOVSB 28.72 B/c
    REP MOVSW 19.40 B/c
    REP MOVSD 27.96 B/c
    REP MOVSQ 27.89 B/c

SkyLake (2015-2016)

    REP MOVSB 57.59 B/c
    REP MOVSW 58.20 B/c
    REP MOVSD 58.10 B/c
    REP MOVSQ 57.59 B/c

Kaby Lake (2016-2017)

    REP MOVSB 58.00 B/c
    REP MOVSW 57.69 B/c
    REP MOVSD 58.00 B/c
    REP MOVSQ 57.89 B/c

I have presented test results for both SkyLake and Kaby Lake just for the sake of confirmation - these architectures have the same cycle-per-instruction data.

Cannon Lake, mobile (May 2018 - February 2020)

    REP MOVSB 107.44 B/c
    REP MOVSW 106.74 B/c
    REP MOVSD 107.08 B/c
    REP MOVSQ 107.08 B/c

Cascade lake, server (April 2019)

    REP MOVSB 58.72 B/c
    REP MOVSW 58.51 B/c
    REP MOVSD 58.51 B/c
    REP MOVSQ 58.20 B/c
    

Comet Lake, desktop, workstation, mobile (August 2019)

    REP MOVSB 58.72 B/c
    REP MOVSW 58.62 B/c
    REP MOVSD 58.72 B/c
    REP MOVSQ 58.72 B/c

Ice Lake, mobile (September 2019)

    REP MOVSB 102.40 B/c
    REP MOVSW 101.14 B/c
    REP MOVSD 101.14 B/c
    REP MOVSQ 101.14 B/c

Tremont, low power (September, 2020)

    REP MOVSB 119.84 B/c
    REP MOVSW 121.78 B/c
    REP MOVSD 121.78 B/c
    REP MOVSQ 121.78 B/c

Tiger Lake, mobile (October, 2020)

    REP MOVSB 93.27 B/c
    REP MOVSW 93.09 B/c
    REP MOVSD 93.09 B/c
    REP MOVSQ 93.09 B/c

As you see, the implementation of REP MOVS differs significantly from one microarchitecture to another. On some processors, like Ivy Bridge, REP MOVSB is fastest, albeit just slightly faster than REP MOVSD/MOVSQ, but there is no doubt that on all processors since Nehalem, REP MOVSD/MOVSQ works very well - you don't even need "Enhanced REP MOVSB", since, on Ivy Bridge (2013) with Enhanced REP MOVSB, REP MOVSD shows the same bytes-per-clock data as on Nehalem (2010) without Enhanced REP MOVSB, while in fact REP MOVSB became very fast only since SkyLake (2015) - twice as fast as on Ivy Bridge. So this Enhanced REP MOVSB bit in the CPUID may be confusing - it only shows that REP MOVSB per se is OK, but not that any REP MOVS* is faster.

The most confusing ERMSB implementation is on the Ivy Bridge microarchitecture. On very old processors, before ERMSB, REP MOVS* for large blocks did use a cache protocol feature that is not available to regular code (no-RFO), but, according to Andy Glew's comments on an answer to "why are complicated memcpy/memset superior?" (quoted in a Peter Cordes answer), this protocol is no longer used on Ivy Bridge, which has ERMSB. The same comments explain why the startup costs are so high for REP MOVS*: „The large overhead for choosing and setting up the right method is mainly due to the lack of microcode branch prediction”. There has also been an interesting note that Pentium Pro (P6) in 1996 implemented REP MOVS* with 64 bit microcode loads and stores and a no-RFO cache protocol - they did not violate memory ordering, unlike ERMSB in Ivy Bridge.

As for rep movsb vs rep movsq, on some processors with ERMSB rep movsb is slightly faster (e.g., Xeon E3-1246 v3), on others rep movsq is faster (Skylake), and on others it is the same speed (e.g. i7-1065G7). However, I would go for rep movsq rather than rep movsb anyway.
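Since rep movsq moves 8 bytes per iteration, a memcpy-style wrapper needs extra code for lengths that are not a multiple of 8. A minimal sketch, assuming GCC/Clang extended asm on x86-64 (the name `copy_rep_movsq` is hypothetical, and other targets fall back to memcpy):

```c
#include <stddef.h>
#include <string.h>

/* Copies n bytes forward: the bulk with rep movsq, then the remaining
   0..7 bytes with rep movsb. Regions must not overlap. */
static void *copy_rep_movsq(void *dst, const void *src, size_t n) {
  void *ret = dst;             /* the asm advances dst/src, so save dst */
#if defined(__x86_64__)
  size_t qwords = n / 8;
  size_t tail = n % 8;
  /* RDI/RSI are left pointing just past the copied qwords. */
  __asm__ volatile("rep movsq"
                   : "+D"(dst), "+S"(src), "+c"(qwords)
                   :
                   : "memory");
  /* With RCX == 0 this executes no iterations. */
  __asm__ volatile("rep movsb"
                   : "+D"(dst), "+S"(src), "+c"(tail)
                   :
                   : "memory");
#else
  memcpy(dst, src, n);         /* not an x86-64 build: plain fallback */
#endif
  return ret;
}
```

The tail handling adds a second microcoded instruction, so at small sizes this is likely to cost more than a single rep movsb; the choice only makes sense where rep movsq measures faster for the bulk of the copy.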

Please also note that this answer is only relevant for the cases where the source and the destination data fits L1 cache. Depending on circumstances, the particularities of memory access (cache, etc.) should be taken into consideration. Please also note that the information in this answer is only related to Intel processors and not to the processors by other manufacturers like AMD that may have better or worse implementations of REP MOVS* instructions.

Tinymembench results

Here are some of the tinymembench results to show relative performance of the rep movsb and rep movsd.

Intel Xeon E5-1650V3

Haswell microarchitecture, ERMS, AVX-2, released in September 2014 for $583, base frequency 3.5 GHz, max turbo frequency: 3.8 GHz (one core), L2 cache 6 × 256 KB, L3 cache 15 MB, supports up to 4×DDR4-2133, installed 8 modules of 32768 MB DDR4 ECC reg (256GB total RAM).

 C copy backwards                                     :   7268.8 MB/s (1.5%)
 C copy backwards (32 byte blocks)                    :   7264.3 MB/s
 C copy backwards (64 byte blocks)                    :   7271.2 MB/s
 C copy                                               :   7147.2 MB/s
 C copy prefetched (32 bytes step)                    :   7044.6 MB/s
 C copy prefetched (64 bytes step)                    :   7032.5 MB/s
 C 2-pass copy                                        :   6055.3 MB/s
 C 2-pass copy prefetched (32 bytes step)             :   6350.6 MB/s
 C 2-pass copy prefetched (64 bytes step)             :   6336.4 MB/s
 C fill                                               :  11072.2 MB/s
 C fill (shuffle within 16 byte blocks)               :  11071.3 MB/s
 C fill (shuffle within 32 byte blocks)               :  11070.8 MB/s
 C fill (shuffle within 64 byte blocks)               :  11072.0 MB/s
 ---
 standard memcpy                                      :  11608.9 MB/s
 standard memset                                      :  15789.7 MB/s
 ---
 MOVSB copy                                           :   8123.9 MB/s
 MOVSD copy                                           :   8100.9 MB/s (0.3%)
 SSE2 copy                                            :   7213.2 MB/s
 SSE2 nontemporal copy                                :  11985.5 MB/s
 SSE2 copy prefetched (32 bytes step)                 :   7055.8 MB/s
 SSE2 copy prefetched (64 bytes step)                 :   7044.3 MB/s
 SSE2 nontemporal copy prefetched (32 bytes step)     :  11794.4 MB/s
 SSE2 nontemporal copy prefetched (64 bytes step)     :  11813.1 MB/s
 SSE2 2-pass copy                                     :   6394.3 MB/s
 SSE2 2-pass copy prefetched (32 bytes step)          :   6255.9 MB/s
 SSE2 2-pass copy prefetched (64 bytes step)          :   6234.0 MB/s
 SSE2 2-pass nontemporal copy                         :   4279.5 MB/s
 SSE2 fill                                            :  10745.0 MB/s
 SSE2 nontemporal fill                                :  22014.4 MB/s

Intel Xeon E3-1246 v3

Haswell, ERMS, AVX-2, 3.50GHz

 C copy backwards                                     :   6911.8 MB/s
 C copy backwards (32 byte blocks)                    :   6919.0 MB/s
 C copy backwards (64 byte blocks)                    :   6924.6 MB/s
 C copy                                               :   6934.3 MB/s (0.2%)
 C copy prefetched (32 bytes step)                    :   6860.1 MB/s
 C copy prefetched (64 bytes step)                    :   6875.6 MB/s (0.1%)
 C 2-pass copy                                        :   6471.2 MB/s
 C 2-pass copy prefetched (32 bytes step)             :   6710.3 MB/s
 C 2-pass copy prefetched (64 bytes step)             :   6745.5 MB/s (0.3%)
 C fill                                               :  10812.1 MB/s (0.2%)
 C fill (shuffle within 16 byte blocks)               :  10807.7 MB/s
 C fill (shuffle within 32 byte blocks)               :  10806.6 MB/s
 C fill (shuffle within 64 byte blocks)               :  10809.7 MB/s
 ---
 standard memcpy                                      :  10922.0 MB/s
 standard memset                                      :  28935.1 MB/s
 ---
 MOVSB copy                                           :   9656.7 MB/s
 MOVSD copy                                           :   9430.1 MB/s
 SSE2 copy                                            :   6939.1 MB/s
 SSE2 nontemporal copy                                :  10820.6 MB/s
 SSE2 copy prefetched (32 bytes step)                 :   6857.4 MB/s
 SSE2 copy prefetched (64 bytes step)                 :   6854.9 MB/s
 SSE2 nontemporal copy prefetched (32 bytes step)     :  10774.2 MB/s
 SSE2 nontemporal copy prefetched (64 bytes step)     :  10782.1 MB/s
 SSE2 2-pass copy                                     :   6683.0 MB/s
 SSE2 2-pass copy prefetched (32 bytes step)          :   6687.6 MB/s
 SSE2 2-pass copy prefetched (64 bytes step)          :   6685.8 MB/s
 SSE2 2-pass nontemporal copy                         :   5234.9 MB/s
 SSE2 fill                                            :  10622.2 MB/s
 SSE2 nontemporal fill                                :  22515.2 MB/s (0.1%)

Intel Xeon Skylake-SP

Skylake, ERMS, AVX-512, 2.1 GHz (Xeon Gold 6152 at base frequency, no turbo)

 MOVSB copy                                           :   4619.3 MB/s (0.6%)
 SSE2 fill                                            :   9774.4 MB/s (1.5%)
 SSE2 nontemporal fill                                :   6715.7 MB/s (1.1%)

Intel Xeon E3-1275V6

Kaby Lake, released in March 2017 for $339, base frequency 3.8 GHz, max turbo frequency 4.2 GHz, L2 cache 4 × 256 KB, L3 cache 8 MB, 4 cores (8 threads), 4 RAM modules of 16384 MB DDR4 ECC installed, but it can use only 2 memory channels.

 MOVSB copy                                           :  11720.8 MB/s
 SSE2 fill                                            :  15877.6 MB/s (2.7%)
 SSE2 nontemporal fill                                :  36407.1 MB/s

Intel i7-1065G7

Ice Lake, AVX-512, ERMS, FSRM, 1.37 GHz (worked at the base frequency, turbo mode disabled)

 MOVSB copy                                           :   7322.7 MB/s
 SSE2 fill                                            :   9681.7 MB/s
 SSE2 nontemporal fill                                :  16426.2 MB/s

AMD EPYC 7401P

Released in June 2017 at US $1075, based on Zen gen.1 microarchitecture, 24 cores (48 threads), base frequency: 2.0GHz, max turbo boost: 3.0GHz (few cores) or 2.8 (all cores); cache: L1 - 64 KB inst. & 32 KB data per core, L2 - 512 KB per core, L3 - 64 MB, 8 MB per CCX, DDR4-2666 8 channels, but only 4 RAM modules of 32768 MB each of DDR4 ECC reg. installed.

 MOVSB copy                                           :   7718.0 MB/s
 SSE2 fill                                            :  11233.5 MB/s
 SSE2 nontemporal fill                                :  34893.3 MB/s

AMD Ryzen 7 1700X (4 RAM modules installed)

 MOVSB copy                                           :   7444.7 MB/s
 SSE2 fill                                            :  11100.1 MB/s
 SSE2 nontemporal fill                                :  31019.8 MB/s

AMD Ryzen 7 Pro 1700X (2 RAM modules installed)

 MOVSB copy                                           :   7251.6 MB/s
 SSE2 fill                                            :  10691.6 MB/s
 SSE2 nontemporal fill                                :  31014.7 MB/s

AMD Ryzen 7 Pro 1700X (4 RAM modules installed)

 MOVSB copy                                           :   7429.1 MB/s
 SSE2 fill                                            :  10954.6 MB/s
 SSE2 nontemporal fill                                :  30957.5 MB/s

Conclusion

REP MOVSD/MOVSQ is the universal solution that works relatively well on all Intel processors for large memory blocks of at least 4KB (no ERMSB required) if the destination is aligned by at least 64 bytes. REP MOVSD/MOVSQ works even better on newer processors, starting from Skylake. And, for Ice Lake or newer microarchitectures, it works perfectly for even very small strings of at least 64 bytes.

Tucker answered 7/5, 2017 at 22:56 Comment(13)
Interesting L1D medium-size-buffer data. It may not be the whole story, though. Some of the benefits of ERMSB (like weaker ordering of the stores) will only show up with larger buffers that don't fit in cache. Even regular fast-strings rep movs is supposed to use a no-RFO protocol, though, even on pre-ERMSB CPUs.Epencephalon
If I understand it correctly, you just scraped the L1D-only numbers from instlatx64 results. So the conclusion is really that all of movsb, movsd, movsq perform approximately the same on all recent Intel platforms. The most interesting takeaway is probably "don't use movsw". You don't compare to an explicit loop of mov instructions (including 16-byte moves on 64-bit platforms, which are guaranteed to be available), which will probably be faster in many cases. You don't show what happens on AMD platforms, nor when the size exceeds the L1 size.Kimono
Finally, you should note that nothing other than rep movsb actually implements memcpy (and none of them implement memmove), so you need extra code for the other variants. This is only likely to matter at small sizes.Kimono
@Kimono - thank you for pointing out the drawbacks - I have updated the answer. I have explicitly stated that it only applies to cases where the data is in the L1 cache (although it had already been specified, albeit not so explicitly), and I have also mentioned that it only relates to Intel, not AMD.Tucker
@PeterCordes - Yes, on very old processors, before ERMSB, REP MOVS* for large blocks did use a cache protocol feature that is not available to regular code (no-RFO). But this protocol is no longer used on Ivy Bridge that has ERMSB. According to Andy Glew's comments on an answer to "why are complicated memcpy/memset superior?" from your answer at https://mcmap.net/q/14560/-what-setup-does-rep-do , a cache protocol feature that is not available to regular code was once used on older processors, but no longer on Ivy Bridge.Tucker
@PeterCordes - And there comes an explanation at this quote: „The large overhead for choosing and setting up the right method is mainly due to the lack of microcode branch prediction”. There has also been an interesting note that Pentium Pro (P6) in 1996 implemented REP MOVS* with 64 bit microcode loads and stores and a no-RFO cache protocol - they did not violate memory ordering, unlike ERMSB in Ivy Bridge.Tucker
Yes, that quote is exactly what I was referring to.Epencephalon
@MaximMasiutin - where do you get that ERMSB no longer uses a no-RFO protocol not available to regular code? It certainly still uses a non-RFO protocol, at least for large copies, since it gets performance that is really only possible with non-RFO (this is most obvious for stosb but it applies to the mov variants too). It's debatable whether this is still "not available to regular code" since you get much the same effect with NT stores, so it isn't clear whether "not available to regular code" just means NT stores on platforms that didn't have them, or something other than NT stores.Kimono
@Kimono - I've found it at https://mcmap.net/q/14560/-what-setup-does-rep-do quote: "Intel x86 have had "fast strings" since the Pentium Pro (P6) in 1996, which I supervised. The P6 fast strings took REP MOVSB and larger, and implemented them with 64 bit microcode loads and stores and a no-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in iVB."Tucker
@Kimono - I guess that the "iVB" acronym in the above quote means "Ivy Bridge"Tucker
Would be really interested in having AMD Zen And Zen 2 archs added to the benchmark table.Kilogrammeter
@BeeOnRope, I have added some Tinymembench results for AMD processors.Tucker
@PeterCordes, I have added some Tinymembench results for AMD processors, but reached the message limit, so had to cut some lines from the results. I hope you will find this data interesting.Tucker
V
9

You say that you want:

an answer that shows when ERMSB is useful

But I'm not sure it means what you think it means. Looking at the 3.7.6.1 docs you link to, it explicitly says:

implementing memcpy using ERMSB might not reach the same level of throughput as using 256-bit or 128-bit AVX alternatives, depending on length and alignment factors.

So just because CPUID indicates support for ERMSB, that isn't a guarantee that REP MOVSB will be the fastest way to copy memory. It just means it won't suck as bad as it has in some previous CPUs.

However just because there may be alternatives that can, under certain conditions, run faster doesn't mean that REP MOVSB is useless. Now that the performance penalties that this instruction used to incur are gone, it is potentially a useful instruction again.

Remember, it is a tiny bit of code (2 bytes!) compared to some of the more involved memcpy routines I have seen. Since loading and running big chunks of code also has a penalty (throwing some of your other code out of the cpu's cache), sometimes the 'benefit' of AVX et al is going to be offset by the impact it has on the rest of your code. Depends on what you are doing.

You also ask:

Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?

It isn't going to be possible to "do something" to make REP MOVSB run any faster. It does what it does.

If you want the higher speeds you are seeing from memcpy, you can dig up the source for it. It's out there somewhere. Or you can trace into it from a debugger and see the actual code paths being taken. My expectation is that it's using some of those AVX instructions to work with 128 or 256 bits at a time.

Or you can just... Well, you asked us not to say it.

Valdes answered 20/4, 2017 at 9:8 Comment(11)
I tested REP MOVSB for sizes in the L3 cache and indeed it is competitive with an SSE/AVX solution. But I have not found it to be clearly better yet. And for sizes larger than the L3 cache non-temporal stores still win big time. Your point about code size is an interesting one and worth considering. I don't know much about microcode. REP MOVSB is implemented with microcode so even though it does not use up much of the code cache and counts only as one instruction it may still use up many of the ports and/or micro-ops.Pedropedrotti
"have not found it to be clearly better yet." Better than what? "Enhanced" isn't the same as "Optimal." I haven't seen any place that promised that it would be the best performer. I don't believe that's what that cpu flag is intended to convey. It is better than it was on platforms where it incurred a penalty (over even a movq/cmp loop). "code size" isn't always easy to see. Just like memory which is stored in cache lines that gets swapped in and out of the cpu, so does code. Paging in a huge old memcpy means that some of your other code will get evicted.Valdes
See the end of my question where I quote a comment that claims that ERMSB should be better than non-temporal stores even for large sizes.Pedropedrotti
Not to be critical of Stephen, but perhaps he's just wrong? My IvyBridge box is giving roughly the same speeds for movsb, memcpy, _mm_load_si128(MOVDQA) and _mm_stream_load_si128(MOVNTDQA) using your test harness. Perhaps he's in a different environment? Your test is aligned, (fairly) large, and usermode (does he write drivers?). Also, as I start cranking down the sizes (16k) and cranking up the loops (10000000), I find that rep movsb is giving clearly better performance than the alternatives (3.5, 4.3, 9.7, 9.7). I'd want more context/evidence before I'd accept what he's saying as gospel.Valdes
Wait! You have evidence that rep movsb is better than the alternatives? I want to hear more about that. To clarify, I'm not looking for an answer that only shows where rep movsb is better for large arrays (maybe that is untrue anyway). I would be interested to see any example where rep movsb is better than alternatives.Pedropedrotti
This answer really nails what needs to be said. The key is that memcpy is highly optimized, doing all kinds of crazy things to get the most speed possible. If you study your library's implementation, you will probably be amazed. (Unless you're using Microsoft's compiler, then you may be disappointed, but you wouldn't be asking this question.) It's very unlikely that you're going to beat a hand-tuned memcpy function in speed, and if you could, then also very likely the Glibc folks would switch over to it when tuning for Ivy Bridge or whatever architecture supported these enhancements.Bitterling
The big advantage of REP MOVSB has always been its size. It was true on the 8088 where these CISC string instructions were a big deal, and size dwarfed most other concerns, given (A) the limited bus width, and (B) the extremely small prefetch queue. As these concerns diminished, and other instructions became faster, string instructions were less of a performance win. Intel has "enhanced" them a couple of times (circa PPro, and now apparently again with IVB), but that just brings them up to competitive—they're not "the fastest way to do it". The Intel manuals even point that out.Bitterling
The only time you would want to use the string instructions would be if small code size was your overriding concern. If you're calling a string-manipulation instruction from a library, you don't really care how large it is—you just want it to be fast, and that's what Glibc gives you. If you're hand-writing a loop in assembly and hand-wringing over cache eviction possibilities, then you might decide that REP MOVSB is the best solution. It won't be the highest bandwidth solution, but it might be the superior solution overall.Bitterling
To clarify: While I got rep movsb to run faster than memcpy, I'm on Windows. Taking apart the executable, it appears that the memcpy is calling the (not very impressive) memcpy version Cody was just discussing. I can post the code I used that attempts to use MOVDQA and MOVNTDQA, but I'm not sure what it would prove (beyond the fact that I'm not a world-class memcpy writer).Valdes
@CodyGray - you can still beat the "hand tuned" memcpy implementation in specific contexts for a variety of reasons: (a) you know more about your application's distribution of sizes/alignments and can make different tradeoffs than the standard library (b) you know more about your target microarchitecture(s) than the standard library designers who need to target a broad swath of CPUs or (c) you can use features only recently introduced in CPUs and not yet incorporated into your version of the standard library ...Kimono
... or (d) you can relax some implementation constraints based on application-specific knowledge (e.g., you might know that it's always safe to overwrite up to a certain number of bytes beyond a copied region which helps with the tail handling).Kimono
I
9

This is not an answer to the stated question(s), only my results (and personal conclusions) when trying to find out.

In summary: GCC already optimizes memset()/memmove()/memcpy() (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). So, there is no reason to expect massive gains by using your own variant with GCC (unless you've forgotten important stuff like alignment attributes for your aligned data, or do not enable sufficiently specific optimizations like -O2 -march= -mtune=). If you agree, then the answers to the stated question are more or less irrelevant in practice.

(I only wish there was a memrepeat(), the opposite of memcpy() compared to memmove(), that would repeat the initial part of a buffer to fill the entire buffer.)


I currently have an Ivy Bridge machine in use (Core i5-6200U laptop, Linux 4.4.0 x86-64 kernel, with erms in /proc/cpuinfo flags). Because I wanted to find out if I can find a case where a custom memcpy() variant based on rep movsb would outperform a straightforward memcpy(), I wrote an overly complicated benchmark.

The core idea is that the main program allocates three large memory areas: original, current, and correct, each exactly the same size, and at least page-aligned. The copy operations are grouped into sets, with each set having distinct properties, like all sources and targets being aligned (to some number of bytes), or all lengths being within the same range. Each set is described using an array of src, dst, n triplets, where all src to src+n-1 and dst to dst+n-1 are completely within the current area.

A Xorshift* PRNG is used to initialize original to random data. (Like I warned above, this is overly complicated, but I wanted to ensure I'm not leaving any easy shortcuts for the compiler.) The correct area is obtained by starting with original data in current, applying all the triplets in the current set, using memcpy() provided by the C library, and copying the current area to correct. This allows each benchmarked function to be verified to behave correctly.
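
The initialization step can be sketched as follows. This is not the code from my benchmark, only a minimal illustration: fill_random() is a hypothetical helper name, and the shift and multiplier constants are the commonly published xorshift64* ones.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One xorshift64* step. The state must never be zero. */
static uint64_t xorshift64star(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * UINT64_C(0x2545F4914F6CDD1D);
}

/* Hypothetical helper: fill buf[0..n-1] with pseudorandom bytes. */
static void fill_random(void *buf, size_t n, uint64_t seed)
{
    unsigned char *p = buf;
    uint64_t state = seed ? seed : 1;   /* avoid the all-zero state */

    /* Write whole 64-bit words while they fit. */
    while (n >= sizeof(uint64_t)) {
        uint64_t r = xorshift64star(&state);
        memcpy(p, &r, sizeof r);
        p += sizeof r;
        n -= sizeof r;
    }
    /* Write the remaining 0..7 bytes from one more word. */
    if (n > 0) {
        uint64_t r = xorshift64star(&state);
        memcpy(p, &r, n);
    }
}
```

The same seed always produces the same buffer, so the original area can be regenerated instead of being kept around.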

Each set of copy operations is timed a large number of times using the same function, and the median of these is used for comparison. (In my opinion, median makes the most sense in benchmarking, and provides sensible semantics -- the function is at least that fast at least half the time.)
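
A rough sketch of the median-of-timings idea, assuming a C11 library that provides timespec_get(). That call reads wall-clock time; on POSIX systems a monotonic clock is probably the better choice. The names copy_fn, median() and median_copy_time() are made up for this illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

typedef void (*copy_fn)(void *, const void *, size_t);

/* qsort() comparison for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of v[0..count-1], count >= 1. Sorts v in place. */
static double median(double *v, size_t count)
{
    qsort(v, count, sizeof v[0], cmp_double);
    return (count % 2) ? v[count / 2]
                       : 0.5 * (v[count / 2 - 1] + v[count / 2]);
}

/* Call fn(dst, src, n) `runs` times and return the median duration
   in seconds, or -1.0 if the sample array cannot be allocated. */
static double median_copy_time(copy_fn fn, void *dst, const void *src,
                               size_t n, size_t runs)
{
    double *t;
    double m;

    if (runs == 0 || !(t = malloc(runs * sizeof *t)))
        return -1.0;

    for (size_t i = 0; i < runs; i++) {
        struct timespec t0, t1;
        timespec_get(&t0, TIME_UTC);
        fn(dst, src, n);
        timespec_get(&t1, TIME_UTC);
        t[i] = (double)(t1.tv_sec - t0.tv_sec)
             + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
    }

    m = median(t, runs);
    free(t);
    return m;
}
```

Unlike the mean, the median is not dragged upwards by a few runs that happened to be interrupted.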

To avoid compiler optimizations, I have the program load the functions and benchmarks dynamically, at run time. The functions all have the same form, void function(void *, const void *, size_t) -- note that unlike memcpy() and memmove(), they return nothing. The benchmarks (named sets of copy operations) are generated dynamically by a function call (that takes the pointer to the current area and its size as parameters, among others).

Unfortunately, I have not yet found any set where

static void rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ __volatile__ ( "rep movsb\n\t"
                         : "+D" (dst), "+S" (src), "+c" (n)
                         :
                         : "memory" );
}

would beat

static void normal_memcpy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

when compiled with gcc -Wall -O2 -march=ivybridge -mtune=ivybridge (GCC 5.4.0) on the aforementioned Core i5-6200U laptop running a linux-4.4.0 64-bit kernel. Copying 4096-byte aligned and sized chunks comes close, however.

This means that at least thus far, I have not found a case where using a rep movsb memcpy variant would make sense. It does not mean there is no such case; I just haven't found one.

(At this point the code is a spaghetti mess I'm more ashamed than proud of, so I shall omit publishing the sources unless someone asks. The above description should be enough to write a better one, though.)


This does not surprise me much, though. The C compiler can infer a lot of information about the alignment of the operand pointers, and whether the number of bytes to copy is a compile-time constant, a multiple of a suitable power of two. This information can, and will/should, be used by the compiler to replace the C library memcpy()/memmove() functions with its own.

GCC does exactly this (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file to see architecture-dependent variants). Indeed, memcpy()/memset()/memmove() has already been separately optimized for quite a few x86 processor variants; it would quite surprise me if the GCC developers had not already included erms support.

GCC provides several function attributes that developers can use to ensure good generated code. For example, alloc_align (n) tells GCC that the function returns memory aligned to at least n bytes. An application or a library can choose which implementation of a function to use at run time, by creating a "resolver function" (that returns a function pointer), and defining the function using the ifunc (resolver) attribute.
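
As a sketch of run-time selection, here is a simpler stand-in for ifunc: an ordinary function pointer chosen after checking the ERMS feature bit (CPUID leaf 7, subleaf 0, EBX bit 9). It does not use the ifunc attribute itself. The names cpu_has_erms(), copy_rep_movsb(), copy_libc() and select_copy() are hypothetical, and on non-x86 targets the sketch just falls back to memcpy().

```c
#include <stddef.h>
#include <string.h>

typedef void (*copy_fn)(void *, const void *, size_t);

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* Nonzero if CPUID.(EAX=7,ECX=0):EBX bit 9 (ERMS) is set. */
static int cpu_has_erms(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid_max(0, NULL) < 7)
        return 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (int)((ebx >> 9) & 1u);
}

/* Assumes the direction flag is clear, as the ABI guarantees. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ __volatile__ ("rep movsb"
                          : "+D" (dst), "+S" (src), "+c" (n)
                          :
                          : "memory");
}
#endif

static void copy_libc(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Pick an implementation once, at run time. */
static copy_fn select_copy(void)
{
#if defined(__x86_64__) || defined(__i386__)
    if (cpu_has_erms())
        return copy_rep_movsb;
#endif
    return copy_libc;
}
```

Whether the rep movsb branch is actually worth selecting is exactly what the benchmark above failed to show, so treat the selection rule as a placeholder.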

One of the most common patterns I use in my code for this is

some_type *pointer = __builtin_assume_aligned(ptr, alignment);

where ptr is some pointer, alignment is the number of bytes it is aligned to; GCC then knows/assumes that pointer is aligned to alignment bytes.
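
A minimal sketch of that pattern in a copy routine. The 64-byte alignment and the name copy_aligned64() are assumptions made for the example; if the caller breaks the alignment promise, the behavior is undefined.

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes. The caller promises that both pointers are aligned
   to 64 bytes, which GCC may use when expanding memcpy() inline. */
static void copy_aligned64(void *dst, const void *src, size_t n)
{
    void *d = __builtin_assume_aligned(dst, 64);
    const void *s = __builtin_assume_aligned(src, 64);

    memcpy(d, s, n);
}
```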

Another useful built-in, albeit much harder to use correctly, is __builtin_prefetch(). To maximize overall bandwidth/efficiency, I have found that minimizing latencies in each sub-operation yields the best results. (For copying scattered elements to consecutive temporary storage, this is difficult, as prefetching typically involves a full cache line; if too many elements are prefetched, most of the cache is wasted by storing unused items.)
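
For illustration only, a gather loop that prefetches a few elements ahead. PREFETCH_AHEAD and gather_prefetch() are hypothetical, and the distance of 8 is a guess that would need tuning per machine.

```c
#include <stddef.h>

#define PREFETCH_AHEAD 8   /* hypothetical distance, tune per machine */

/* Gather src[idx[i]] into consecutive dst[i], hinting the element
   that will be needed PREFETCH_AHEAD iterations from now. */
static void gather_prefetch(double *dst, const double *src,
                            const size_t *idx, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (i + PREFETCH_AHEAD < count) {
            /* rw = 0: read access; locality = 0: no expected reuse. */
            __builtin_prefetch(&src[idx[i + PREFETCH_AHEAD]], 0, 0);
        }
        dst[i] = src[idx[i]];
    }
}
```

The prefetch is only a hint and never changes the result, so the loop stays correct even when the distance is badly chosen; only the speed suffers.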

Inductive answered 22/4, 2017 at 13:36 Comment(2)
i5-6200U laptop is not Ivy Bridge. It's Skylake. It would be interesting to see tingybenchmark on an Ivy Bridge system.Pedropedrotti
@Zboson: Quite true; thanks for pointing it out. I don't know where I took that assumption from; probably pulled it from my behind. Ouch. Does explain my results, too.Inductive
G
4

There are far more efficient ways to move data. These days, the implementation of memcpy will generate architecture specific code from the compiler that is optimized based upon the memory alignment of the data and other factors. This allows better use of non-temporal cache instructions and XMM and other registers in the x86 world.

Hard-coding rep movsb prevents this use of intrinsics.

Therefore, for something like a memcpy, unless you are writing something that will be tied to a very specific piece of hardware and unless you are going to take the time to write a highly optimized memcpy function in assembly (or using C level intrinsics), you are far better off allowing the compiler to figure it out for you.

Goldsworthy answered 11/4, 2017 at 10:34 Comment(8)
Actually, with enhanced rep movsb, using rep movsd is slower. Please read what this feature means before writing answers like this.Moonlighting
Hmm... Thanks. I removed the rep movsd comment.Goldsworthy
I discussed a custom memcpy here. One comment is "Note that on Ivybridge and Haswell, with buffers to large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs a RFO into LLC, rep movsb does not." I can get something as good as memcpy with movntdqa. My question is how do I do as well as that or better with rep movsb?Pedropedrotti
@Zboson I think you're treading into areas where it really depends on what you're doing. For me, I'm typically in this space writing video drivers for operating systems. In these cases modern DMA features that allow the move without passing through the processor are fastest, but all proprietary and unique. What's your actual application? What are you moving, from where and to where? Is it all within system memory?Goldsworthy
This is for education mostly. I am trying to learn about ERMSB. The end goal is to get the highest bandwidth possible from main memory. I provided the code in my question that I use. That's all I am doing.Pedropedrotti
This answer seems out of touch with the realities of "fast string move" instructions like ERMSB and repeats the fallacy that for the highest performance code you should let the compiler figure it out for you. Now granted, for most code, and most developers, to get high performance code you should let the compiler figure it out for you, but there is almost always a level beyond which a person well-versed in the details can make it faster (e.g., because they know more about the shape of the data, etc). The question falls into that category since it explicitly mentions the fast string ops, etc.Kimono
@fuz: Actually, on all current CPUs that implement ERMSB, rep movsd is apparently fast, too. (Even though you're right that Intel only documents ERMSB as applying to rep movsb/stosb)Epencephalon
Perhaps the rep movsd implementation simply calls directly into the same microcoded implementation as rep movsb with a copy count (rcx) multiplied by 4. They would still need to handle specially the 3 cases where the destination is exactly 1, 2 or 3 bytes ahead of the source since those behave differently for rep movsd (actually it would be interesting to test if such "close overlap" cases perform worse...).Kimono
T
2

As a general memcpy() guide:

a) If the data being copied is tiny (less than maybe 20 bytes) and has a fixed size, let the compiler do it. Reason: Compiler can use normal mov instructions and avoid the startup overheads.

b) If the data being copied is small (less than about 4 KiB) and is guaranteed to be aligned, use rep movsb (if ERMSB is supported) or rep movsd (if ERMSB is not supported). Reason: Using an SSE or AVX alternative has a huge amount of "startup overhead" before it copies anything.

c) If the data being copied is small (less than about 4 KiB) and is not guaranteed to be aligned, use rep movsb. Reason: Using SSE or AVX, or using rep movsd for the bulk of it plus some rep movsb at the start or end, has too much overhead.

d) For all other cases use something like this:

    mov edx,0
.again:
    pushad
.nextByte:
    pushad
    popad
    mov al,[esi]
    pushad
    popad
    mov [edi],al
    pushad
    popad
    inc esi
    pushad
    popad
    inc edi
    pushad
    popad
    loop .nextByte
    popad
    inc edx
    cmp edx,1000
    jb .again

Reason: This will be so slow that it will force programmers to find an alternative that doesn't involve copying huge globs of data; and the resulting software will be significantly faster because copying large globs of data was avoided.

Trahan answered 20/4, 2017 at 11:28 Comment(16)
"Using an SSE or AVX alternative has a huge amount of "startup overhead" before it copies anything." What is this huge amount of startup overhead you refer to? Can you give more details about this?Pedropedrotti
@Zboson: Checking if the start address is/isn't suitably aligned (for both source and dest), checking if size is a nice multiple, checking if rep movsb should be used anyway, etc (all with potential branch mispredictions). For most CPUs the SSE/AVX is turned off to save power when you're not using it, so you can get hit by "SSE/AVX turn on latency". Then function call overhead (too bloated to inline), which can include saving/restoring any SSE/AVX registers that were in use by caller. Finally, if nothing else used SSE/AVX there's extra saving/restoring SSE/AVX state during task switches.Trahan
@Zboson: Note that there is a "cross over point" where the performance improvement for actually copying data overcomes the overhead of all that mess. This cross over point varies (different CPUs, etc) but it's almost always a case of "SSE/AVX only helps when you shouldn't be copying so much in the first place".Trahan
@Zboson: Also; if people were smart they'd have multiple variations, like memcpy_small(), memcpy_large_unaligned(), memcpy_large_aligned(), etc. This would help to get rid of some of the startup overhead (the checking, etc). Unfortunately, people are more lazy than smart and (as far as I can tell) nobody actually does this.Trahan
Good points, but for rep movsb you want to align source and dest anyway. Doesn't the size need to be a multiple of 64 bytes to be efficient? If so, then you have to check the size just like with SIMD. Your power down point for AVX (I don't think that applies to SSE though) is interesting. Maybe the microcode for rep movsb uses SSE/AVX anyway?Pedropedrotti
I was thinking of using rep movsb in matrix multiplication code. I copy tiles of a large matrix (say 64x64 float tiles) to a contiguous float array and then do block multiplication. This makes a huge difference in performance. It is basically a memcpy over each row of the tile before the block multiply. Since the size of each row is small (256 bytes) and I know it's aligned I could use rep movsb. I will give that a try. I doubt it will help though.Pedropedrotti
For rep movsb there's no strict alignment requirements (it won't crash if everything is "extremely misaligned" like SSE will), but CPU may need to do a little extra work to handle misalignment. Also, typically CPU tries to do as much as it can in a "fetch cache line, store cache line" way (so it's a bit more like rep mov64b moving 64-byte cache lines at a time); and various CPUs have various restrictions (for how much alignment and/or overlap is needed to allow that).Trahan
REP MOVSB can't possibly have alignment issues because it works on bytes. Where you have alignment problems is when you want to move values larger than a single byte, like with REP MOVSD. That will generally be faster, but only when the source and destination are DWORD-aligned. It turns out that it isn't any easier to deal with misalignment in microcode than it is in code you write, so the misalignment penalty is still there with CISC-style string instructions, and isn't going away.Bitterling
I think the claim that there is a huge amount of startup overhead for the explicit approach is false. Many memcpy calls will be for very short regions so the libraries are generally optimized to have some not-terrible fast paths for short lengths. Even if you do do all the alignment checking, that is in fact very cheap (a couple ANDs, comparisons, etc) - almost to the point of being free. The primary cost, if any, is if your data pattern causes mispredictions in that initial code. E.g., maybe you are aligned 50% of the time, randomly, so you get a misprediction in the alignment check.Kimono
... also there is actually a lot of overhead to rep movsb and friends, usually much more than well-written explicit code. This answer already covers the details, with input from Andy Glew, who would know (most of the quotes come from this comment chain). So the string instructions have large actual overheads (dependent on copy size) and perform worse than the equivalent explicit instructions due to the lack of branch prediction in micro-code.Kimono
As I understand it, rep movsb is not very good for misaligned. Instead of doing an unaligned first and/or last vector, it's just slow the whole time. So your point 2) is bad advice. Do you have a reference for that claim?Epencephalon
BTW, glibc's memcpy uses a nice strategy for very short copies. It avoids a cleanup loop for leftover bytes because it loads vectors relative to the start and end of the buffer. Then stores. If the copy isn't a whole number of vectors, there's overlap, but that's fine. For larger copies, it aligns the store pointer. See this comment block in the source describing the strategies. (@BeeOnRope, this is a real example of how cheaply you can do the checks for small copies)Epencephalon
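
A scalar sketch of that overlapping head/tail idea, for sizes from 8 to 16 bytes. This is not glibc's code (which works with vector registers); copy_8_to_16() is a made-up name and the size range is picked only to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 8 <= n <= 16, with two 8-byte moves: one anchored at
   the start of the buffer and one at the end. For n < 16 the two
   moves overlap in the middle, so no byte-wise cleanup loop is needed. */
static void copy_8_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;

    /* Do both loads before either store. */
    memcpy(&head, src, 8);
    memcpy(&tail, (const unsigned char *)src + n - 8, 8);

    memcpy(dst, &head, 8);
    memcpy((unsigned char *)dst + n - 8, &tail, 8);
}
```

The only branch a caller needs is the range check on n; within the range the code is branch-free, and the fixed-size memcpy() calls normally compile to plain 8-byte loads and stores.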
@BeeOnRope: Both comments were addressed to @ Brendan, since I was disagreeing with this answer. Sorry for the confusion, I was just pinging you in case you were interested in seeing an example of what you were talking about in an earlier comment about the startup overhead for a vector memcpy being lowish, not to disagree with anything you said.Epencephalon
@CodyGray - in practice the alignment considerations are mostly the same for rep movsb and rep movsd (and rep movsq) on recent hardware. Sure, rep movsb conceptually works on bytes, but under the covers all of the string move instructions are trying to move larger chunks of bytes so they all benefit from better alignment (and this beneficial alignment is usually 16, 32 or 64 bytes, so not really related to the primitive sizes of the operations). It's similar to how memcpy implementations in general benefit from alignment even though they conceptually work on bytes.Kimono
@PeterCordes - I have updated my answer and demonstrated that, surprisingly, REP MOVSD/MOVSQ is the universal solution that works excellent on all processors - no ERMSB is ever required to copy large blocks of memory fast.Tucker
For some reason, gcc 7.2 with -O3 seems to like generating a call to memmove to copy exactly 7 bytes, and will even generate code for void test(char *p) { for (int i=0; i<6; i++) p[i] = p[i+1]; } that calls memmove. Are there any cases where that would be advantageous compared with movsb, an explicit loop, or using a word, halfword, and byte load and store?Kolo
