This is a topic pretty near to my heart and recent investigations, so I'll look at it from a few angles: history, some technical notes (mostly academic), test results on my box, and finally an attempt to answer your actual question of when and where rep movsb might make sense.
Partly, this is a call to share results - if you can run Tinymembench and share the results along with details of your CPU and RAM configuration it would be great. Especially if you have a 4-channel setup, an Ivy Bridge box, a server box, etc.
History and Official Advice
The performance history of the fast string copy instructions has been a bit of a stair-step affair, i.e., periods of stagnant performance alternating with big upgrades that brought them into line with, or even made them faster than, competing approaches. For example, there was a jump in performance in Nehalem (mostly targeting startup overheads) and again in Ivy Bridge (mostly targeting total throughput for large copies). You can find decade-old insight on the difficulties of implementing the rep movs instructions from an Intel engineer in this thread.
For example, in guides preceding the introduction of Ivy Bridge, the typical advice is to avoid them or use them very carefully1.
The current (well, June 2016) guide has a variety of confusing and somewhat inconsistent advice, such as2:
The specific variant of the implementation is chosen at execution time
based on data layout, alignment and the counter (ECX) value. For
example, MOVSB/STOSB with the REP prefix should be used with counter
value less than or equal to three for best performance.
So for copies of 3 or fewer bytes? You don't need a rep prefix for that in the first place, since with a claimed startup latency of ~9 cycles you are almost certainly better off with a simple DWORD or QWORD mov with a bit of bit-twiddling to mask off the unused bytes (or perhaps with two explicit movs, one byte and one word, if you know the size is exactly three).
They go on to say:
String MOVE/STORE instructions have multiple data granularities. For
efficient data movement, larger data granularities are preferable.
This means better efficiency can be achieved by decomposing an
arbitrary counter value into a number of double words plus single byte
moves with a count value less than or equal to 3.
This certainly seems wrong on current hardware with ERMSB, where rep movsb is at least as fast as, or faster than, the rep movsd or rep movsq variants for large copies.
In general, that section (3.7.5) of the current guide contains a mix of reasonable and badly obsolete advice. This is common throughout the Intel manuals, since they are updated in an incremental fashion for each architecture (and purport to cover nearly two decades worth of architectures even in the current manual), and old sections are often not updated to replace or make conditional advice that doesn't apply to the current architecture.
They then go on to cover ERMSB explicitly in section 3.7.6.
I won't go over the remaining advice exhaustively, but I'll summarize the good parts in the "why use it" below.
Other important claims from the guide are that on Haswell, rep movsb has been enhanced to use 256-bit operations internally.
Technical Considerations
This is just a quick summary of the underlying advantages and disadvantages that the rep instructions have from an implementation standpoint.
Advantages for rep movs
- When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:
  - Avoiding the RFO request when it knows the entire cache line will be overwritten.
  - Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
- Apparently, there is no guarantee of ordering among the stores within3 a single rep movs, which can help simplify coherency traffic and simplify other aspects of the block move, versus simple mov instructions which have to obey rather strict memory ordering4.
- In principle, the rep movs instruction could take advantage of various architectural tricks that aren't exposed in the ISA. For example, architectures may have wider internal data paths than the ISA exposes5 and rep movs could use that internally.
Disadvantages
- rep movsb must implement a specific semantic which may be stronger than the underlying software requirement. In particular, memcpy forbids overlapping regions, and so may ignore that possibility, but rep movsb allows them and must produce the expected result. On current implementations this mostly affects startup overhead, but probably not large-block throughput. Similarly, rep movsb must support byte-granular copies even if you are actually using it to copy large blocks which are a multiple of some large power of 2.
- The software may have information about alignment, copy size and possible aliasing that cannot be communicated to the hardware if using rep movsb. Compilers can often determine the alignment of memory blocks6 and so can avoid much of the startup work that rep movs must do on every invocation.
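For concreteness, here is a minimal sketch of how rep movsb is usually reached from C with GCC/Clang extended inline assembly. The helper name copy_rep_movsb is hypothetical, and the memcpy fallback for non-x86 targets is an assumption added only so the snippet compiles everywhere; it is not meant as a tuned implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes from src to dst with one rep movsb.
   Assumes GCC or Clang extended inline asm on x86 / x86-64. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* (R|E)DI = destination, (R|E)SI = source, (R|E)CX = byte count.
       The instruction updates all three, hence the "+" constraints.
       The calling convention guarantees the direction flag is clear
       on function entry, so the copy runs forward. */
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
#else
    memcpy(dst, src, n); /* portable fallback (assumption, not from the text) */
#endif
    return ret;
}
```

Note that nothing about alignment or non-overlap can be passed to the instruction here, which is exactly the second disadvantage above: the microcode has to rediscover it on every call.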
Test Results
Here are test results for many different copy methods from tinymembench on my i7-6700HQ at 2.6 GHz (too bad I have the identical CPU so we aren't getting a new data point...):
C copy backwards : 8284.8 MB/s (0.3%)
C copy backwards (32 byte blocks) : 8273.9 MB/s (0.4%)
C copy backwards (64 byte blocks) : 8321.9 MB/s (0.8%)
C copy : 8863.1 MB/s (0.3%)
C copy prefetched (32 bytes step) : 8900.8 MB/s (0.3%)
C copy prefetched (64 bytes step) : 8817.5 MB/s (0.5%)
C 2-pass copy : 6492.3 MB/s (0.3%)
C 2-pass copy prefetched (32 bytes step) : 6516.0 MB/s (2.4%)
C 2-pass copy prefetched (64 bytes step) : 6520.5 MB/s (1.2%)
---
standard memcpy : 12169.8 MB/s (3.4%)
standard memset : 23479.9 MB/s (4.2%)
---
MOVSB copy : 10197.7 MB/s (1.6%)
MOVSD copy : 10177.6 MB/s (1.6%)
SSE2 copy : 8973.3 MB/s (2.5%)
SSE2 nontemporal copy : 12924.0 MB/s (1.7%)
SSE2 copy prefetched (32 bytes step) : 9014.2 MB/s (2.7%)
SSE2 copy prefetched (64 bytes step) : 8964.5 MB/s (2.3%)
SSE2 nontemporal copy prefetched (32 bytes step) : 11777.2 MB/s (5.6%)
SSE2 nontemporal copy prefetched (64 bytes step) : 11826.8 MB/s (3.2%)
SSE2 2-pass copy : 7529.5 MB/s (1.8%)
SSE2 2-pass copy prefetched (32 bytes step) : 7122.5 MB/s (1.0%)
SSE2 2-pass copy prefetched (64 bytes step) : 7214.9 MB/s (1.4%)
SSE2 2-pass nontemporal copy : 4987.0 MB/s
Some key takeaways:
- The rep movs methods are faster than all the other methods which aren't "non-temporal"7, and considerably faster than the "C" approaches which copy 8 bytes at a time.
- The "non-temporal" methods are faster than the rep movs ones, by up to about 26%, but that's a much smaller delta than the one you reported (26 GB/s vs 15 GB/s = ~73%).
- If you are not using non-temporal stores, using 8-byte copies from C is pretty much just as good as 128-bit wide SSE load/stores. That's because a good copy loop can generate enough memory pressure to saturate the bandwidth (e.g., 2.6 GHz * 1 store/cycle * 8 bytes = 20.8 GB/s for stores).
- There are no explicit 256-bit algorithms in tinymembench (except probably the "standard" memcpy) but it probably doesn't matter due to the above note.
- The increased throughput of the non-temporal store approaches over the temporal ones is about 1.45x, which is very close to the 1.5x you would expect if NT eliminates 1 out of 3 transfers (i.e., 1 read, 1 write for NT vs 2 reads, 1 write). The rep movs approaches lie in the middle.
- The combination of fairly low memory latency and modest 2-channel bandwidth means this particular chip happens to be able to saturate its memory bandwidth from a single thread, which changes the behavior dramatically.
- rep movsd seems to use the same magic as rep movsb on this chip. That's interesting because ERMSB only explicitly targets movsb, and earlier tests on earlier archs with ERMSB show movsb performing much faster than movsd. This is mostly academic since movsb is more general than movsd anyway.
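The 1.5x figure in the takeaways comes from a very simple traffic count, which can be written down as a sketch. The model below is an illustrative assumption (every copied cache line costs a fixed number of DRAM transfers and nothing else matters), and the names are made up for the example.

```c
/* Toy bus-traffic model for a large copy: the copy rate is the DRAM
   bandwidth divided by the number of DRAM transfers per copied line.
   This is an illustrative assumption, not a measurement. */

/* Temporal stores: read source, read destination (RFO), write destination. */
enum { TRANSFERS_TEMPORAL = 3 };
/* Non-temporal stores: read source, write destination (no RFO). */
enum { TRANSFERS_NT = 2 };

/* Achievable copy rate in GB/s for a given DRAM bandwidth in GB/s. */
static double copy_rate_gbs(double dram_gbs, int transfers_per_line)
{
    return dram_gbs / transfers_per_line;
}

/* Expected speedup of NT stores over temporal stores when DRAM-bound. */
static double nt_speedup(void)
{
    return (double)TRANSFERS_TEMPORAL / TRANSFERS_NT; /* 3 / 2 */
}
```

For comparison, the measured ratio in the table above is 12924.0 / 8973.3, roughly 1.44, for the plain SSE2 copy versus its non-temporal variant.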
Haswell
Looking at the Haswell results kindly provided by iwillnotexist in the comments, we see the same general trends (most relevant results extracted):
C copy : 6777.8 MB/s (0.4%)
standard memcpy : 10487.3 MB/s (0.5%)
MOVSB copy : 9393.9 MB/s (0.2%)
MOVSD copy : 9155.0 MB/s (1.6%)
SSE2 copy : 6780.5 MB/s (0.4%)
SSE2 nontemporal copy : 10688.2 MB/s (0.3%)
The rep movsb approach is still slower than the non-temporal memcpy, but only by about 14% here (compared to ~26% in the Skylake test). The advantage of the NT techniques above their temporal cousins is now ~57%, even a bit more than the theoretical benefit of the bandwidth reduction.
When should you use rep movs?
Finally, a stab at your actual question: when or why should you use it? It draws on the above and introduces a few new ideas. Unfortunately there is no simple answer: you'll have to trade off various factors, including some which you probably can't even know exactly, such as future developments.
Note that the alternative to rep movsb may be the optimized libc memcpy (including copies inlined by the compiler), or it may be a hand-rolled memcpy version. Some of the benefits below apply only in comparison to one or the other of these alternatives (e.g., "simplicity" helps against a hand-rolled version, but not against built-in memcpy), but some apply to both.
Restrictions on available instructions
In some environments there is a restriction on certain instructions or using certain registers. For example, in the Linux kernel, use of SSE/AVX or FP registers is generally disallowed. Therefore most of the optimized memcpy variants cannot be used as they rely on SSE or AVX registers, and a plain 64-bit mov-based copy is used on x86. For these platforms, using rep movsb allows most of the performance of an optimized memcpy without breaking the restriction on SIMD code.
A more general example might be code that has to target many generations of hardware, and which doesn't use hardware-specific dispatching (e.g., using cpuid). Here you might be forced to use only older instruction sets, which rules out any AVX, etc. rep movsb might be a good approach here since it allows "hidden" access to wider loads and stores without using new instructions. If you target pre-ERMSB hardware you'd have to see if rep movsb performance is acceptable there, though...
Future Proofing
A nice aspect of rep movsb is that it can, in theory, take advantage of architectural improvements on future architectures, without source changes, in a way that explicit moves cannot. For example, when 256-bit data paths were introduced, rep movsb was able to take advantage of them (as claimed by Intel) without any changes needed to the software. Software using 128-bit moves (which was optimal prior to Haswell) would have to be modified and recompiled.
So it is both a software maintenance benefit (no need to change source) and a benefit for existing binaries (no need to deploy new binaries to take advantage of the improvement).
How important this is depends on your maintenance model (e.g., how often new binaries are deployed in practice) and on a very difficult-to-make judgement of how fast these instructions are likely to be in the future. At least Intel is kind of guiding users in this direction though, by committing to at least reasonable performance in the future (15.3.3.6):
REP MOVSB and REP STOSB will continue to perform reasonably well on
future processors.
Overlapping with subsequent work
This benefit won't show up in a plain memcpy benchmark of course, which by definition doesn't have subsequent work to overlap, so the magnitude of the benefit would have to be carefully measured in a real-world scenario. Taking maximum advantage might require re-organization of the code surrounding the memcpy.
This benefit is pointed out by Intel in their optimization manual (section 11.16.3.4) and in their words:
When the count is known to be at least a thousand byte or more, using
enhanced REP MOVSB/STOSB can provide another advantage to amortize the
cost of the non-consuming code. The heuristic can be understood
using a value of Cnt = 4096 and memset() as example:
• A 256-bit SIMD implementation of memset() will need to issue/execute
retire 128 instances of 32- byte store operation with VMOVDQA, before
the non-consuming instruction sequences can make their way to
retirement.
• An instance of enhanced REP STOSB with ECX= 4096 is decoded as a
long micro-op flow provided by hardware, but retires as one
instruction. There are many store_data operation that must complete
before the result of memset() can be consumed. Because the completion
of store data operation is de-coupled from program-order retirement, a
substantial part of the non-consuming code stream can process through
the issue/execute and retirement, essentially cost-free if the
non-consuming sequence does not compete for store buffer resources.
So Intel is saying that after some uops from the code after rep movsb have issued, but while lots of stores are still in flight and the rep movsb as a whole hasn't retired yet, uops from following instructions can make more progress through the out-of-order machinery than they could if that code came after a copy loop.
The uops from an explicit load and store loop all have to actually retire separately in program order. That has to happen to make room in the ROB for following uops.
There doesn't seem to be much detailed information about how very long microcoded instructions like rep movsb work, exactly. We don't know exactly how micro-code branches request a different stream of uops from the microcode sequencer, or how the uops retire. If the individual uops don't have to retire separately, perhaps the whole instruction only takes up one slot in the ROB?
When the front-end that feeds the OoO machinery sees a rep movsb instruction in the uop cache, it activates the Microcode Sequencer ROM (MS-ROM) to send microcode uops into the queue that feeds the issue/rename stage. It's probably not possible for any other uops to mix in with that and issue/execute8 while rep movsb is still issuing, but subsequent instructions can be fetched/decoded and issue right after the last rep movsb uop does, while some of the copy hasn't executed yet.
This is only useful if at least some of your subsequent code doesn't depend on the result of the memcpy (which isn't unusual).
Now, the size of this benefit is limited: at most you can execute N instructions (uops actually) beyond the slow rep movsb instruction, at which point you'll stall, where N is the ROB size. With current ROB sizes of ~200 (192 on Haswell, 224 on Skylake), that's a maximum benefit of ~200 cycles of free work for subsequent code with an IPC of 1. In 200 cycles you can copy somewhere around 800 bytes at 10 GB/s, so for copies of that size you may get free work close to the cost of the copy (in a way making the copy free).
As copy sizes get much larger, however, the relative importance of this diminishes rapidly (e.g., if you are copying 80 KB instead, the free work is only 1% of the copy cost). Still, it is quite interesting for modest-sized copies.
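The back-of-the-envelope bound above can be spelled out as a small sketch. The function names are hypothetical, and the 2.6 GHz clock used in the usage note is an assumption taken from my test box rather than something the bound itself depends on.

```c
/* Upper bound on the copy work that can be hidden behind a rep movsb that
   has not yet retired: following code can run ahead by at most one ROB's
   worth of uops before it stalls. */
static double overlap_bytes(double rob_uops, double ipc,
                            double core_ghz, double copy_gbs)
{
    double cycles = rob_uops / ipc;  /* cycles of "free" work           */
    double ns = cycles / core_ghz;   /* 1 GHz  == 1 cycle per ns        */
    return ns * copy_gbs;            /* 1 GB/s == 1 byte per ns         */
}

/* Fraction of a copy of copy_bytes that the hidden work can cover. */
static double overlap_fraction(double copy_bytes, double hidden_bytes)
{
    return hidden_bytes >= copy_bytes ? 1.0 : hidden_bytes / copy_bytes;
}
```

With overlap_bytes(200, 1.0, 2.6, 10.0) this gives a little under 800 bytes, and overlap_fraction(80000, ...) of that result comes out at about 1%, matching the figures in the text.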
Copy loops don't totally block subsequent instructions from executing, either. Intel does not go into detail on the size of the benefit, or on what kinds of copies or surrounding code benefit most (hot or cold destination or source; high-ILP or low-ILP, high-latency code afterwards).
Code Size
The executed code size (a few bytes) is microscopic compared to a typical optimized memcpy routine. If performance is at all limited by i-cache (including uop cache) misses, the reduced code size might be of benefit.
Again, we can bound the magnitude of this benefit based on the size of the copy. I won't actually work it out numerically, but the intuition is that reducing the dynamic code size by B bytes can save at most C * B cache-misses, for some constant C. Every call to memcpy incurs the cache miss cost (or benefit) once, but the advantage of higher throughput scales with the number of bytes copied. So for large transfers, higher throughput will dominate the cache effects.
Again, this is not something that will show up in a plain benchmark, where the entire loop will no doubt fit in the uop cache. You'll need a real-world, in-place test to evaluate this effect.
Architecture Specific Optimization
You reported that on your hardware, rep movsb was considerably slower than the platform memcpy. However, even here there are reports of the opposite result on earlier hardware (like Ivy Bridge).
That's entirely plausible, since it seems that the string move operations get love periodically - but not every generation, so it may well be faster or at least tied (at which point it may win based on other advantages) on the architectures where it has been brought up to date, only to fall behind in subsequent hardware.
Quoting Andy Glew, who should know a thing or two about this after implementing these on the P6:
the big weakness of doing fast strings in microcode was [...] the
microcode fell out of tune with every generation, getting slower and
slower until somebody got around to fixing it. Just like a library men
copy falls out of tune. I suppose that it is possible that one of the
missed opportunities was to use 128-bit loads and stores when they
became available, and so on.
In that case, it can be seen as just another "platform specific" optimization to apply in the typical every-trick-in-the-book memcpy routines you find in standard libraries and JIT compilers: but only for use on architectures where it is better. For JIT or AOT-compiled stuff this is easy. Statically compiled binaries do require platform-specific dispatch, but that often already exists (sometimes implemented at link time), or the -mtune argument can be used to make a static decision.
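As a rough sketch of what the run-time side of such dispatch might look like: ERMSB is advertised by CPUID leaf 7, subleaf 0, EBX bit 9, and GCC/Clang expose the instruction through cpuid.h. The function and enum names below are hypothetical, and treating "ERMS bit set" as "rep movsb is the better choice" is an assumption for illustration; as discussed above, you would want to measure per architecture.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helper macros for the cpuid instruction */
#endif

/* Returns true if the CPU reports Enhanced REP MOVSB/STOSB (ERMS). */
static bool cpu_has_erms(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_max(0, 0) < 7)   /* leaf 7 not available */
        return false;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return ((ebx >> 9) & 1u) != 0;   /* CPUID.(EAX=7,ECX=0):EBX[9] */
#else
    return false;                    /* not an x86 target */
#endif
}

/* Hypothetical strategy selector for a dispatching memcpy. */
typedef enum { COPY_LIBC_MEMCPY, COPY_REP_MOVSB } copy_strategy;

static copy_strategy pick_copy_strategy(void)
{
    /* Assumption: prefer rep movsb wherever ERMS is advertised. */
    return cpu_has_erms() ? COPY_REP_MOVSB : COPY_LIBC_MEMCPY;
}
```

In practice the selection would be done once at startup (or by the dynamic linker) rather than on every call.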
Simplicity
Even on Skylake, where it seems like it has fallen behind the absolute fastest non-temporal techniques, it is still faster than most approaches and is very simple. This means less time in validation, fewer mystery bugs, less time tuning and updating a monster memcpy implementation (or, conversely, less dependency on the whims of the standard library implementors if you rely on that).
Latency Bound Platforms
Memory throughput bound algorithms9 can actually be operating in two main overall regimes: DRAM bandwidth bound or concurrency/latency bound.
The first mode is the one that you are probably familiar with: the DRAM subsystem has a certain theoretical bandwidth that you can calculate pretty easily based on the number of channels, data rate/width and frequency. For example, my DDR4-2133 system with 2 channels has a max bandwidth of 2.133 * 8 * 2 = 34.1 GB/s, same as reported on ARK.
You won't sustain more than that rate from DRAM (and usually somewhat less due to various inefficiencies) added across all cores on the socket (i.e., it is a global limit for single-socket systems).
The other limit is imposed by how many concurrent requests a core can actually issue to the memory subsystem. Imagine if a core could only have 1 request in progress at once, for a 64-byte cache line - when the request completed, you could issue another. Assume also very fast 50ns memory latency. Then despite the large 34.1 GB/s DRAM bandwidth, you'd actually only get 64 bytes / 50 ns = 1.28 GB/s, or less than 4% of the max bandwidth.
In practice, cores can issue more than one request at a time, but not an unlimited number. It is usually understood that there are only 10 line fill buffers per core between the L1 and the rest of the memory hierarchy, and perhaps 16 or so fill buffers between L2 and DRAM. Prefetching competes for the same resources, but at least helps reduce the effective latency. For more details look at any of the great posts Dr. Bandwidth has written on the topic, mostly on the Intel forums.
Still, most recent CPUs are limited by this factor, not the RAM bandwidth. Typically they achieve 12 - 20 GB/s per core, while the RAM bandwidth may be 50+ GB/s (on a 4 channel system). Only some recent gen 2-channel "client" cores, which seem to have a better uncore and perhaps more line buffers, can hit the DRAM limit on a single core, and our Skylake chips seem to be one of them.
Now of course, there is a reason Intel designs systems with 50 GB/s DRAM bandwidth, while only being able to sustain < 20 GB/s per core due to concurrency limits: the former limit is socket-wide and the latter is per core. So each core on an 8 core system can push 20 GB/s worth of requests, at which point they will be DRAM limited again.
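The two regimes can be captured in a few lines, as a sketch: per-core bandwidth is the smaller of the concurrency limit (outstanding lines times line size, divided by latency) and the DRAM limit. The function name is hypothetical, and the 50 ns latency and buffer counts are simply the illustrative figures used above, not measurements.

```c
/* Per-core achievable bandwidth in GB/s under a simple two-limit model:
   - concurrency limit: outstanding_lines * 64 bytes in flight per latency
   - DRAM limit: the socket-wide theoretical bandwidth */
static double core_bw_gbs(int outstanding_lines, double latency_ns,
                          double dram_gbs)
{
    const double line_bytes = 64.0;
    /* bytes per ns is numerically equal to GB/s */
    double concurrency_bw = outstanding_lines * line_bytes / latency_ns;
    return concurrency_bw < dram_gbs ? concurrency_bw : dram_gbs;
}
```

core_bw_gbs(1, 50.0, 34.1) reproduces the 1.28 GB/s example, while core_bw_gbs(10, 50.0, 34.1) gives 12.8 GB/s, which lands at the low end of the 12 - 20 GB/s per-core range quoted above. Only with enough requests in flight does the result get clamped by the DRAM figure.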
Why am I going on and on about this? Because the best memcpy implementation often depends on which regime you are operating in. Once you are DRAM BW limited (as our chips apparently are, but most aren't on a single core), using non-temporal writes becomes very important since it saves the read-for-ownership that normally wastes 1/3 of your bandwidth. You see that exactly in the test results above: the memcpy implementations that don't use NT stores lose 1/3 of their bandwidth.
If you are concurrency limited, however, the situation equalizes and sometimes reverses. You have DRAM bandwidth to spare, so NT stores don't help, and they can even hurt, since they may increase the latency: the handoff time for the line buffer may be longer than in a scenario where prefetch brings the RFO line into LLC (or even L2) and then the store completes in LLC for an effective lower latency. Finally, server uncores tend to have much slower NT stores than client ones (and high bandwidth), which accentuates this effect.
So on other platforms you might find that NT stores are less useful (at least when you care about single-threaded performance) and perhaps rep movsb wins there (if it gets the best of both worlds).
Really, this last item is a call for more testing. I know that NT stores lose their apparent advantage for single-threaded tests on most archs (including current server archs), but I don't know how rep movsb will perform relatively...
References
Other good sources of info not integrated in the above.
comp.arch investigation of rep movsb versus alternatives. Lots of good notes about branch prediction, and an implementation of the approach I've often suggested for small blocks: using overlapping first and/or last read/writes rather than trying to write only exactly the required number of bytes (for example, implementing all copies from 9 to 16 bytes as two 8-byte copies which might overlap in up to 7 bytes).
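The overlapping trick for small blocks is easy to show in C. This is a minimal sketch of the idea, not the code from that thread; the function name is hypothetical, and it also accepts exactly 8 bytes (where the two copies overlap completely) in addition to the 9 to 16 byte range mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes (8 <= n <= 16) as two possibly overlapping 8-byte moves:
   one anchored at the start and one anchored at the end of the block. */
static void copy_8_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;
    assert(n >= 8 && n <= 16);
    /* Fixed-size memcpy calls compile to single 8-byte loads/stores. */
    memcpy(&head, src, 8);
    memcpy(&tail, (const unsigned char *)src + n - 8, 8);
    /* Both loads happen before either store, so the result is correct
       even if src and dst overlap. */
    memcpy(dst, &head, 8);
    memcpy((unsigned char *)dst + n - 8, &tail, 8);
}
```

There is no branch on the exact length inside the range, which is the point: one code path covers every size from 8 to 16 bytes.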
1 Presumably the intention is to restrict it to cases where, for example, code-size is very important.
2 See Section 3.7.5: REP Prefix and Data Movement.
3 It is key to note this applies only for the various stores within the single instruction itself: once complete, the block of stores still appears ordered with respect to prior and subsequent stores. So code can see stores from the rep movs out of order with respect to each other but not with respect to prior or subsequent stores (and it's the latter guarantee you usually need). It will only be a problem if you use the end of the copy destination as a synchronization flag, instead of a separate store.
4 Note that non-temporal discrete stores also avoid most of the ordering requirements, although in practice rep movs has even more freedom since there are still some ordering constraints on WC/NT stores.
5 This was common in the latter part of the 32-bit era, where many chips had 64-bit data paths (e.g., to support FPUs which had support for the 64-bit double type). Today, "neutered" chips such as the Pentium or Celeron brands have AVX disabled, but presumably rep movs microcode can still use 256b loads/stores.
6 E.g., due to language alignment rules, alignment attributes or operators, aliasing rules or other information determined at compile time. In the case of alignment, even if the exact alignment can't be determined, they may at least be able to hoist alignment checks out of loops or otherwise eliminate redundant checks.
7 I'm making the assumption that "standard" memcpy is choosing a non-temporal approach, which is highly likely for this size of buffer.
8 That isn't necessarily obvious, since it could be the case that the uop stream that is generated by the rep movsb simply monopolizes dispatch and then it would look very much like the explicit mov case. It seems that it doesn't work like that however - uops from subsequent instructions can mingle with uops from the microcoded rep movsb.
9 I.e., those which can issue a large number of independent memory requests and hence saturate the available DRAM-to-core bandwidth, of which memcpy would be a poster child (and as opposed to purely latency bound loads such as pointer chasing).