Why are memcpy() and memmove() faster than pointer increments?

I am copying N bytes from pSrc to pDest. This can be done in a single loop:

for (int i = 0; i < N; i++)
    *pDest++ = *pSrc++;

Why is this slower than memcpy or memmove? What tricks do they use to speed it up?

Machado asked 15/10, 2011 at 5:50 Comment(14)
Your loop only copies one location. I think you somehow meant to increment the pointers.Southern
Or, you could just fix it for them, like I did. And, BTW, no true C programmer ever counts from 1 to N, it's always from 0 to N-1 :-)Atelectasis
@paxdiablo: If you're looping over arrays, sure. But there are plenty of cases where looping from 1 to N is just fine. Depends on what you're doing with the data -- if you're displaying a numbered list starting at 1, for example, to a user, then starting at 1 probably makes more sense. In any case, it ignores the bigger problem that is using int as the counter when an unsigned type like size_t should be used instead.Naphthol
@Atelectasis You could also count from N to 1. On some processors that will eliminate one compare instruction as the decrement will set the appropriate bit for the branch instruction when it reaches zero.Dottydoty
I think the premise of the question is false. Modern compilers will convert this into memcpy or memmove (depending on whether they can tell if the pointers might alias).Lubbi
@David: Really? Considering the common compiler stages, I'd be surprised. memcpy is typically inlined early, and peephole optimizations applied later. I wouldn't be surprised if the resulting assembly would be identical, though.Correggio
Also, != is faster than <. ;)Antipersonnel
@onemasse: But that doesn't mean it's faster! I remember from benchmarking memcpy() on a Pentium3 that copying memory was faster with positive increments than with negative increments. The memory hierarchy just liked it better. Go figure!Beth
@Mackie Messer Of course, you shouldn't use i for addressing. But just using it as a counter should be ok. If you copy backwards you'll fool the cache if you don't use preload and stuff.Dottydoty
@onemasse: I think the moral is this: just do as everybody does and stick to the language's idioms. People, compiler and hardware will handle it better. People will recognize the intent of the loop faster. The compiler could very well apply the optimization you describe without your help. And the hardware is always optimized for the common case. Keep your code simple, stupid. You can still write clever comments. ;-)Beth
@Mackie Messer You're absolutely right. You should never optimize prematurely. With modern compilers and architectures you often don't have to optimize at all. But in embedded environments, it's often not so easy to avoid. But you should always measure the performance both before and after the optimizations, preferably with a profiler.Dottydoty
@Antipersonnel that was true on the 6502, since then it doesn't matter.Lienlienhard
Possible duplicate of C++ memcpy() vs std::copy() That does not mention memmove, but the discussion is going to be the same.Schou
Because of optimizations like those detailed here by Joe Bialek: January 11, 2021, "Building Faster AMD64 Memset Routines", msrc-blog.microsoft.com/2021/01/11/… last accessed May 17, 2021.Apostate

Because memcpy uses word-sized pointers instead of byte pointers. memcpy implementations are also often written with SIMD instructions, which make it possible to move 128 bits at a time.

SIMD instructions are assembly instructions that can perform the same operation on each element in a vector up to 16 bytes long. That includes load and store instructions.
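
For illustration, here is a minimal C sketch of that idea using SSE2 intrinsics (the function name copy_sse2 is made up; real library code adds alignment handling, wider vectors and prefetching on top of this):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Copy 16 bytes per iteration with unaligned SSE2 loads/stores,
   then finish the tail byte by byte. */
static void copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s); /* load 128 bits  */
        _mm_storeu_si128((__m128i *)d, v);               /* store 128 bits */
        s += 16;
        d += 16;
        n -= 16;
    }
    while (n--)          /* remaining 0..15 bytes */
        *d++ = *s++;
}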

Dottydoty answered 15/10, 2011 at 5:54 Comment(9)
When you turn GCC up to -O3, it will use SIMD for the loop, at least if it knows pDest and pSrc don't alias.Steelhead
I'm currently working on a Xeon Phi with 64 bytes (512 bits) SIMD, so this stuff of "up to 16 bytes" makes me smile. In addition, you must specify what CPU you're targeting for SIMD to be enabled, for example with -march=native.Obligee
Maybe I should revise my answer. :)Dottydoty
This was already outdated at posting time. AVX vectors on x86 (shipped in 2011) are 32 bytes long, and AVX-512 vectors are 64 bytes long. There are some architectures with 1024-bit or 2048-bit vectors, or even variable vector width like ARM SVE.Beanpole
@Beanpole while the instructions may have been available then, have you any evidence that memcpy uses them? It normally takes a while for the libraries to catch up, and the latest ones I can find use SSSE3 and are much more recent than 2011.Martymartyn
@PeteKirkham that means you've opened the wrong file (because platform-specific code is generally in a different place than the generic code) or downloaded an older version. For example glibc has lots of AVX and AVX-512 versions of memcpy and similar functions. In fact open source libraries adopt newer SIMD ISAs very early, because manufacturers release the extension and a simulator long before they come into hardware.Beanpole
@Beanpole accepted that it's used for the current versions, but the history for the earliest memcpy-AVX files only goes back to 2014-07-30 Ling Ma, which means it took three years for the library to catch up, and the answer was correct in 2011.Martymartyn
@PeteKirkham yes but it's irrelevant in this case. I'm just addressing the "a vector up to 16 bytes long" part, which is wrong, and I'm not saying anything about the library.Beanpole
@Beanpole fair enough. perhaps it would have been better to edit the answer and improve it rather than a comment that could be misunderstood.Martymartyn

Memory copy routines can be far more complicated and faster than a simple memory copy via pointers such as:

void simple_memory_copy(void* dst, void* src, unsigned int bytes)
{
  unsigned char* b_dst = (unsigned char*)dst;
  unsigned char* b_src = (unsigned char*)src;
  for (unsigned int i = 0; i < bytes; ++i)
    *b_dst++ = *b_src++;
}

Improvements

The first improvement one can make is to align one of the pointers on a word boundary (by word I mean native integer size, usually 32 bits/4 bytes, but it can be 64 bits/8 bytes on newer architectures) and use word-sized move/copy instructions. This requires a byte-by-byte copy until a pointer is aligned.

#include <stdint.h>  /* for uintptr_t, used in the alignment test below */

void aligned_memory_copy(void* dst, void* src, unsigned int bytes)
{
  unsigned char* b_dst = (unsigned char*)dst;
  unsigned char* b_src = (unsigned char*)src;

  // Copy bytes to align source pointer
  while (bytes > 0 && ((uintptr_t)b_src & 0x3) != 0)  /* cast: bitwise & is not valid on a pointer */
  {
    *b_dst++ = *b_src++;
    bytes--;
  }

  unsigned int* w_dst = (unsigned int*)b_dst;
  unsigned int* w_src = (unsigned int*)b_src;
  while (bytes >= 4)
  {
    *w_dst++ = *w_src++;
    bytes -= 4;
  }

  // Copy trailing bytes
  if (bytes > 0)
  {
    b_dst = (unsigned char*)w_dst;
    b_src = (unsigned char*)w_src;
    while (bytes > 0)
    {
      *b_dst++ = *b_src++;
      bytes--;
    }
  }
}

Different architectures will perform differently based on whether the source or the destination pointer is appropriately aligned. For instance, on an XScale processor I got better performance by aligning the destination pointer rather than the source pointer.

To further improve performance some loop unrolling can be done, so that more of the processor's registers are loaded with data and that means the load/store instructions can be interleaved and have their latency hidden by additional instructions (such as loop counting etc). The benefit this brings varies quite a bit by the processor, since load/store instruction latencies can be quite different.
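
As an illustration only (not taken from any particular library), here is a C sketch of a 4x-unrolled word copy; it assumes both pointers are already word-aligned and that bytes is a multiple of 16. Issuing all four loads before the four stores gives the processor independent instructions to overlap with the load latency.

void unrolled_word_copy(unsigned int* w_dst, const unsigned int* w_src,
                        unsigned int bytes)
{
    while (bytes >= 16)
    {
        unsigned int w0 = w_src[0];  /* issue all loads first...  */
        unsigned int w1 = w_src[1];
        unsigned int w2 = w_src[2];
        unsigned int w3 = w_src[3];
        w_dst[0] = w0;               /* ...then all stores        */
        w_dst[1] = w1;
        w_dst[2] = w2;
        w_dst[3] = w3;
        w_src += 4;
        w_dst += 4;
        bytes -= 16;
    }
}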

At this stage the code ends up being written in Assembly rather than C (or C++) since you need to manually place the load and store instructions to get maximum benefit of latency hiding and throughput.

Generally a whole cache line of data should be copied in one iteration of the unrolled loop.

Which brings me to the next improvement: adding pre-fetching. These are special instructions that tell the processor's cache system to load specific parts of memory into its cache. Since there is a delay between issuing the instruction and having the cache line filled, the instructions need to be placed so that the data is available just as it is about to be copied, no sooner and no later.

This means putting prefetch instructions at the start of the function as well as inside the main copy loop, with the prefetch instructions in the middle of the copy loop fetching data that will be copied several iterations later.

I can't remember, but it may also be beneficial to prefetch the destination addresses as well as the source ones.
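
As a rough sketch of where such prefetches go, using the GCC/Clang builtin __builtin_prefetch (the 256-byte prefetch distance is a made-up tuning value; a real routine would tune it for the processor's memory latency):

void prefetching_word_copy(unsigned int* dst, const unsigned int* src,
                           unsigned int words)
{
    for (unsigned int i = 0; i < words; ++i)
    {
        if ((i & 15) == 0)                        /* once per 64-byte cache line */
            __builtin_prefetch(&src[i + 64], 0);  /* read-prefetch ~256 bytes ahead;
                                                     prefetching past the end is
                                                     harmless, it never faults */
        dst[i] = src[i];
    }
}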

Factors

The main factors that affect how fast memory can be copied are:

  • The latency between the processor, its caches, and main memory.
  • The size and structure of the processor's cache lines.
  • The processor's memory move/copy instructions (latency, throughput, register size, etc).

So if you want to write an efficient and fast memory copy routine you'll need to know quite a lot about the processor and architecture you are writing for. Suffice it to say, unless you're writing on some embedded platform it would be much easier to just use the built-in memory copy routines.

Pesade answered 15/10, 2011 at 6:39 Comment(5)
Modern CPUs will detect a linear memory access pattern and start prefetching on their own. I expect that prefetch instructions would not make much difference because of that.Quintic
@Quintic On the few architectures that I've implemented memory copy routines adding the prefetch has helped measurably. While it may be true that the current generation Intel/AMD chips do prefetch far enough ahead, there are plenty of older chips and other architectures that do not.Pesade
can anyone explain "(b_src & 0x3) != 0" ? I can't understand it, and also - it won't compile (throws an error: invalid operator to binary &: unsigned char and int);Cerebellum
"(b_src & 0x3) != 0" is checking if the lowest 2 bits are not 0. So if the source pointer is aligned to a multiple of 4 bytes or not. Your compile error happens because it is treating the 0x3 as a byte not an in, you can fix that by using 0x00000003 or 0x3i (I think).Pesade
b_src & 0x3 won't compile because you're not allowed to do bitwise arithmetic on pointer types. You must cast it to (u)intptr_t firstBeanpole

memcpy can copy more than one byte at once depending on the computer's architecture. Most modern computers can work with 32 bits or more in a single processor instruction.

From one example implementation:

     * For speedy copying, optimize the common case where both pointers
     * and the length are word-aligned, and copy word-at-a-time instead
     * of byte-at-a-time. Otherwise, copy by bytes.
Godroon answered 15/10, 2011 at 5:51 Comment(3)
On a 386 (for one example), which had no on-board cache, this did make a huge difference. On most modern processors, the reads and writes will happen one cache-line at a time, and the bus to memory will usually be the bottleneck, so expect an improvement of a few percent, not anywhere close to quadruple.Vilberg
I think you should be a bit more explicit when you say "from the source". Sure, that's "the source" on some architectures, but it's certainly not on, say, a BSD or Windows machine. (And hell, even between GNU systems there's often a lot of difference in this function)Naphthol
@Billy ONeal: +1 absolutely right... there's more than one way to skin a cat. That was just one example. Fixed! Thanks for the constructive comment.Godroon

You can implement memcpy() using any of the following techniques, some dependent on your architecture for performance gains, and they will all be much faster than your code:

  1. Use larger units, such as 32-bit words instead of bytes. You may also have to deal with alignment here. On some platforms you can't read or write a 32-bit word at an unaligned memory location at all, and on other platforms you pay a massive performance penalty, so the address has to be a multiple of 4. You can take this up to 64 bits for 64-bit CPUs, or even higher using SIMD (single instruction, multiple data) instructions (MMX, SSE, etc.)

  2. You can use special CPU instructions that your compiler may not be able to generate from C. For example, on an 80386, you can use the "rep" prefix together with the "movsb" instruction to move N bytes, with N placed in the count register (see the sketch after this list). Good compilers will just do this for you, but you may be on a platform that lacks a good compiler. Note that this example tends to be a bad demonstration of speed on its own, but combined with alignment + larger-unit instructions it can be faster than almost everything else on certain CPUs.

  3. Loop unrolling -- branches can be quite expensive on some CPUs, so unrolling the loops can lower the number of branches. This is also a good technique for combining with SIMD instructions and very large sized units.
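
As an illustration of technique 2, here is a minimal sketch using GCC/Clang inline assembly on x86-64 (the function name is made up, and this is not how any particular library implements it):

#include <stddef.h>

/* "rep movsb" copies RCX bytes from [RSI] to [RDI] in microcode. */
static void rep_movsb_copy(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)  /* RDI, RSI, RCX */
                      :
                      : "memory");
}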

For example, http://www.agner.org/optimize/#asmlib has a memcpy implementation that beats most out there (by a very tiny amount). If you read the source code, it is full of inlined assembly code that pulls off all three of the above techniques, choosing among them based on what CPU you are running on.

Note, there are similar optimizations that can be made for finding bytes in a buffer too. strchr() and friends will often be faster than your hand-rolled equivalent. This is especially true for .NET and Java. For example, in .NET, the built-in String.IndexOf() is much faster than even a Boyer–Moore string search, because it uses the above optimization techniques.

Myxoma answered 15/10, 2011 at 9:39 Comment(2)
The same Agner Fog you're linking to also theorizes that loop unrolling is counterproductive on modern CPUs.Sawtelle
Most CPUs nowadays have good branch prediction, which should negate the benefit of loop unrolling in typical cases. A good optimising compiler can still use it sometimes.Cooperation

I don't know whether it is actually used in any real-world implementations of memcpy, but I think Duff's Device deserves a mention here.

From Wikipedia:

send(to, from, count)
register short *to, *from;
register count;
{
        register n = (count + 7) / 8;
        switch(count % 8) {
        case 0:      do {     *to = *from++;
        case 7:              *to = *from++;
        case 6:              *to = *from++;
        case 5:              *to = *from++;
        case 4:              *to = *from++;
        case 3:              *to = *from++;
        case 2:              *to = *from++;
        case 1:              *to = *from++;
                } while(--n > 0);
        }
}

Note that the above isn't a memcpy since it deliberately doesn't increment the to pointer. It implements a slightly different operation: writing into a memory-mapped register. See the Wikipedia article for details.

Mebane answered 15/10, 2011 at 7:44 Comment(5)
Duff's device, or just the initial jump mechanism, is a good use to copy the first 1..3 (or 1..7) bytes so that the pointers are aligned to a nicer boundary where bigger memory move instructions can be used.Pesade
@MarkByers: The code illustrates a slightly different operation (*to refers to a memory-mapped register and is deliberately not incremented -- see the linked-to article). As I thought I made clear, my answer doesn't attempt to provide an efficient memcpy, it simply mentions a rather curious technique.Mebane
@Daemin Agreed, as you said you can skip the do {} while() and the switch will be translated to a jump table by the compiler. Very useful when you want to take care of the remaining data. A warning should be mentioned about Duff's device: apparently on newer architectures (newer x86), branch prediction is so efficient that Duff's device is actually slower than a simple loop.Dottydoty
Oh no.. not Duff's device. Please don't use Duff's device. Please. Use PGO and let me compiler do loop unrolling for you where it makes sense.Naphthol
No, Duff's device is most definitely not used in any modern implementation.Madoc

Short answer:

  • cache fill
  • word-sized transfers instead of byte-sized ones where possible
  • SIMD magic
Picrate answered 15/10, 2011 at 20:52 Comment(0)

As others have said, memcpy copies in larger than 1-byte chunks. Copying in word-sized chunks is much faster. However, most implementations take it a step further and run several MOV (word) instructions before looping. The advantage of copying in, say, 8-word blocks per loop is that the loop itself is costly. This technique reduces the number of conditional branches by a factor of 8, optimizing the copy for giant blocks.

Lisettelisha answered 15/10, 2011 at 5:58 Comment(5)
I don't think this is true. You can unroll the loop, but you can't copy in a single instruction more data than addressable at a time on the target architecture. Plus, there's overhead of unrolling the loop too...Naphthol
@Billy ONeal: I don't think that's what VoidStar meant. By having several consecutive move instructions the overhead of counting the number of units is decreased.Dania
@Billy ONeal: You're missing the point. 1-word at a time is like MOV, JMP, MOV, JMP, etc. Whereas you can do MOV MOV MOV MOV JMP. I've written memcpy before and I've benchmarked a lot of ways of doing it ;)Lisettelisha
@wallyk: Perhaps. But he says "copy even larger chunks" -- which aren't really possible. If he means loop unrolling, then he should say "most implementations take it a step further and unroll the loop." The answer as written is at best misleading, at worst wrong.Naphthol
@VoidStar: Agreed --- it's better now. +1.Naphthol

The answers are great, but if you still want to implement a fast memcpy yourself, there is an interesting blog post about fast memcpy, Fast memcpy in C.

void *memcpy(void* dest, const void* src, size_t count)
{
    char* dst8 = (char*)dest;
    const char* src8 = (const char*)src;

    if (count & 1) {
        dst8[0] = src8[0];
        dst8 += 1;
        src8 += 1;
    }

    count /= 2;
    while (count--) {
        dst8[0] = src8[0];
        dst8[1] = src8[1];

        dst8 += 2;
        src8 += 2;
    }
    return dest;
}

It can be made even better by further optimizing the memory accesses.

Bentinck answered 15/10, 2011 at 7:41 Comment(0)

Because, like many library routines, it has been optimized for the architecture you are running on. Others have posted various techniques which can be used.

Given the choice, use library routines rather than rolling your own. This is a variation on DRY that I call DRO (Don't Repeat Others). Also, library routines are less likely to be wrong than your own implementation.

I have seen memory access checkers complain about out-of-bounds reads on memory or string buffers whose size was not a multiple of the word size. This is a result of the optimizations being used.

Gunnar answered 15/10, 2011 at 19:55 Comment(0)

You can look at the MacOS implementation of memset, memcpy and memmove.

At boot time, the OS determines which processor it's running on. It has built-in, specifically optimised code for each supported processor, and at boot time stores a jump instruction to the right code at a fixed read-only location.

The C memset, memcpy and memmove implementations are just a jump to that fixed location.

The implementations use different code depending on alignment of source and destination for memcpy and memmove. They obviously use all available vector capabilities. They also use non-caching variants when you copy large amounts of data, and have instructions to minimise waits for page tables. It's not just assembler code, it's assembler code written by someone with extremely good knowledge of each processor architecture.
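
As a sketch of the non-caching idea only (this is not Apple's actual code), SSE2 non-temporal stores let a very large copy bypass the cache so it does not evict the caller's working set; the sketch assumes both pointers are 16-byte aligned and the length is a multiple of 16:

#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>

static void streaming_copy(void *dst, const void *src, size_t n)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < n / 16; ++i)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));  /* store bypassing the cache */

    _mm_sfence();  /* order the streamed stores before returning */
}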

Intel also added assembler instructions that can make string operations faster. For example, there is an instruction to support strstr which does 256 byte compares in one cycle.

Madoc answered 29/2, 2020 at 11:17 Comment(2)
Apple's open source version of memset/memcpy/memmove is just a generic version which will be a lot slower than the real version using SIMDBeanpole
Where can I find the real Mac OS implementation?Prepositive
