faster alternative to memcpy?
Asked Answered
G

16

82

I have a function that is doing memcpy, but it's taking up an enormous number of cycles. Is there a faster alternative/approach than using memcpy to move a piece of memory?

Gui answered 3/6, 2010 at 7:5 Comment(3)
It's generally faster not to make a copy at all. Whether you can adapt your function to not copy I don't know, but it's worth looking into.Imalda
Short answer: maybe, it is possible. Offer more details, such as the architecture and platform. In the embedded world it is quite common to rewrite some functions from libc that do not perform well.Microparasite
Is swapping pointers an option?Fruiterer
H
165

memcpy is likely to be the fastest way you can copy bytes around in memory. If you need something faster - try figuring out a way of not copying things around, e.g. swap pointers only, not the data itself.
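
For illustration only (my sketch, not part of the original answer): if two buffers merely need to exchange contents, swapping the owning objects avoids touching the bytes at all.

#include <utility>
#include <vector>

void exchangeBuffers(std::vector<char> &a, std::vector<char> &b) {
  // std::swap on vectors exchanges the internal pointers/sizes in O(1);
  // no byte of the payload is copied.
  std::swap(a, b);
}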

Harbot answered 3/6, 2010 at 7:10 Comment(10)
+1, We recently had an issue when some of our code SUDDENLY slowed down tremendously and consumed lots of extra memory when processing a certain file. Turned out the file had some huge metadata block while other files had no metadata or small blocks. And that metadata was copied, copied, copied, consuming both time and memory. Replaced copying with pass-by-const-reference.Flautist
It's a good question about faster memcpy, but this answer provides a workaround, not an answer. E.g. software.intel.com/en-us/articles/memcpy-performance explains some pretty serious reasons why memcpy is often much less efficient than it could be.Heideheidegger
Could it be possible to use a Copy on Write technique, either at the low level or deliberately in code? Would you need memory chunks of similar sizes to integer multiples of pages? Then you just leave both pointers pointing in real life to the same memory and let the memory manager make copies of pages as it needs to when data is changed.Haubergeon
@Flautist : that's a problem you wouldn't have had in the first place in java or C#. I find it crazy that "slow" JIT-ed languages are by default faster than low level languages like C++. And the only reason is that in java/C# it is difficult to copy something, whereas in C++, copy has extensive compiler support. This is both great and dangerous.Beverleebeverley
@DS: That link appears to have been moved behind some sort of paywall.Shayshaya
The earlier link to Intel post on memcpy no longer seems public, but the article is available here and here.Heideheidegger
this is very far from correct even today. memcpy is usually naive - certainly not the slowest way to copy memory around, but usually quite easy to beat with some loop unrolling, and you can go even further with assembler.Rolypoly
I would assume that before asking this question, all possibilities for "not copying" have been exhausted.Rusch
This answer does not answer the question. The question is a valid question. I would ask stack overflow to remove the "answered" flag.Tacitus
the answer provided some useful suggestions, but it was not a valid answer to the question. I have tested several implementations in assembly, basically using prefetch and bulk operations, and it made a significant improvement -- this is a comment to give some clue, not an answerShorn
R
65

This is an answer for x86_64 with AVX2 instruction set present. Though something similar may apply for ARM/AArch64 with SIMD.

On a Ryzen 1800X with a single memory channel filled completely (2 slots, 16 GB DDR4 in each), the following code is 1.56 times faster than memcpy() on the MSVC++2017 compiler. If you fill both memory channels with 2 DDR4 modules, i.e. you have all 4 DDR4 slots busy, you may get a further 2 times faster memory copy. For triple-(quad-)channel memory systems, you can get a further 1.5 (2.0) times faster memory copy if the code is extended to analogous AVX512 code. With AVX2 only, triple/quad-channel systems with all slots busy are not expected to be faster, because to load them fully you need to load/store more than 32 bytes at once (48 bytes for triple- and 64 bytes for quad-channel systems), while AVX2 can load/store no more than 32 bytes at once. Though multithreading on some systems can alleviate this without AVX512 or even AVX2.

So here is the copy code that assumes you are copying a large block of memory whose size is a multiple of 32 and the block is 32-byte aligned.

For blocks whose size is not a multiple of 32 or that are not 32-byte aligned, prologue/epilogue code can be written, reducing the width to 16 (SSE4.1), 8, 4, 2 and finally 1 byte at a time for the block head and tail. Also, in the middle, a local array of 2-3 __m256i values can be used as a proxy between aligned reads from the source and aligned writes to the destination.

#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
/* ... */
void fastMemcpy(void *pvDest, void *pvSrc, size_t nBytes) {
  assert(nBytes % 32 == 0);
  assert((intptr_t(pvDest) & 31) == 0);
  assert((intptr_t(pvSrc) & 31) == 0);
  const __m256i *pSrc = reinterpret_cast<const __m256i*>(pvSrc);
  __m256i *pDest = reinterpret_cast<__m256i*>(pvDest);
  int64_t nVects = nBytes / sizeof(*pSrc);
  // Non-temporal (streaming) load/store: the copy bypasses the CPU cache
  for (; nVects > 0; nVects--, pSrc++, pDest++) {
    const __m256i loaded = _mm256_stream_load_si256(pSrc);
    _mm256_stream_si256(pDest, loaded);
  }
  // Make the non-temporal stores globally visible before returning
  _mm_sfence();
}
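
A minimal calling sketch (my addition, not part of the original answer), assuming C++17 std::aligned_alloc to obtain the 32-byte-aligned, multiple-of-32-sized buffers the function requires (on MSVC, _aligned_malloc/_aligned_free would be the equivalent):

#include <cstdlib>   // std::aligned_alloc, std::free

int main() {
  constexpr size_t kBytes = 1u << 20;          // 1 MiB, a multiple of 32
  void *src = std::aligned_alloc(32, kBytes);  // 32-byte-aligned source
  void *dst = std::aligned_alloc(32, kBytes);  // 32-byte-aligned destination
  // ... fill src ...
  fastMemcpy(dst, src, kBytes);
  std::free(src);
  std::free(dst);
}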

A key feature of fastMemcpy() above is that it bypasses the CPU cache when copying: when the CPU cache is involved (i.e. AVX instructions without _stream_ are used), the copy speed drops severalfold on my system.

My DDR4 memory is 2.6 GHz CL13. So when copying 8 GB of data from one array to another I got the following speeds:

memcpy(): 17,208,004,271 bytes/sec.
Stream copy: 26,842,874,528 bytes/sec.

Note that in these measurements the total size of both input and output buffers is divided by the number of seconds elapsed, because for each byte of the array there are 2 memory accesses: one to read the byte from the input array, another to write the byte to the output array. In other words, when copying 8 GB from one array to another, you do 16 GB worth of memory access operations.

Moderate multithreading can further improve performance about 1.44 times, so total increase over memcpy() reaches 2.55 times on my machine. Here's how stream copy performance depends on the number of threads used on my machine:

Stream copy 1 threads: 27114820909.821 bytes/sec
Stream copy 2 threads: 37093291383.193 bytes/sec
Stream copy 3 threads: 39133652655.437 bytes/sec
Stream copy 4 threads: 39087442742.603 bytes/sec
Stream copy 5 threads: 39184708231.360 bytes/sec
Stream copy 6 threads: 38294071248.022 bytes/sec
Stream copy 7 threads: 38015877356.925 bytes/sec
Stream copy 8 threads: 38049387471.070 bytes/sec
Stream copy 9 threads: 38044753158.979 bytes/sec
Stream copy 10 threads: 37261031309.915 bytes/sec
Stream copy 11 threads: 35868511432.914 bytes/sec
Stream copy 12 threads: 36124795895.452 bytes/sec
Stream copy 13 threads: 36321153287.851 bytes/sec
Stream copy 14 threads: 36211294266.431 bytes/sec
Stream copy 15 threads: 35032645421.251 bytes/sec
Stream copy 16 threads: 33590712593.876 bytes/sec

The code is:

#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

void AsyncStreamCopy(__m256i *pDest, const __m256i *pSrc, int64_t nVects) {
  for (; nVects > 0; nVects--, pSrc++, pDest++) {
    const __m256i loaded = _mm256_stream_load_si256(pSrc);
    _mm256_stream_si256(pDest, loaded);
  }
}

void BenchmarkMultithreadStreamCopy(double *gpdOutput, const double *gpdInput, const int64_t cnDoubles) {
  assert((cnDoubles * sizeof(double)) % sizeof(__m256i) == 0);
  const uint32_t maxThreads = std::thread::hardware_concurrency();
  std::vector<std::thread> thrs;
  thrs.reserve(maxThreads + 1);

  const __m256i *pSrc = reinterpret_cast<const __m256i*>(gpdInput);
  __m256i *pDest = reinterpret_cast<__m256i*>(gpdOutput);
  const int64_t nVects = cnDoubles * sizeof(*gpdInput) / sizeof(*pSrc);

  for (uint32_t nThreads = 1; nThreads <= maxThreads; nThreads++) {
    auto start = std::chrono::high_resolution_clock::now();
    lldiv_t perWorker = lldiv((long long)nVects, (long long)nThreads);
    int64_t nextStart = 0;
    for (uint32_t i = 0; i < nThreads; i++) {
      const int64_t curStart = nextStart;
      nextStart += perWorker.quot;
      if ((long long)i < perWorker.rem) {
        nextStart++;
      }
      thrs.emplace_back(AsyncStreamCopy, pDest + curStart, pSrc+curStart, nextStart-curStart);
    }
    for (uint32_t i = 0; i < nThreads; i++) {
      thrs[i].join();
    }
    _mm_sfence();
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    printf("Stream copy %d threads: %.3lf bytes/sec\n", (int)nThreads, cnDoubles * 2 * sizeof(double) / nSec);

    thrs.clear();
  }
}

UPDATE 2023-01-18: I don't have that system anymore, but the 2666MHz DDR4 is marked PC4-21300U, meaning 22334668800 bytes/second from one RAM slot. As I had 2 RAM slots, the max bandwidth was 44669337600 bytes/second. And the approach with SIMD and multithreading achieved 87.72% of the theoretical bandwidth when using 5 threads.

Rusch answered 6/7, 2017 at 12:20 Comment(7)
awesome, once I ran into a guide, which was written for Intel x86-64 assembly language using prefetch instructions or something, but I failed to recall what they are exactly... what a coincidence, just found in this thread, by @2009004, final link #1715724Shorn
_mm256_stream_load_si256 only does anything special if copying from WC memory regions (e.g. from video RAM). Otherwise it's just a slower (1 extra uop) vmovdqa on memory you allocated normally (which will be WB = write-back cacheable, strongly ordered, and movntdqa loads, unlike NT stores, don't override the strong ordering). You can't bypass cache for reads from normal memory, only sometimes minimize pollution with NT prefetch. (But that's hard to tune and depends on the machine, not just the code.)Archduchy
Enhanced REP MOVSB for memcpy has some details on why NT stores (or rep movsb on an ERMSB CPU) can be a win for huge copies. For small to medium copies, bypassing the cache is a big downside if you're going to read the memory again any time soon.Archduchy
A good memcpy (like glibc's on GNU/Linux) will use NT stores above a certain size threshold, or simply use rep movsb on some CPUs. If your C implementation's memcpy doesn't already do that, or you know this copy should be non-temporal, then yeah it could make sense to do it manually.Archduchy
If your 2 sticks of RAM are installed correctly, one DIMM on each channel, you're already using dual channel. Another pair of DIMMs won't make it faster.Archduchy
What would be the theoretical max memory bandwidth (GB/s) for your configured system?Yukoyukon
@sham1810, I don't have that system anymore, but the 2666MHz DDR4 is marked PC4-21300U, meaning 22334668800 bytes/second from one RAM slot. As I had 2 RAM slots, the max bandwidth was 44669337600 bytes/second. And the approach with SIMD and multithreading achieved 87.72% of the bandwidth.Rusch
M
13

Please offer us more details. On the i386 architecture it is very possible that memcpy is the fastest way of copying. But on a different architecture, for which the compiler doesn't have an optimized version, it is best that you rewrite your memcpy function. I did this on a custom ARM architecture using assembly language. If you transfer BIG chunks of memory then DMA is probably the answer you are looking for.

Please offer more details - architecture, operating system (if relevant).

Microparasite answered 3/6, 2010 at 8:53 Comment(2)
For ARM the libc impl is now faster than what you will be able to create yourself. For small copies (anything less than a page) it can be faster to use an ASM loop inside your functions. But, for large copies you will not be able to beat the libc impl, because different processors have slightly different "most optimal" code paths. For example a Cortex-A8 works best with NEON copy instructions, but a Cortex-A9 is faster with ldm/stm ARM instructions. You cannot write one piece of code that is fast for both processors, but you can just call memcpy for large buffers.Airdrop
@MoDJ: I wish the standard C library had included a few different memcpy variants with generally-identical semantics in cases where all yielded defined behavior, but different optimized cases and--in some cases--restrictions to aligned-vs-aligned usage. If code will typically need to copy small numbers of bytes or known-to-be-aligned words, a naive character-at-a-time implementation could do the job in less time than some fancier memcpy() implementations would require to decide on a course of action.Serpentiform
F
8

Usually the standard library shipped with the compiler will implement memcpy() the fastest way possible for the target platform already.

Flautist answered 3/6, 2010 at 7:8 Comment(0)
R
8

Actually, memcpy is NOT the fastest way, especially if you call it many times. I also had some code that I really needed to speed up, and memcpy is slow because it has too many unnecessary checks. For example, it checks to see whether the destination and source memory blocks overlap and whether it should start copying from the back of the block rather than the front. If you do not care about such considerations, you can certainly do significantly better. I have some code, but here is perhaps an even better version:

Very fast memcpy for image processing?

If you search, you can find other implementations as well. But for true speed you need an assembly version.

Rottenstone answered 24/1, 2013 at 20:59 Comment(4)
I tried out code similar to this using SSE2. Turns out it was slower on my AMD system than the builtin by a factor of 4. It's always better not to copy if you can help it.Halfmast
Although memmove must check for and handle overlap, memcpy is not required to do so. The bigger problem is that in order to be efficient when copying large blocks, implementations of memcpy need to select a copying approach before they can begin work. If code needs to be able to copy an arbitrary number of bytes, but that number will be one 90% of the time, two 9% of the time, three 0.9% of the time, etc., and the values of count, dest, and src won't be needed afterward, then an in-lined if (count) do *dest++ = *src++; while(--count > 0); could do better than a "smarter" routine.Serpentiform
BTW, on some embedded systems, another reason memcpy may not be the fastest approach is that a DMA controller may sometimes be able to copy a block of memory with less overhead than the CPU, but the most efficient way to do the copy might be to start the DMA and then do other processing while the DMA is running. On a system with separate front-end code and data buses, it may be possible to configure the DMA so that it will copy data on every cycle when the CPU doesn't need the data bus for anything else. This may achieve much better performance than using the CPU for the copy, using...Serpentiform
...start_memcpy() and await_memcpy_complete() functions, but any code would generally have to be customized for particular application requirements and nothing like that is included in the standard library.Serpentiform
E
4

Sometimes functions like memcpy, memset, ... are implemented in two different ways:

  • once as a real function
  • once as some assembly that's immediately inlined

Not all compilers use the inlined-assembly version by default; your compiler may use the function variant instead, causing some overhead because of the function call. Check your compiler to see how to get the intrinsic variant of the function (command-line option, pragmas, ...).

Edit: See http://msdn.microsoft.com/en-us/library/tzkfha43%28VS.80%29.aspx for an explanation of intrinsics on the Microsoft C compiler.
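
A hedged sketch of what that can look like (the function and buffers below are placeholders of mine, not from the answer; check your compiler's documentation for the exact switches):

#include <cstring>

#if defined(_MSC_VER)
#pragma intrinsic(memcpy)   // ask MSVC to expand memcpy inline instead of calling the CRT
#endif

void copy64(char (&dst)[64], const char (&src)[64]) {
#if defined(__GNUC__) || defined(__clang__)
  __builtin_memcpy(dst, src, sizeof(dst));   // GCC/Clang builtin, usually expanded inline
#else
  std::memcpy(dst, src, sizeof(dst));        // with /Oi or the pragma above, MSVC can inline this
#endif
}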

Endres answered 3/6, 2010 at 7:11 Comment(0)
B
4

Here is an alternative C version of memcpy that is inlineable, and I find it outperforms GCC's memcpy on Arm64 by about 50% in the application I used it for. It is 64-bit platform independent. The tail processing can be removed for a bit more speed if the usage instance does not need it. It copies uint32_t arrays; smaller datatypes are not tested but might work, and it might be adaptable for other datatypes. It does a 64-bit copy (two elements are copied simultaneously); 32-bit should also work, but more slowly. Credits to the Neoscrypt project.

/* Assumption: uint, ulong and uchar are the usual BSD/SysV typedefs
   (e.g. from <sys/types.h>); uint64_t, unsigned and unsigned char work too. */
static inline void newmemcpy(void *__restrict__ dstp,
                             void *__restrict__ srcp, uint len)
{
    ulong *dst = (ulong *) dstp;
    ulong *src = (ulong *) srcp;
    uint i, tail;

    /* Bulk copy, one machine word at a time (8 bytes on 64-bit platforms). */
    for(i = 0; i < (len / sizeof(ulong)); i++)
        *dst++ = *src++;
    /*
      Remove below if your application does not need it.
      If console application, you can uncomment the printf to test
      whether tail processing is being used.
    */
    tail = len & (sizeof(ulong) - 1);
    if(tail) {
        //printf("tailused\n");
        uchar *dstb = (uchar *) dstp;
        uchar *srcb = (uchar *) srcp;

        /* Copy the remaining (len % sizeof(ulong)) bytes one at a time. */
        for(i = len - tail; i < len; i++)
            dstb[i] = srcb[i];
    }
}
Broomfield answered 27/8, 2018 at 16:36 Comment(3)
this is slower on m1 macsStammel
"*dst++ = *src++;" this line has memory protection overhead. a block implementation by the system needs to check bounds only once.Phytosociology
Surprisingly, this generates a call to std::memcpy in MSVC.Hokusai
D
4

This question is 12 years old as I write yet another answer. But it still comes up in searches, and the answers are always evolving.

Surprised no one mentioned Agner Fog's asmlib yet.
A drop-in replacement for memcpy() plus many other SIMD-optimized C lib replacements like memmove(), memset(), strlen(), etc.
It will automatically use the best your CPU supports, up to the AVX-512 instruction set, and comes with prebuilt libs for several x86/AMD64 platforms.
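
A hedged usage sketch (mine, not from the answer); the header and function names below come from the asmlib distribution, but check the package you download for the exact library file to link against:

#include "asmlib.h"   // declares A_memcpy, A_memmove, A_memset, A_strlen, ...
#include <cstddef>

void copyBlock(void *dst, const void *src, size_t n) {
  A_memcpy(dst, src, n);   // dispatches at runtime to the best version the CPU supports
}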

Dilatant answered 15/2, 2022 at 6:34 Comment(0)
A
3

You should check the assembly code generated for your code. What you don't want is to have the memcpy call generate a call to the memcpy function in the standard library - what you want is a repeated ASM instruction that copies the largest amount of data at once - something like rep movsq.

How can you achieve this? Well, the compiler optimizes calls to memcpy by replacing them with simple movs as long as it knows how much data it should copy. You can see this if you write a memcpy with a well-determined (constexpr) size. If the compiler doesn't know the value, it will have to fall back to the byte-level implementation of memcpy - the issue being that memcpy has to respect one-byte granularity. It will still move 128 bits at a time, but after each 128 bits it will have to check whether it has enough data left to copy as 128 bits, or it has to fall back to 64 bits, then to 32 and 8 (I think 16 might be suboptimal anyway, but I don't know for sure).

So what you want is to be able to tell memcpy the size of your data with constant expressions that the compiler can optimize. This way no call to memcpy is performed. What you don't want is to pass memcpy a size that is only known at run time. That translates into a function call and tons of tests to pick the best copy instruction. Sometimes, a simple for loop is better than memcpy for this reason (eliminating one function call). And what you really, really don't want is to pass memcpy an odd number of bytes to copy.
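
To illustrate the point (my sketch, not from the answer): with a compile-time-constant size the compiler can expand the copy into plain moves, while a run-time size usually becomes a genuine library call.

#include <cstring>

struct Packet { char payload[64]; };

// Size known at compile time: typically lowered to a handful of mov/SIMD
// instructions, with no call to the library memcpy.
void copyPacket(Packet &dst, const Packet &src) {
  std::memcpy(&dst, &src, sizeof(Packet));
}

// Size known only at run time: usually an actual call to memcpy, which then
// has to pick a strategy for the given length and alignment.
void copyBytes(char *dst, const char *src, std::size_t n) {
  std::memcpy(dst, src, n);
}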

Apeman answered 24/12, 2015 at 9:39 Comment(0)
R
2

Check your compiler/platform manual. For some microprocessors and DSP kits, using memcpy is much slower than intrinsic functions or DMA operations.

Rabah answered 3/6, 2010 at 7:58 Comment(0)
F
2

If your platform supports it, look into whether you can use the mmap() system call to leave your data in the file... generally the OS can manage that better. And, as everyone has been saying, avoid copying if at all possible; pointers are your friend in cases like this.
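
A hedged POSIX sketch of that idea (mine, not from the answer; error handling is minimal and the path is a placeholder):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map a file read-only instead of read()ing it into a buffer and copying it around.
void *mapWholeFile(const char *path, size_t *outLen) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
  void *p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);                                  // the mapping stays valid after close()
  if (p == MAP_FAILED) return nullptr;
  *outLen = (size_t)st.st_size;
  return p;                                   // release later with munmap(p, *outLen)
}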

Fallon answered 3/6, 2010 at 8:16 Comment(0)
M
2

Here are some benchmarks for Visual C++ on a Ryzen 1700.

The benchmark copies 16 KiB (non-overlapping) chunks of data from a 128 MiB ring buffer 8*8192 times (in total, 1 GiB of data is copied).

I then normalize the result; here we present wall-clock time in milliseconds and a throughput value for 60 Hz (i.e. how much data this function can process over 16.667 milliseconds).

memcpy                           2.761 milliseconds ( 772.555 MiB/frame)

As you can see the builtin memcpy is fast, but how fast?

64-wide load/store              39.889 milliseconds (  427.853 MiB/frame)
32-wide load/store              33.765 milliseconds (  505.450 MiB/frame)
16-wide load/store              24.033 milliseconds (  710.129 MiB/frame)
 8-wide load/store              23.962 milliseconds (  712.245 MiB/frame)
 4-wide load/store              22.965 milliseconds (  743.176 MiB/frame)
 2-wide load/store              22.573 milliseconds (  756.072 MiB/frame)
 1-wide load/store              35.032 milliseconds (  487.169 MiB/frame)

The above is just the code below with variations of n.

// n is the "wideness" from the benchmark; it needs to be a compile-time
// constant (e.g. a template parameter) so that temp[] has a fixed size

auto src = (__m128i*)get_src_chunk();
auto dst = (__m128i*)get_dst_chunk();

for (int32_t i = 0; i < (16 * 1024) / (16 * n); i += n) {
  __m128i temp[n];

  // load n 16-byte vectors from the source chunk...
  for (int32_t j = 0; j < n; j++) {
    temp[j] = _mm_loadu_si128(src++);
  }

  // ...then store them to the destination chunk
  for (int32_t j = 0; j < n; j++) {
    _mm_store_si128(dst++, temp[j]);
  }
}

These are my best guesses for the results that I have. Based on what I know about the Zen microarchitecture it can only fetch 32 bytes per cycle. That's why we max out at 2x 16-byte load/store.

  • The 1-wide variant loads the bytes into xmm0, 128 bits at a time
  • The 2-wide variant loads the bytes into ymm0, 256 bits at a time

And that's why it is about twice as fast, and internally exactly what memcpy does (or what it should be doing if you enable the right optimizations for your platform).

It is also impossible to make this faster since we are now limited by the cache bandwidth, which doesn't go any faster. I think this is quite an important fact to point out, because if you are memory bound and looking for a faster solution, you will be looking for a very long time.

Magnuson answered 5/8, 2020 at 18:37 Comment(0)
T
1

I assume you must have huge areas of memory that you want to copy around, if the performance of memcpy has become an issue for you?

In this case, I'd agree with nos's suggestion to figure out some way NOT to copy stuff.

Instead of having one huge blob of memory to be copied around whenever you need to change it, you should probably try some alternative data structures.

Without really knowing anything about your problem area, I would suggest taking a good look at persistent data structures and either implementing one of your own or reusing an existing implementation.

Tabulate answered 3/6, 2010 at 7:32 Comment(0)
T
1

This function could cause a data abort exception if one of the pointers (input arguments) is not aligned to 32 bits.

Tubercular answered 20/10, 2018 at 18:58 Comment(0)
P
0

You may want to have a look at this:

http://www.danielvik.com/2010/02/fast-memcpy-in-c.html

Another idea I would try is to use COW techniques to duplicate the memory block and let the OS handle the copying on demand as soon as the page is written to. There are some hints here using mmap(): Can I do a copy-on-write memcpy in Linux?

Prajna answered 3/6, 2010 at 9:29 Comment(0)
C
0

Memory-to-memory copy is usually supported in the CPU's instruction set, and memcpy will usually use that. This is usually the fastest way.

You should check what exactly your CPU is doing. On Linux, watch swap-in and swap-out and virtual memory effectiveness with sar -B 1 or vmstat 1, or by looking in /proc/meminfo. You may see that your copy has to push out a lot of pages to free space, or read them in, etc.

That would mean your problem isn't in what you use for the copy, but how your system uses memory. You may need to decrease file cache or start writing out earlier, or lock the pages in memory, etc.

Cyanine answered 24/8, 2010 at 8:55 Comment(0)
