Very fast memcpy for image processing?

I am doing image processing in C that requires copying large chunks of data around memory - the source and destination never overlap.

What is the absolute fastest way to do this on the x86 platform using GCC (where SSE, SSE2 but NOT SSE3 are available)?

I expect the solution will either be in assembly or use GCC intrinsics.

I found the following link but have no idea whether it's the best way to go about it (the author also says it has a few bugs): http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2006-02/msg00123.html

EDIT: note that a copy is necessary, I cannot get around having to copy the data (I could explain why but I'll spare you the explanation :))

Barmecide answered 11/11, 2009 at 13:40 Comment(7)
can you write your code so the copy isn't required in the first place?Buffalo
If you can get a hold of the Intel compiler you might have better chances of the optimizer converting it into vector CPU instructionsCardoza
Take a look at this: software.intel.com/en-us/articles/memcpy-performanceCardoza
Do you know by how much your compiler's memcpy() is too slow? Can you specify what processor the code will run on? And what OS?Grass
I suppose that you realise that keeping the memory blocks 16-byte aligned will help. Or, if they are not 16-byte aligned, then handle the first few and last few bytes as a special case, and copy the rest of the block on 16-byte aligned boundaries (see the sketch below).Rosamariarosamond
Also, read Intel's advice on fast memcpy with GCC software.intel.com/en-us/articles/memcpy-performanceRosamariarosamond
I don't know what's best for you but in regards to memcpy there are faster versions. Try Agner Fog's asmlib (google it). It has assembly optimized functions such as A_memcpy and A_memmove which should be faster than memcpyBrunette
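
A quick sketch of the alignment idea from the comments above: align the destination first (SSE2 non-temporal stores require 16-byte alignment), copy the odd head and tail bytes separately, and hand the aligned middle to a fast routine. Plain memcpy stands in here for a hypothetical fast_copy_aligned; the source may still be unaligned, so a real fast path would use unaligned loads for it.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes; dest and src may be arbitrarily aligned. */
void copy_with_alignment(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    size_t head = (16 - ((uintptr_t)d & 15)) & 15; /* bytes until d is 16-byte aligned */
    if (head > n)
        head = n;
    memcpy(d, s, head);                            /* unaligned head */

    size_t mid = (n - head) & ~(size_t)15;         /* largest 16-byte multiple */
    memcpy(d + head, s + head, mid);               /* stand-in for fast_copy_aligned */

    memcpy(d + head + mid, s + head + mid, n - head - mid); /* unaligned tail */
}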

Courtesy of William Chan and Google. 30-70% faster than memcpy in Microsoft Visual Studio 2005.

void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size)
{
  // Assumes dest and src are 16-byte aligned and size is a non-zero
  // multiple of 128 bytes; small copies and any unaligned tail are not handled.
  __asm
  {
    mov esi, src;    //src pointer
    mov edi, dest;   //dest pointer

    mov ebx, size;   //ebx is our counter
    shr ebx, 7;      //divide by 128 (8 registers * 16 bytes each)

    loop_copy:
      prefetchnta 128[ESI]; //prefetch upcoming data
      prefetchnta 160[ESI];
      prefetchnta 192[ESI];
      prefetchnta 224[ESI];

      movdqa xmm0, 0[ESI]; //move data from src to registers
      movdqa xmm1, 16[ESI];
      movdqa xmm2, 32[ESI];
      movdqa xmm3, 48[ESI];
      movdqa xmm4, 64[ESI];
      movdqa xmm5, 80[ESI];
      movdqa xmm6, 96[ESI];
      movdqa xmm7, 112[ESI];

      movntdq 0[EDI], xmm0; //non-temporal stores from registers to dest
      movntdq 16[EDI], xmm1;
      movntdq 32[EDI], xmm2;
      movntdq 48[EDI], xmm3;
      movntdq 64[EDI], xmm4;
      movntdq 80[EDI], xmm5;
      movntdq 96[EDI], xmm6;
      movntdq 112[EDI], xmm7;

      add esi, 128;
      add edi, 128;
      dec ebx;

      jnz loop_copy; //loop please
    loop_copy_end:

    sfence;          //order the non-temporal stores before returning
  }
}

You may be able to optimize it further depending on your exact situation and any assumptions you are able to make.

You may also want to check out the memcpy source (memcpy.asm) and strip out its special case handling. It may be possible to optimise further!
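
Since the question asks about GCC, here is a sketch of the same approach using SSE2 compiler intrinsics, which both GCC and MSVC understand (same assumptions as above: both pointers 16-byte aligned, size a non-zero multiple of 128; the function name is made up):

#include <emmintrin.h> /* SSE2 intrinsics */

void X_aligned_memcpy_sse2_intrinsics(void* dest, const void* src,
                                      unsigned long size)
{
    char* d = (char*)dest;
    const char* s = (const char*)src;

    for (unsigned long i = 0; i < size; i += 128) {
        /* prefetch the next 128-byte block (harmless past the end) */
        _mm_prefetch(s + i + 128, _MM_HINT_NTA);

        __m128i x0 = _mm_load_si128((const __m128i*)(s + i));
        __m128i x1 = _mm_load_si128((const __m128i*)(s + i + 16));
        __m128i x2 = _mm_load_si128((const __m128i*)(s + i + 32));
        __m128i x3 = _mm_load_si128((const __m128i*)(s + i + 48));
        __m128i x4 = _mm_load_si128((const __m128i*)(s + i + 64));
        __m128i x5 = _mm_load_si128((const __m128i*)(s + i + 80));
        __m128i x6 = _mm_load_si128((const __m128i*)(s + i + 96));
        __m128i x7 = _mm_load_si128((const __m128i*)(s + i + 112));

        _mm_stream_si128((__m128i*)(d + i),       x0); /* non-temporal stores */
        _mm_stream_si128((__m128i*)(d + i + 16),  x1);
        _mm_stream_si128((__m128i*)(d + i + 32),  x2);
        _mm_stream_si128((__m128i*)(d + i + 48),  x3);
        _mm_stream_si128((__m128i*)(d + i + 64),  x4);
        _mm_stream_si128((__m128i*)(d + i + 80),  x5);
        _mm_stream_si128((__m128i*)(d + i + 96),  x6);
        _mm_stream_si128((__m128i*)(d + i + 112), x7);
    }
    _mm_sfence(); /* make the non-temporal stores globally visible */
}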

Pesky answered 11/11, 2009 at 14:8 Comment(3)
Note: the performance of this memcpy will be wildly dependent on the quantity of data to copy and on the cache size. For instance, prefetches and non-temporal moves may bog down performance for smaller copies (ones fitting into L2) compared to regular movdqa's.Psychiatry
banister: don't forget to mail him that you used his code in your project ;) [ williamchan.ca/portfolio/assembly/ssememcpy/source/… ]Cosentino
I remember first reading this code in an AMD64 manual. And the code isn't optimal on Intel, where it has cache bank aliasing issues.Calvillo

The SSE code posted by hapalibashi is the way to go.

If you need even more performance and don't shy away from the long and winding road of writing a device driver: all major platforms nowadays have a DMA controller that is capable of doing a copy job faster than, and in parallel with, anything CPU code could do.

That involves writing a driver, though. No big OS that I'm aware of exposes this functionality to user space, because of the security risks.

However, it may be worth it (if you need the performance) since no code on earth could outperform a piece of hardware that is designed to do such a job.

Chabot answered 12/11, 2009 at 17:31 Comment(1)
I've just posted an answer that talks about the bandwidth of RAM. If what I say is true, then I don't think the DMA engine could achieve much beyond what the CPU can achieve. Have I missed something?Flaunch

This question is four years old now and I'm a little surprised nobody has mentioned memory bandwidth yet. CPU-Z reports that my machine has PC3-10700 RAM, meaning the RAM has a peak bandwidth (aka transfer rate, throughput, etc.) of 10700 MBytes/sec. The CPU in my machine is an i5-2430M, with a peak turbo frequency of 3 GHz.

Theoretically, with an infinitely fast CPU and my RAM, memcpy could go at 5300 MBytes/sec, ie half of 10700 because memcpy has to read from and then write to RAM. (edit: As v.oddou pointed out, this is a simplistic approximation).

On the other hand, imagine we had infinitely fast RAM and a realistic CPU, what could we achieve? Let's use my 3 GHz CPU as an example. If it could do a 32-bit read and a 32-bit write each cycle, then it could transfer 3e9 * 4 = 12000 MBytes/sec. This seems easily within reach for a modern CPU. Already, we can see that the code running on the CPU isn't really the bottleneck. This is one of the reasons that modern machines have data caches.

We can measure what the CPU can really do by benchmarking memcpy when we know the data is cached. Doing this accurately is fiddly. I made a simple app that wrote random numbers into an array, memcpy'd them to another array, then checksummed the copied data. I stepped through the code in the debugger to make sure that the clever compiler had not removed the copy. Altering the size of the array alters the cache performance - small arrays fit in the cache, big ones less so. I got the following results:

  • 40 KByte arrays: 16000 MBytes/sec
  • 400 KByte arrays: 11000 MBytes/sec
  • 4000 KByte arrays: 3100 MBytes/sec

Obviously, my CPU can read and write more than 32 bits per cycle, since 16000 is more than the 12000 I calculated theoretically above. This means the CPU is even less of a bottleneck than I already thought. I used Visual Studio 2005, and stepping into the standard memcpy implementation, I can see that it uses the movdqa instruction on my machine. I guess this can read and write 64 bits per cycle.

The nice code hapalibashi posted achieves 4200 MBytes/sec on my machine - about 40% faster than the VS 2005 implementation. I guess it is faster because it uses the prefetch instruction to improve cache performance.

In summary, the code running on the CPU isn't the bottleneck and tuning that code will only make small improvements.
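
For reference, here is a minimal sketch of the kind of benchmark described above (timing and the checksum are deliberately crude; sizes and names are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t size = 400 * 1024; /* vary this to probe the cache levels */
    int iters = 1000;
    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (!src || !dst)
        return 1;

    for (size_t i = 0; i < size; i++)
        src[i] = (unsigned char)rand(); /* random data, as described */

    clock_t t0 = clock();
    unsigned long sum = 0;
    for (int it = 0; it < iters; it++) {
        memcpy(dst, src, size);
        sum += dst[it % size]; /* touch the result so the copy can't be elided */
    }
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double mbps = ((double)size * iters) / (1024.0 * 1024.0) / secs;
    printf("%lu bytes: %.0f MBytes/sec (checksum %lu)\n",
           (unsigned long)size, mbps, sum);
    free(src);
    free(dst);
    return 0;
}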

Flaunch answered 15/8, 2013 at 11:0 Comment(1)
Your thinking process is good. However, you're not taking into account that RAM's marketing numbers are quad-pumped figures, which don't correspond to the speed of a single channel. And it is also the speed before the bus; there are management overheads too in the NUMA model that Core i7s/Opterons have.Erida

At any optimisation level of -O1 or above, GCC will use builtin definitions for functions like memcpy - with the right -march parameter (-march=pentium4 for the set of features you mention) it should generate pretty optimal architecture-specific inline code.

I'd benchmark it and see what comes out.
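
For example, with something like `gcc -O2 -march=pentium4 -c row.c` (file name illustrative), GCC can expand a fixed-size call like this inline into architecture-specific move code instead of calling the library memcpy:

#include <string.h>

void copy_row(unsigned char* dst, const unsigned char* src)
{
    memcpy(dst, src, 4096); /* size known at compile time helps inlining */
}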

Soccer answered 11/11, 2009 at 21:54 Comment(0)

If it's specific to Intel processors, you might benefit from IPP. If you know it will run on an Nvidia GPU, perhaps you could use CUDA. In both cases it may be better to look wider than optimising memcpy(): they provide opportunities for improving your algorithm at a higher level. Both are, however, reliant on specific hardware.
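
As a sketch of the IPP route (ippsCopy_8u from the signal-processing API; error handling omitted):

#include <ipps.h> /* Intel IPP signal-processing header */

/* IPP dispatches to an implementation optimized for the CPU it
   detects at run time. */
void ipp_copy(const unsigned char* src, unsigned char* dst, int len)
{
    ippsCopy_8u((const Ipp8u*)src, (Ipp8u*)dst, len);
}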

Grass answered 11/11, 2009 at 14:10 Comment(0)

If you're on Windows, use the DirectX APIs, which have specific GPU-optimized routines for graphics handling (how fast could it be? Your CPU isn't loaded. Do something else while the GPU munches it).

If you want to be OS agnostic, try OpenGL.

Do not fiddle with assembler, because it is all too likely that you'll fail miserably to outperform library-making software engineers with ten-plus years of proficiency.

Shutout answered 11/11, 2009 at 14:0 Comment(3)
I need it to be performed in MEMORY, that is, it cannot happen on the GPU. :) Also, I don't intend to outperform the library functions myself (hence why I ask the question here), but I'm sure there is somebody on Stack Overflow who can outperform the libs :) Further, library writers are typically restricted by portability requirements - as I stated, I only care about the x86 platform, so perhaps further x86-specific optimizations are possible.Barmecide
+1 since it's good first advice to be given - even though it does not apply in banister's case.Terbecki
I'm not sure it is good advice. A typical modern machine has about the same memory bandwidth for the CPU and GPU. For example, many popular laptops use Intel HD graphics, which uses the same RAM as the CPU. The CPU can already saturate the memory bus. For memcpy, I'd expect similar performance on the CPU or GPU.Flaunch

Old question but two things nobody has pointed out so far:

  1. Most compilers have their own version of memcpy; since memcpy is well defined and also part of the C standard, compilers don't have to use the implementation that ships with the system libraries - they are free to use their own. As the question mentions "intrinsics": most of the time you write memcpy in your code, you are in fact using a compiler intrinsic function, as that's what the compiler will use internally instead of making a real call to memcpy; it can then even inline the copy and thus eliminate any function-call overhead (see the sketch after this list).

  2. Most memcpy implementations I know of already use things like SSE2 internally when available - at least the good ones do. The Visual Studio 2005 one may not have, but GCC has been doing so for ages. Of course, what they use depends on the build settings. They will only use instructions available on all CPUs the code is meant to run on, so be sure to set the architecture correctly (e.g. -march and -mtune), as well as other flags (e.g. enabling support for optional instruction sets). All of that influences what code the compiler generates for memcpy in the final binary.
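
A small illustration of point 1 (compiler flags are examples):

#include <string.h>

/* With `gcc -O2`, this small fixed-size memcpy is typically expanded
   inline by the compiler; with `gcc -O2 -fno-builtin-memcpy` it
   becomes a real call into the C library instead. */
void copy_pixel_block(unsigned char* dst, const unsigned char* src)
{
    memcpy(dst, src, 64);
}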

So as always, don't assume you can outsmart the compiler or the system (which may ship different memcpy implementations for different CPUs as well) - benchmark to prove it! Unless a benchmark shows that your handwritten code is actually faster in real life, leave it to the compiler and the system: they will adapt to new CPUs, and the system may get updates that automatically make your code run faster in the future, whereas you have to re-optimize handwritten code all by yourself, and it will never get any faster unless you ship an update yourself.

Profundity answered 2/6, 2023 at 0:50 Comment(1)
Even better, GCC doesn't inline memcpy for unknown or large sizes, so it calls the libc function. On Linux for example, glibc's memcpy implementation uses dynamic linker hooks to resolve the symbol to the most optimal one for the current system, based on CPU detection at dynamic link time. e.g. memmove_avx_unaligned_erms on systems with fast 256-bit unaligned vector load/store (like Haswell and later). codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/…Tortfeasor

If you have access to a DMA engine, nothing will be faster.

Counterglow answered 18/11, 2020 at 21:27 Comment(2)
Can you point out any specific DMA engine that might be found in a modern x86 system that can copy memory faster than a CPU core can using SSE or AVX? PCIe 3.0 with an x16 link is only capable of 15.75 GB/s, vs. dual-channel DDR4 2133 MT/s (e.g. a Skylake CPU from 2015) giving a theoretical bandwidth of 34GB/s. So any such DMA engine would need to be attached to the CPU more closely than that. Note that the memory controllers are built-in to the CPU, so any off-chip DMA engine has to get to memory via the CPU, on modern x86.Tortfeasor
A single core of an Intel desktop/laptop chip can come close to saturating DRAM bandwidth (unlike a many-core Xeon). Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? / Enhanced REP MOVSB for memcpyTortfeasor
