Does rewriting memcpy/memcmp/... with SIMD instructions make sense?

Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software?

If so, why doesn't GCC generate SIMD instructions for these library functions by default?

Also, are there any other functions that could be improved with SIMD?

Scorn answered 16/3, 2011 at 5:21 Comment(3)
It depends on what OS and compiler libraries you are using. E.g. Mac OS X already has SIMD-optimised memcpy et al. Also, Intel's ICC generates inline memcpy code which is faster than anything you are likely to be able to implement in a library. – Maximamaximal
@Paul: memcpy is actually the worst case for an SSE intrinsic, because SSE can't be used for the edge cases. Do those compilers emit SIMD code for strlen and memchr? – Huihuie
@Ben: I just checked with ICC 12 – memcpy and strlen both emit inline SSE code; strchr is a library function which appears to just be straight scalar code. – Maximamaximal

Yes, these functions are much faster with SSE instructions. It would be nice if your runtime library/compiler intrinsics included optimized versions, but that doesn't seem to be pervasive.

I have a custom SIMD memchr which is a hell of a lot faster than the library version, especially when I'm searching for the first of 2 or 3 characters (for example, to find out whether there's an equation in a line of text, I search for the first of =, \n, \r). A sketch of the idea is shown below.
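
A minimal sketch of that multi-character search with SSE2 intrinsics (find_first_of3 is a hypothetical helper, not my production code; the scalar head/tail handling a real version needs is omitted):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    /* Return the index of the first '=', '\n' or '\r' in buf, or len if
       none is found. For brevity, assumes len is a multiple of 16. */
    static size_t find_first_of3(const char *buf, size_t len)
    {
        const __m128i eq = _mm_set1_epi8('=');
        const __m128i nl = _mm_set1_epi8('\n');
        const __m128i cr = _mm_set1_epi8('\r');

        for (size_t i = 0; i < len; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
            __m128i hit   = _mm_or_si128(
                                _mm_or_si128(_mm_cmpeq_epi8(chunk, eq),
                                             _mm_cmpeq_epi8(chunk, nl)),
                                _mm_cmpeq_epi8(chunk, cr));
            int mask = _mm_movemask_epi8(hit);        /* one bit per byte */
            if (mask)
                return i + (size_t)__builtin_ctz(mask);  /* GCC/Clang builtin */
        }
        return len;
    }

Three byte-compares and two ORs per 16-byte chunk, versus a byte-at-a-time loop with three comparisons per character in the scalar version.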

On the other hand, the library functions are well tested, so it's only worth writing your own if you call them a lot and a profiler shows they're a significant fraction of your CPU time.

Huihuie answered 16/3, 2011 at 5:31 Comment(6)
A SIMD memcpy will normally only be faster for copies where source and/or dest are already in cache, since almost any half-decent memcpy should be able to saturate the available DRAM bandwidth. – Maximamaximal
@Paul: SIMD is always better. If it's not strictly faster because memory access can't keep up, the core is freed up for hyperthreading, power saving, or speculative out-of-order execution. As Crashworks said, SSE will also fetch data into cache faster, because of prefetch hinting. Without SSE, the CPU may have to alternate between fetching data and doing the copy; with SSE, both occur in parallel. – Huihuie
In the case of memcpy et al there isn't anything else going on in the execution thread, so no benefit there. If your core is stalled waiting for a DRAM access there's not much you can do – DRAM latency can be on the order of 200 clocks, which is a lot of instruction cycles with nothing to do. – Maximamaximal
@Paul: (1) Not all memcpy calls are for thousands of bytes; you may easily have a memcpy of ~20 bytes inside a loop with other processing. (2) Modern CPU cores aren't limited to processing instructions from a single thread, hence my mention of hyperthreading. (3) DRAM latency matters less when read prefetches are pipelined; only throughput does. (4) Even if DRAM throughput is hobbling the code, it's still better to perform the copy efficiently, because the CPU finishes the same work in the same time with less power consumption (for example, via a dynamically lowered clock frequency). – Huihuie
What craptastic library are you using that doesn't have a good SIMD memchr? Glibc has hand-written asm versions of memchr / strchr / memmove and so on for i386 and x86-64 (and most other ISAs) that are excellent for large buffers, and many have good small-buffer strategies too (with runtime dispatching via dynamic-linker symbol resolution, so it can use AVX2 on compatible CPUs even in binaries compiled without -mavx2). The main thing you could gain is if you know your buffer is aligned and/or at least 16 bytes long, so you can avoid branching to pick a strategy. – Watchword
e.g. code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/… is glibc's memchr, which does vpcmpeqb on 4 vectors, then vpors them all together to save on vpmovmskb + test uops, with a loop branch once per 2 cache lines. A sketch of that trick follows this comment. – Watchword
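
For illustration only, a rough sketch of that OR-reduction trick with AVX2 intrinsics (memchr_avx2_sketch is hypothetical, not glibc's actual code; alignment setup and tail handling are omitted):

    #include <immintrin.h>  /* AVX2 intrinsics */
    #include <stddef.h>

    /* Scan 128 bytes (4 x 32-byte vectors) per iteration; OR the four
       compare results so only one movemask + branch is needed per
       iteration. Assumes p is 32-byte aligned and n is a multiple of 128. */
    static const char *memchr_avx2_sketch(const char *p, int c, size_t n)
    {
        const __m256i needle = _mm256_set1_epi8((char)c);
        for (size_t i = 0; i < n; i += 128) {
            __m256i v0 = _mm256_cmpeq_epi8(
                _mm256_load_si256((const __m256i *)(p + i)),      needle);
            __m256i v1 = _mm256_cmpeq_epi8(
                _mm256_load_si256((const __m256i *)(p + i + 32)), needle);
            __m256i v2 = _mm256_cmpeq_epi8(
                _mm256_load_si256((const __m256i *)(p + i + 64)), needle);
            __m256i v3 = _mm256_cmpeq_epi8(
                _mm256_load_si256((const __m256i *)(p + i + 96)), needle);
            __m256i any = _mm256_or_si256(_mm256_or_si256(v0, v1),
                                          _mm256_or_si256(v2, v3));
            if (_mm256_movemask_epi8(any)) {
                /* A hit somewhere in these 128 bytes: rescan to locate it. */
                for (size_t j = i; j < i + 128; j++)
                    if (p[j] == (char)c)
                        return p + j;
            }
        }
        return NULL;
    }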

It does not make sense. Your compiler ought to be emitting these instructions implicitly for memcpy/memcmp/similar intrinsics, if it is able to emit SIMD at all.

You may need to explicitly instruct GCC to emit SSE opcodes with e.g. -msse -msse2; some GCC builds do not enable them by default. Also, if you do not tell GCC to optimize (i.e. -O2), it won't even try to emit fast code.

The use of SIMD opcodes for memory work like this can have a massive performance impact, because they also enable cache prefetch and non-temporal (streaming) store hints that are important for optimizing bus access. But that doesn't mean you need to emit them manually; even though most compilers stink at emitting SIMD ops in general, every one I've used at least handles them for the basic CRT memory functions. A sketch of the kind of code involved is shown below.
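
For reference, this is roughly what such prefetch + streaming-store code looks like (an illustrative stream_copy sketch, not what any particular compiler emits; alignment setup and tail handling are omitted):

    #include <emmintrin.h>  /* SSE2: streaming stores and prefetch */
    #include <stddef.h>

    /* Copy n bytes with software prefetch and non-temporal stores that
       bypass the cache. Assumes dst is 16-byte aligned and n is a
       multiple of 64; prefetching past the end of src is harmless
       because prefetch is only a hint and cannot fault. */
    static void stream_copy(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i < n; i += 64) {
            _mm_prefetch(src + i + 256, _MM_HINT_NTA);  /* fetch ahead */
            __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
            __m128i b = _mm_loadu_si128((const __m128i *)(src + i + 16));
            __m128i c = _mm_loadu_si128((const __m128i *)(src + i + 32));
            __m128i d = _mm_loadu_si128((const __m128i *)(src + i + 48));
            _mm_stream_si128((__m128i *)(dst + i),      a);
            _mm_stream_si128((__m128i *)(dst + i + 16), b);
            _mm_stream_si128((__m128i *)(dst + i + 32), c);
            _mm_stream_si128((__m128i *)(dst + i + 48), d);
        }
        _mm_sfence();  /* order the streaming stores before later stores */
    }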

Basic math functions also benefit a lot from setting the compiler to SSE mode. You can easily get an 8x speedup on a basic sqrt() just by telling the compiler to use the SSE opcode instead of the terrible old x87 FPU; a quick way to see the difference is shown below.
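
A quick way to check what the compiler emits (assuming 32-bit x86, where older GCC defaults to x87 math; the file name and flags here are a suggestion, not from the answer):

    /* float_sqrt.c */
    #include <math.h>

    float root(float x)
    {
        return sqrtf(x);
    }

    /* Inspect the generated assembly:
     *   gcc -O2 -m32 -fno-math-errno -S float_sqrt.c
     *       -> fsqrt  (x87)
     *   gcc -O2 -m32 -fno-math-errno -msse -mfpmath=sse -S float_sqrt.c
     *       -> sqrtss (SSE scalar single-precision square root)
     * -fno-math-errno lets GCC inline sqrtf without an errno check.
     */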

Zsazsa answered 16/3, 2011 at 5:34 Comment(2)
Agreed that memcpy is the most likely to be properly optimized. A lot of other functions from <string.h> and <memory.h> also benefit immensely and aren't widely optimized by the compiler. – Huihuie
@BenVoigt: GCC doesn't always inline good versions of library functions, but good libraries have good hand-written asm. E.g. "Why is this code 6.5x slower with optimizations enabled?" shows a case where GCC inlines a very bad repne scasb strlen at -O1, or a complex 32-bit-at-a-time bithack at -O2 which doesn't take any advantage of SSE2. The program depends entirely on strlen performance for huge buffers, so it's a big win for it to call glibc's optimized version. There's a big difference between library and inline. – Watchword

It probably doesn't matter. The CPU is much faster than the memory bus, and the implementations of memcpy etc. provided by the compiler's runtime library are probably good enough. In "large-scale" software your performance is not going to be dominated by copying memory anyway (it's probably dominated by I/O).

To get a real step up in memory copying performance, some systems have a specialised implementation of DMA that can be used to copy from memory to memory. If a substantial performance increase is needed, hardware is the way to get it.

Potsherd answered 16/3, 2011 at 5:32 Comment(1)
That largely depends on whether you're using a horribly slow I/O API like C++ iostreams. It's hard to perform any non-trivial processing at the speed the OS can deliver I/O. Besides, SIMD is faster for a variety of reasons, especially on smaller blocks where the setup of a DMA engine would be prohibitively expensive. For one thing, SSE uses a different set of CPU registers, so your working variables stay enregistered and don't get spilled to cache. – Huihuie

I recommend looking at the DPDK memcpy implementation, which uses SIMD instructions to achieve high throughput:

https://git.dpdk.org/dpdk/tree/lib/eal/x86/include/rte_memcpy.h
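
A minimal usage sketch (the wrapper function is hypothetical; assumes a DPDK build environment, since rte_memcpy.h needs DPDK's include paths):

    #include <rte_memcpy.h>  /* DPDK header */
    #include <stddef.h>
    #include <stdint.h>

    void copy_burst(uint8_t *dst, const uint8_t *src, size_t len)
    {
        /* Same signature and semantics as memcpy; DPDK selects an
           SSE/AVX/AVX-512 path based on how the library was built. */
        rte_memcpy(dst, src, len);
    }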

Intel claims 22% better performance for SIMD memcpy in Open vSwitch (OvS-DPDK) than for ordinary memcpy.

From the Intel webpage: [chart: Performance comparison between DPDK rte_memcpy and glibc memcpy in OvS-DPDK]

Fredel answered 26/5, 2022 at 16:24 Comment(0)

On x86 hardware it should not matter much, thanks to out-of-order processing. The processor will achieve the necessary ILP and try to issue the maximum number of load/store operations per cycle for memcpy, whether it is implemented with a SIMD or a scalar instruction set.

Cavein answered 19/4, 2011 at 23:16 Comment(0)
