Hint to compiler that it can use aligned memcpy

Asked 10/11, 2017 at 22:1 Answered 10/11, 2017 at 22:15

Solved c glibc memcpy memory-alignment avx

I have a struct consisting of seven 256-bit vector values (six __m256 and one __m256i), which is stored 32-byte aligned in memory.

typedef struct
{
        __m256 xl,xh;
        __m256 yl,yh;
        __m256 zl,zh;
        __m256i co;
} bloxset8_t;

I achieve the 32-byte alignment by using posix_memalign() for dynamically allocated data and __attribute__((aligned(32))) for statically allocated data.
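For reference, a minimal sketch of the two allocation paths described above. The helper names bloxset8_alloc and is_aligned32 and the variable static_set are hypothetical, introduced only for illustration; they are not part of the original code.

```c
#define _POSIX_C_SOURCE 200112L /* exposes posix_memalign() in <stdlib.h> */
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    __m256 xl, xh;
    __m256 yl, yh;
    __m256 zl, zh;
    __m256i co;
} bloxset8_t;

/* Statically allocated instance with the alignment spelled out explicitly. */
static bloxset8_t static_set __attribute__((aligned(32)));

/* Hypothetical helper: nonzero if p lies on a 32-byte boundary. */
static int is_aligned32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}

/* Hypothetical helper: allocate `count` structs on a 32-byte boundary.
 * Returns NULL on failure; release the result with free(). */
static bloxset8_t *bloxset8_alloc(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, count * sizeof(bloxset8_t)) != 0)
        return NULL;
    assert(is_aligned32(p)); /* sanity check only, compiled out with NDEBUG */
    return p;
}
```

Since the struct is exactly seven 32-byte members, every element of an array allocated this way starts on a 32-byte boundary as well.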

The alignment is fine, but when I pass two pointers to such structs as the destination and source of memcpy(), the copy ends up being done by __memcpy_avx_unaligned().

How can I force clang to use the aligned AVX memcpy function instead, which I assume is the faster variant?

OS: Ubuntu 16.04.3 LTS, Clang: 3.8.0-2ubuntu4.

UPDATE
__memcpy_avx_unaligned() is invoked only when copying two or more structs. When copying just one, clang emits 14 vmovups instructions.

Lyre answered 10/11, 2017 at 22:1 Comment(4)

Untested, but worth a try: I think I've seen this done before by adding an assert() before the memcpy that asserts that the address is 32-byte aligned. Some compilers can take these hints and use them for optimization. – Semen 10/11, 2017 at 22:8

I could not reproduce this with Clang 3.9 (I get a bunch of vmovaps). Unfortunately, I can't try 3.8. – Scribe 10/11, 2017 at 22:17

@harold __memcpy_avx_unaligned() is used if you copy two or more structs in one go. One struct is indeed copied with move instructions, which in my case are unaligned: vmovups (and it uses 14 of them). – Lyre 10/11, 2017 at 22:28

I think for static / automatic storage, you're already fine for alignment. __m256 implies 32B alignment already. But yes, you should use aligned_alloc or posix_memalign for dynamic allocation. – Lumberjack 10/11, 2017 at 23:14

__memcpy_avx_unaligned is just an internal glibc function name. It does not mean that there is a faster __memcpy_avx_aligned function. The name just conveys a hint to the glibc developers about how this memcpy variant is implemented.

The other question is whether it would be faster for the C compiler to emit an inline expansion of memcpy, using one 256-bit AVX load and one store per struct member (seven of each for this struct). The code for that would be larger than the memcpy call, but it might still be faster overall. It may be possible to help the compiler do this with the __builtin_assume_aligned builtin.
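A minimal sketch of that hint, assuming GCC or Clang (both provide __builtin_assume_aligned). The wrapper name bloxset8_copy is hypothetical. Whether the compiler then emits inline aligned moves or still calls memcpy depends on the compiler version, the optimization level, and the copy size, so treat this as something to try and then check in the generated assembly, not as a guarantee.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <string.h>

typedef struct
{
    __m256 xl, xh;
    __m256 yl, yh;
    __m256 zl, zh;
    __m256i co;
} bloxset8_t;

/* Hypothetical wrapper: copy `count` structs while promising the compiler
 * that both pointers are 32-byte aligned. Behavior is undefined if the
 * promise is false. */
static void bloxset8_copy(bloxset8_t *dst, const bloxset8_t *src, size_t count)
{
    bloxset8_t *d = __builtin_assume_aligned(dst, 32);
    const bloxset8_t *s = __builtin_assume_aligned(src, 32);

    /* The compiler may expand this inline or leave it as a library call. */
    memcpy(d, s, count * sizeof(bloxset8_t));
}
```

Compiling with -O2 -mavx and inspecting the output (for example with -S) shows whether vmovaps is used for a given count.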

Forging answered 10/11, 2017 at 22:15 Comment(1)

Near duplicate of "perf report shows this function '__memset_avx2_unaligned_erms' has overhead. does this mean memory is unaligned?", or at least related. I went into more detail there about how that specific glibc memset strategy works. – Lumberjack 31/7, 2018 at 23:1
