I have a struct consisting of seven 256-bit vector values (six __m256 and one __m256i), which is stored 32-byte aligned in memory.
typedef struct
{
    __m256 xl, xh;
    __m256 yl, yh;
    __m256 zl, zh;
    __m256i co;
} bloxset8_t;
I achieve the 32-byte alignment by using the posix_memalign() function for dynamically allocated data, or the (aligned(32)) attribute for statically allocated data.
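For concreteness, the two allocation paths described above might look like the following minimal sketch. The helper names bloxset8_alloc() and is_aligned32() and the variable static_set are made up for illustration and are not part of the original code.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    __m256 xl, xh;
    __m256 yl, yh;
    __m256 zl, zh;
    __m256i co;
} bloxset8_t;

/* Seven 32-byte vectors, so 224 bytes with no padding. */
static_assert(sizeof(bloxset8_t) == 7 * 32, "unexpected struct size");

/* Statically allocated data with an explicit 32-byte alignment request. */
static bloxset8_t static_set __attribute__((aligned(32)));

/* Hypothetical helper: allocate n structs on a 32-byte boundary.
   Returns NULL on failure; release the memory with free(). */
static bloxset8_t *bloxset8_alloc(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, n * sizeof(bloxset8_t)) != 0)
        return NULL;
    return (bloxset8_t *)p;
}

/* Hypothetical helper: returns 1 if p lies on a 32-byte boundary. */
static int is_aligned32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}
```

Calling is_aligned32() on both &static_set and the pointer returned by bloxset8_alloc() is a quick way to confirm the alignment before looking at the copy itself.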
The alignment is fine, but when I pass two pointers to such structs as the destination and source of memcpy(), the compiler decides to use __memcpy_avx_unaligned() to copy.
How can I force clang to use the aligned AVX memcpy variant instead, which I assume is faster?
OS: Ubuntu 16.04.3 LTS, Clang: 3.8.0-2ubuntu4.
UPDATE
__memcpy_avx_unaligned() is invoked only when copying two or more structs. When copying just one, clang emits 14 vmovups instructions.
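The two cases can be reproduced with something like the sketch below. The function names copy_one() and copy_many() are hypothetical. What the compiler actually emits depends on its version and flags (for example -O2 -mavx), so the comments describe what to look for in the disassembly rather than guaranteed output.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <string.h>

typedef struct
{
    __m256 xl, xh;
    __m256 yl, yh;
    __m256 zl, zh;
    __m256i co;
} bloxset8_t;

/* Hypothetical: copy exactly one struct. The size is a compile-time
   constant (224 bytes), so the compiler is free to expand the copy
   inline as vector loads and stores instead of calling memcpy. */
void copy_one(bloxset8_t *dst, const bloxset8_t *src)
{
    memcpy(dst, src, sizeof *dst);
}

/* Hypothetical: copy n structs. The size is only known at run time,
   so this usually stays a call into the C library, which selects its
   own memcpy implementation independently of the compiler. */
void copy_many(bloxset8_t *dst, const bloxset8_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}
```

Compiling this with -S and comparing the bodies of the two functions should show whether the single-struct copy is inlined and whether the multi-struct copy remains a library call.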
Add an assert() before the memcpy that asserts that the address is 32-byte aligned. Some compilers can take these hints and use them for optimization. – Semen
[…] vmovaps), unfortunately I can't try 3.8 – Scribe
__m256 implies 32B alignment already. But yes, you should use aligned_alloc or posix_memalign for dynamic allocation. – Lumberjack
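A rough sketch of the alignment-hint idea from the comments, assuming a GCC-compatible compiler: the function name copy_aligned() is made up, and __builtin_assume_aligned is a GCC/Clang builtin rather than standard C. Whether either hint changes the generated code is compiler-dependent, and neither affects which implementation a library memcpy call picks at run time.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    __m256 xl, xh;
    __m256 yl, yh;
    __m256 zl, zh;
    __m256i co;
} bloxset8_t;

/* Hypothetical helper: copy n structs after checking and stating
   that both pointers are 32-byte aligned. */
void copy_aligned(bloxset8_t *dst, const bloxset8_t *src, size_t n)
{
    /* Runtime checks; compiled out when NDEBUG is defined. */
    assert(((uintptr_t)dst % 32) == 0);
    assert(((uintptr_t)src % 32) == 0);

    /* GCC/Clang builtin: tells the optimizer it may assume the
       returned pointer is aligned to 32 bytes. */
    bloxset8_t *d = __builtin_assume_aligned(dst, 32);
    const bloxset8_t *s = __builtin_assume_aligned(src, 32);

    memcpy(d, s, n * sizeof *d);
}
```

Since the struct's members already give bloxset8_t a 32-byte alignment requirement, the builtin mostly restates what the pointer type implies; the asserts are still useful for catching a misaligned allocation in debug builds.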