Related: Pascal Cuoq's blog post shows a case where GCC assumes aligned pointers (that two int* don't partially overlap): GCC always assumes aligned pointer accesses. He also links to a 2016 blog post (A bug story: data alignment on x86) that has the exact same bug as this question: auto-vectorization with a misaligned pointer -> segfault.
gcc4.8 makes a loop prologue that tries to reach an alignment boundary, but it assumes that uint16_t *p is 2-byte aligned, i.e. that some number of scalar iterations will make the pointer 16-byte aligned.
I don't think gcc ever intended to support misaligned pointers on x86; it just happened to work for non-atomic types without auto-vectorization. It's definitely undefined behaviour in ISO C to use a pointer to uint16_t with less than alignof(uint16_t)=2 alignment. GCC doesn't warn when it can see you breaking the rule at compile time, and actually happens to make working code (for malloc, where it knows the return-value minimum alignment), but that's presumably just an accident of the gcc internals, and shouldn't be taken as an indication of "support".
Try with -O3 -fno-tree-vectorize or -O2. If my explanation is correct, that won't segfault, because it will only use scalar loads (which as you say on x86 don't have any alignment requirements).
gcc knows malloc returns 16-byte aligned memory on this target (x86-64 Linux, where max_align_t has 16-byte alignment because long double has padding out to 16 bytes in the x86-64 System V ABI). It sees what you're doing and uses movdqu.
But gcc doesn't treat mmap as a builtin, so it doesn't know that it returns page-aligned memory, and applies its usual auto-vectorization strategy which apparently assumes that uint16_t *p is 2-byte aligned, so it can use movdqa after handling misalignment. Your pointer is misaligned and violates this assumption.
(I wonder if newer glibc headers use __attribute__((assume_aligned(4096))) to mark mmap's return value as aligned. That would be a good idea, and would probably have given you about the same code-gen as for malloc. Except it wouldn't work because it would break error-checking for mmap != (void*)-1, as @Alcaro points out with an example on Godbolt: https://gcc.godbolt.org/z/gVrLWT)
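For reference, here is a minimal sketch of what the assume_aligned function attribute looks like on a declaration. page_alloc is a hypothetical wrapper invented for illustration (it is not a glibc function), built on C11 aligned_alloc. It reports failure with NULL, which is trivially "aligned", so it avoids the (void*)-1 problem that makes the attribute unsuitable for mmap itself. This assumes GCC or clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical allocator wrapper (not part of any real library).
// The attribute promises the compiler that every returned pointer
// is a multiple of 4096; NULL (address 0) satisfies that promise.
__attribute__((assume_aligned(4096)))
void *page_alloc(size_t size);

void *page_alloc(size_t size) {
    // aligned_alloc wants the size to be a multiple of the alignment,
    // so round up to the next multiple of 4096.
    size_t rounded = (size + 4095) & ~(size_t)4095;
    return aligned_alloc(4096, rounded);
}
```

A caller that does p = page_alloc(n) gives the optimizer the same kind of alignment knowledge it has for malloc. Whether that changes the code-gen for a particular loop is something to check in the asm.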
Re: "on a CPU that is able to access unaligned":
SSE2 movdqa faults on unaligned addresses, and your elements are themselves misaligned, so you have the unusual situation where no array element starts at a 16-byte boundary.
SSE2 is baseline for x86-64, so gcc uses it.
Ubuntu 14.04LTS uses gcc4.8.2 (Off topic: which is old and obsolete, with worse code-gen in many cases than gcc5.4 or gcc6.4, especially when auto-vectorizing. It doesn't even recognize -march=haswell.)
14 is the minimum threshold for gcc's heuristics to decide to auto-vectorize your loop in this function, with -O3 and no -march or -mtune options.
I put your code on Godbolt, and this is the relevant part of main:
call mmap #
lea rdi, [rax+1] # p,
mov rdx, rax # buffer,
mov rax, rdi # D.2507, p
and eax, 15 # D.2507,
shr rax ##### rax>>=1 discards the low bit, assuming it's zero
neg rax # D.2507
mov esi, eax # prolog_loop_niters.7, D.2507
and esi, 7 # prolog_loop_niters.7,
je .L2
# .L2 leads directly to a MOVDQA xmm2, [rdx+1]
It figures out (with this block of code) how many scalar iterations to do before reaching MOVDQA, but none of the code paths lead to a MOVDQU loop. i.e. gcc doesn't have a code path to handle the case where p is odd.
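To make the arithmetic in that prologue concrete, here is a hand transcription of the and / shr / neg / and sequence into C. The function names are made up for illustration; this is a sketch of what the asm above computes, not gcc's internal logic.

```c
#include <stdint.h>

// Number of scalar uint16_t iterations gcc's prologue runs before the
// vector loop, transcribed from the asm above.
static unsigned prolog_iters(uintptr_t addr) {
    uintptr_t low = addr & 15;     // and eax, 15 : offset within 16 bytes
    low >>= 1;                     // shr rax     : bytes -> elements, drops the low bit
    low = (uintptr_t)0 - low;      // neg rax
    return (unsigned)(low & 7);    // and esi, 7  : elements left until a 16-byte boundary
}

// Address at which the vector loop would start after the scalar prologue.
static uintptr_t after_prolog(uintptr_t addr) {
    return addr + 2 * (uintptr_t)prolog_iters(addr);
}
```

For a 2-byte-aligned address such as 0x1002 this gives 7 iterations, and 0x1002 + 14 = 0x1010 is 16-byte aligned. For an odd address such as 0x1001 (the mmap result + 1) the shift throws away the low bit and the result is 0 iterations, so the vector loop starts right away at a misaligned address. No odd starting address can ever reach a 16-byte boundary in 2-byte steps.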
But the code-gen for malloc looks like this:
call malloc #
movzx edx, WORD PTR [rax+17] # D.2497, MEM[(uint16_t *)buffer_5 + 17B]
movzx ecx, WORD PTR [rax+27] # D.2497, MEM[(uint16_t *)buffer_5 + 27B]
movdqu xmm2, XMMWORD PTR [rax+1] # tmp91, MEM[(uint16_t *)buffer_5 + 1B]
Note the use of movdqu. There are some more scalar movzx loads mixed in: 8 of the 14 total iterations are done SIMD, and the remaining 6 with scalar. This is a missed-optimization: it could easily do another 4 with a movq load, especially because that fills an XMM vector after unpacking with zero to get uint32_t elements before adding.
(There are various other missed-optimizations, like maybe using pmaddwd with a multiplier of 1 to add horizontal pairs of words into dword elements.)
Safe code with unaligned pointers:
If you do want to write code which uses unaligned pointers, you can do it correctly in ISO C using memcpy. On targets with efficient unaligned load support (like x86), modern compilers will still just use a simple scalar load into a register, exactly like dereferencing the pointer. But when auto-vectorizing, gcc won't assume that an aligned address lines up with an element boundary, and will use unaligned loads.
memcpy is how you express an unaligned load / store in ISO C / C++.
#include <string.h>

int sum(int *p) {
    int sum=0;
    for (int i=0 ; i<10001 ; i++) {
        // sum += p[i];
        int tmp;
#ifdef USE_ALIGNED
        tmp = p[i];     // normal dereference
#else
        memcpy(&tmp, &p[i], sizeof(tmp));  // unaligned load
#endif
        sum += tmp;
    }
    return sum;
}
With gcc7.2 -O3 -DUSE_ALIGNED, we get the usual scalar until an alignment boundary, then a vector loop: (Godbolt compiler explorer)
.L4: # gcc7.2 normal dereference
add eax, 1
paddd xmm0, XMMWORD PTR [rdx]
add rdx, 16
cmp ecx, eax
ja .L4
But with memcpy, we get auto-vectorization with an unaligned load (with no intro/outro to handle alignment), unlike gcc's normal preference:
.L2: # gcc7.2 memcpy for an unaligned pointer
movdqu xmm2, XMMWORD PTR [rdi]
add rdi, 16
cmp rax, rdi # end_pointer != pointer
paddd xmm0, xmm2
jne .L2 # -mtune=generic still doesn't optimize for macro-fusion of cmp/jcc :(
# hsum into EAX, then the final odd scalar element:
add eax, DWORD PTR [rdi+40000] # this is how memcpy compiles for normal scalar code, too.
In the OP's case, simply arranging for pointers to be aligned is a better choice. It avoids cache-line splits for scalar code (or for vectorized the way gcc does it). It doesn't cost a lot of extra memory or space, and the data layout in memory isn't fixed.
But sometimes that's not an option. memcpy fairly reliably optimizes away completely with modern gcc / clang when you copy all the bytes of a primitive type. i.e. just a load or store, no function call and no bouncing to an extra memory location. Even at -O0, this simple memcpy inlines with no function call, but of course tmp doesn't optimize away.
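Applied to the uint16_t case from the question, the same idea might look like the sketch below. load_u16 and sum_u16_unaligned are hypothetical helper names, not code from the question; the point is only that every element access goes through memcpy, so the starting address may be odd without any undefined behaviour.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Read one uint16_t from any byte address (no alignment requirement).
static inline uint16_t load_u16(const void *src) {
    uint16_t v;
    memcpy(&v, src, sizeof v);   // expected to compile to a plain 16-bit load on x86
    return v;
}

// Sum n uint16_t elements that start at an arbitrary, possibly odd, address.
uint32_t sum_u16_unaligned(const unsigned char *bytes, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += load_u16(bytes + 2 * i);
    return sum;
}
```

Calling sum_u16_unaligned(buffer + 1, 14) on a byte buffer is well-defined regardless of where buffer came from.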
Anyway, check the compiler-generated asm if you're worried that it might not optimize away in a more complicated case, or with different compilers. For example, ICC18 doesn't auto-vectorize the version using memcpy.
uint64_t tmp=0; and then memcpy over the low 3 bytes compiles to an actual copy to memory and reload, so that's not a good way to express zero-extension of odd-sized types, for example.
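For concreteness, the 3-byte pattern described above looks like load3_memcpy below. load3_le_shift is one possible alternative that assembles the value from individual bytes in little-endian order. Both function names are hypothetical, and whether a given compiler produces better code for the second form is an assumption you'd need to check in the asm.

```c
#include <stdint.h>
#include <string.h>

// The pattern from the text: zero a 64-bit temporary, then memcpy 3 bytes
// into it. Which bytes of tmp get written depends on host byte order; on a
// little-endian host they are the low 3 bytes.
static inline uint64_t load3_memcpy(const unsigned char *src) {
    uint64_t tmp = 0;
    memcpy(&tmp, src, 3);
    return tmp;
}

// Alternative sketch: build the zero-extended value from single bytes,
// explicitly in little-endian order, independent of the host.
static inline uint64_t load3_le_shift(const unsigned char *src) {
    return (uint64_t)src[0]
         | ((uint64_t)src[1] << 8)
         | ((uint64_t)src[2] << 16);
}
```

On a little-endian target such as x86 both functions return the same value.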
GNU C __attribute__((aligned(1))) and may_alias
Instead of memcpy (which won't inline on some ISAs when GCC doesn't know the pointer is aligned, i.e. exactly this use-case), you can also use a typedef with a GCC attribute to make an under-aligned version of a type.
typedef int __attribute__((aligned(1), may_alias)) unaligned_aliasing_int;
typedef unsigned long __attribute__((may_alias, aligned(1))) unaligned_aliasing_ulong;
related: Why does glibc's strlen need to be so complicated to run quickly? shows how to make a word-at-a-time bithack C strlen safe with this.
Note that it seems ICC doesn't respect __attribute__((may_alias)), but gcc/clang do. I was recently playing around with that trying to write a portable and safe 4-byte SIMD load like _mm_loadu_si32 (which GCC is missing). https://godbolt.org/z/ydMLCK has various combinations of safe everywhere but inefficient code-gen on some compilers, or unsafe on ICC but good everywhere.
aligned(1) may be less bad than memcpy on ISAs like MIPS where unaligned loads can't be done in one instruction.
You use it like any other pointer.
unaligned_aliasing_int *p = something;
int tmp = *p++;
int tmp2 = *p++;
And of course you can index it as normal like p[i].
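Putting the typedef and the indexing together, a complete function might look like this sketch. sum_unaligned is a hypothetical name, and the code assumes GNU C (gcc or clang), since the attributes are not ISO C.

```c
#include <stddef.h>

// Under-aligned, aliasing-safe int type, as in the typedef above (GNU C only).
typedef int __attribute__((aligned(1), may_alias)) unaligned_aliasing_int;

// Sum n ints stored at an arbitrary byte address.
int sum_unaligned(const void *bytes, size_t n) {
    const unaligned_aliasing_int *p = bytes;  // no alignment is assumed for *p
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];                          // the compiler must allow a misaligned load here
    return sum;
}
```

Because the alignment is part of the pointed-to type, the compiler can't assume the pointer is 4-byte aligned when it decides how to load or vectorize.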
[...] gcc, you can get the assembler output with the -S option, the assembler output will be written to a .s file.) – Jennifer
[...] mmap() succeeds? Maybe it returns an error... – Dameron
[...] volatile uint8_t dummy = buffer[0]; just after the malloc call? Same bug? What I'm fishing for is that the actual heap allocation may be delayed until the data is actually used. Since the contents of the buffer returned from malloc is guaranteed to hold unspecified values, the C compiler might think that it doesn't have to actually allocate anything. – Swithbart
[...] malloc intrinsic may be omitted entirely? – Rookie
[...] malloc() and not with mmap() – Dameron
[...] mmap. If the dummy read I suggested causes the same bug, then the problem is with the system. Otherwise with the use of mmap. – Swithbart
[...] volatile line you suggested doesn't change anything. However in the meantime the question has been answered to my satisfaction. – Unclose
[...] uint16_t *p = (buffer + 1)" does that compile? – Wroth