It used to be that ARM processors (ARMv5 and below) were unable to handle unaligned memory access at all. Something like u32 var32 = *(u32*)ptr; would simply fail (raise an exception) if ptr was not aligned on 4 bytes.
Such a statement works fine on x86/x64 though, since those CPUs have always handled unaligned accesses very efficiently. But according to the C standard, this is not a "proper" way to write it: a u32 must be aligned on 4 bytes, so casting a misaligned pointer to u32* and dereferencing it is undefined behavior.
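To make that concrete, here is a small C11 sketch (the exact alignment is implementation-defined; 4 is the typical answer on ARM and x86 ABIs):

#include <stdalign.h>
#include <stdio.h>

typedef unsigned int u32;

int main(void)
{
    unsigned char buffer[8] = {0};

    /* Implementation-defined, but 4 on typical ARM/x86 ABIs */
    printf("alignof(u32) = %zu\n", alignof(u32));

    /* buffer+1 is (almost certainly) not 4-byte aligned, so this
       cast-and-dereference is undefined behavior in ISO C,
       no matter what the underlying CPU is capable of. */
    u32 var32 = *(const u32*)(buffer + 1);
    printf("%u\n", var32);
    return 0;
}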
A proper way to achieve the same result, while remaining strictly correct and fully compatible with any CPU, is:
u32 read32(const void* ptr)
{
    u32 result;
    memcpy(&result, ptr, 4);
    return result;
}
This version is correct, and will generate proper code for any CPU, whether or not it can read from unaligned addresses. Even better, on x86/x64 it is optimized into a single read operation, so it has the same performance as the first statement. It's portable, safe, and fast. Who could ask for more?
Well, the problem is that on ARM we are not so lucky.
Writing the memcpy version is indeed safe, but it seems to systematically result in cautious code, which is very slow on ARMv6 and ARMv7 (basically, any smartphone).
In a performance-oriented application which relies heavily on read operations, the difference between the 1st and the 2nd version is measurable: it stands at > 5x at gcc -O2 settings. This is far too much to be ignored.
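For illustration only, a stand-alone micro-benchmark along these lines can be used to compare the two approaches; the buffer size and loop structure are arbitrary, and this sketch is not the actual measurement quoted above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef unsigned int u32;

/* the questionable direct access */
static u32 read32_cast(const void* ptr) { return *(const u32*)ptr; }

/* the standard-conforming version */
static u32 read32_memcpy(const void* ptr)
{
    u32 result;
    memcpy(&result, ptr, 4);
    return result;
}

int main(void)
{
    enum { SIZE = 1 << 22 };
    unsigned char* buf = malloc(SIZE + 8);
    if (!buf) return 1;
    memset(buf, 0x5A, SIZE + 8);

    const unsigned char* p = buf + 1;      /* deliberately misaligned */
    volatile u32 sink = 0;

    clock_t t0 = clock();
    for (int i = 0; i + 4 <= SIZE; i += 4) sink += read32_memcpy(p + i);
    clock_t t1 = clock();
    for (int i = 0; i + 4 <= SIZE; i += 4) sink += read32_cast(p + i);   /* UB, for comparison only */
    clock_t t2 = clock();

    printf("memcpy: %ld ticks, cast: %ld ticks (sink=%u)\n",
           (long)(t1 - t0), (long)(t2 - t1), sink);
    free(buf);
    return 0;
}

On x86/x64 the two loops should come out essentially identical; on ARM, the gap is what the figure above refers to.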
Trying to find a way to use the ARMv6/v7 capabilities, I've looked for guidance in a few example codebases around. Unfortunately, they seem to go with the first statement (direct u32 access), which is not supposed to be correct.
That's not all: newer GCC versions now implement auto-vectorization. On x64 that means SSE/AVX; on ARMv7 that means NEON. ARMv7 also supports "Load Multiple" (LDM) and "Store Multiple" (STM) opcodes, which require the pointer to be aligned.
What does that mean? The compiler is free to use these advanced instructions even when they are not explicitly requested from the C code (no intrinsics). To make that decision, it relies on the fact that a u32* pointer is supposed to be aligned on 4 bytes. If it is not, then all bets are off: undefined behavior, crashes.
So even on CPUs which do support unaligned memory access, it is now dangerous to use direct u32 access, as it can lead to buggy code generation at high optimization settings (-O3).
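To illustrate the risk with a made-up example: a compiler is entitled to turn a loop like the following into NEON or LDM sequences precisely because data is declared as u32*, so feeding it a misaligned pointer is asking for trouble at -O3:

#include <stddef.h>

typedef unsigned int u32;

/* At -O3 on ARMv7, GCC may vectorize this with NEON loads or use LDM,
   both on the assumption that 'data' is 4-byte aligned. If the caller
   passes a pointer into the middle of a byte buffer, the behavior is
   undefined and may fault, even though a plain LDR would have worked. */
u32 sum32(const u32* data, size_t count)
{
    u32 total = 0;
    for (size_t i = 0; i < count; i++)
        total += data[i];
    return total;
}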
So here is the dilemma: how can I get the native unaligned-access performance of ARMv6/v7 without writing the incorrect direct u32 access?
PS: I've also tried the __packed() qualifier, and from a performance perspective it seems to behave exactly like the memcpy method.
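For reference, the rough GCC equivalent of that idea is a packed struct wrapper; here is a minimal sketch (the type and function names are just for illustration):

typedef unsigned int u32;

/* GCC/Clang extension: the packed attribute drops the alignment
   requirement of the wrapped member to 1, so the compiler must emit
   a load that is safe for misaligned addresses. */
typedef struct { u32 v; } __attribute__((packed)) unaligned_u32;

static u32 read32_packed(const void* ptr)
{
    return ((const unaligned_u32*)ptr)->v;
}

On GCC this typically compiles to the same cautious code as the memcpy version, which matches what the PS above reports.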
[Edit]: Thanks for the excellent input received so far.
Looking at the generated assembly, I could confirm @Notlikethat's finding that the memcpy version does indeed generate a proper ldr opcode (unaligned load). However, I also found that the generated assembly uselessly emits a str (store). So the complete operation is an unaligned load, an aligned store, and then a final aligned load. That's a lot more work than necessary.
Answering @haneefmubarak: yes, the code is properly inlined. And no, memcpy is very far from providing the best possible speed, since forcing the code to use direct u32 access translates into huge performance gains. So a better possibility must exist.
A big thank you to @artless_noise. The link to the godbolt service is invaluable. I've never been able to see so clearly the correspondence between C source code and its assembly representation. This is highly inspiring.
I completed one of @artless_noise's examples, and it gives the following:
#include <stdlib.h>
#include <memory.h>

typedef unsigned int u32;

u32 reada32(const void* ptr) { return *(const u32*) ptr; }

u32 readu32(const void* ptr)
{
    u32 result;
    memcpy(&result, ptr, 4);
    return result;
}
Once compiled with ARM GCC 4.8.2 at -O3 or -O2:
reada32(void const*):
    ldr  r0, [r0]
    bx   lr
readu32(void const*):
    ldr  r0, [r0]       @ unaligned
    sub  sp, sp, #8
    str  r0, [sp, #4]   @ unaligned
    ldr  r0, [sp, #4]
    add  sp, sp, #8
    bx   lr
Quite telling...
Comments:

"ldr r0, [r0]; str r0, [sp, #4]; ldr r0, [sp, #4]. Shame it can't elide the use of the local variable entirely, but there's your unaligned word load right there; no multiple byte loads or out-of-line call to memcpy." – Whall

"ldr w0, [x0]; ret" – Whall

"[...] memcpy version, which is much cleaner to read and is fully standard (well, at least as long as standard libs are available ...)" – Schumacher

"[...] bx lr an x86 instruction? Yes, this is a GCC/ARM problem, that's rather my point - other ARM-targeted compilers optimise the memcpy to a single unaligned load (at least I checked Clang and armcc, I don't have others like IAR or the TI one to hand to test). GCC 5.2 is still stupid. Realistically, I rather doubt that there exists a simple solution which is clear, correct, portable, optimal everywhere and works around a GCC performance bug..." – Whall