Take advantage of ARM unaligned memory access while writing clean C code
It used to be that ARM processors (ARMv5 and below) were unable to handle unaligned memory access properly. Something like u32 var32 = *(u32*)ptr; would simply fail (raise an exception) if ptr was not properly aligned on 4 bytes.

Writing such a statement works fine on x86/x64 though, since those CPUs have always handled this situation very efficiently. But according to the C standard, it is not a "proper" way to write it: dereferencing a u32* effectively promises the compiler a 4-byte object aligned on a 4-byte boundary.

A proper way to achieve the same result while remaining standards-compliant and fully compatible with any CPU is:

u32 read32(const void* ptr) 
{ 
    u32 result; 
    memcpy(&result, ptr, 4); 
    return result; 
}

This one is correct and will generate proper code for any CPU, whether or not it can read at unaligned positions. Even better, on x86/x64 it is optimized into a single read operation, hence it has the same performance as the first statement. It's portable, safe, and fast. Who could ask for more?

Well, problem is, on ARM, we are not so lucky.

Writing the memcpy version is indeed safe, but it seems to result in systematically cautious code, which is very slow on ARMv6 and ARMv7 (basically, any smartphone).

In a performance-oriented application which relies heavily on read operations, the difference between the first and second versions can be measured: it stands at more than 5x at gcc -O2 settings. That is far too much to ignore.
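(For reference, this kind of measurement can be reproduced with a rough harness along the following lines. This is a hypothetical sketch, not the benchmark behind the figure above; buffer size and repetition count are arbitrary.)

#include <stdio.h>
#include <string.h>
#include <time.h>

typedef unsigned int u32;

static u32 read32_memcpy(const void* p) { u32 v; memcpy(&v, p, sizeof v); return v; }

int main(void)
{
    static unsigned char buf[1 << 20];
    volatile u32 sink = 0;
    clock_t t0 = clock();
    for (int rep = 0; rep < 100; rep++)
        for (size_t i = 1; i + 4 <= sizeof buf; i += 4)   /* i starts at 1: unaligned */
            sink += read32_memcpy(buf + i);
    printf("memcpy reads: %.3fs (sink=%u)\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC, (u32)sink);
    return 0;
}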

Trying to find a way to use ARMv6/v7 capabilities, I've looked for guidance in a few example codes around. Unfortunately, they all seem to settle on the first statement (direct u32 access), which is not supposed to be correct.

That's not all: newer GCC versions now implement auto-vectorization. On x64 that means SSE/AVX; on ARMv7 it means NEON. ARMv7 also supports the "Load Multiple" (LDM) and "Store Multiple" (STM) opcodes, which require the pointer to be aligned.

What does that mean? Well, the compiler is free to use these advanced instructions even if they were not specifically requested from the C code (no intrinsics). To take such a decision, it uses the fact that a u32* pointer is supposed to be aligned on 4 bytes. If it's not, then all bets are off: undefined behavior, crashes.

That means that even on CPUs which support unaligned memory access, it is now dangerous to use direct u32 access, as it can lead to buggy code generation at high optimization settings (-O3).

So here is the dilemma: how can one access the native unaligned-access performance of ARMv6/v7 without writing the incorrect direct u32 version?

PS: I've also tried __packed qualifiers, and from a performance perspective they behave exactly like the memcpy method.
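(For reference, a minimal sketch of that packed approach, assuming GCC/Clang syntax; armcc spells it __packed, and other compilers differ again, which is exactly the portability problem:)

typedef unsigned int u32;

typedef struct { u32 v; } __attribute__((packed)) unalign32;

u32 read32_packed(const void* ptr)
{
    return ((const unalign32*)ptr)->v;   /* the compiler emits an unaligned-safe load */
}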

[Edit]: Thanks for the excellent input received so far.

Looking at the generated assembly, I could confirm @Notlikethat's finding that the memcpy version does indeed generate a proper ldr opcode (unaligned load). However, I also found that the generated assembly needlessly emits a str instruction. So the complete operation is an unaligned load, an aligned store, and then a final aligned load. That's a lot more work than necessary.

Answering @haneefmubarak: yes, the code is properly inlined. And no, memcpy is very far from providing the best possible speed, since forcing the code to accept direct u32 access translates into huge performance gains. So some better possibility must exist.

A big thank you to @artless_noise. The link to the godbolt service is invaluable. I've never been able to see so clearly the equivalence between a C source and its assembly representation. This is highly inspiring.

I completed one of @artless_noise's examples, and it gives the following:

#include <stdlib.h>
#include <string.h>
typedef unsigned int u32;

u32 reada32(const void* ptr) { return *(const u32*) ptr; }

u32 readu32(const void* ptr) 
{ 
    u32 result; 
    memcpy(&result, ptr, 4); 
    return result; 
}

Once compiled using ARM GCC 4.8.2 at -O3 or -O2:

reada32(void const*):
    ldr r0, [r0]
    bx  lr
readu32(void const*):
    ldr r0, [r0]    @ unaligned
    sub sp, sp, #8
    str r0, [sp, #4]    @ unaligned
    ldr r0, [sp, #4]
    add sp, sp, #8
    bx  lr

Quite telling...

Schumacher answered 18/8, 2015 at 3:9 Comment(19)
I doubt that you'll be able to find anything faster than memcpy, unfortunately. – Lake
It is not dangerous to use u32. It's dangerous to tell the compiler that you know better than it what the thing it is accessing is (explicit casting), when this is in fact not true. – Culicid
No repro. Using a Linaro GCC 4.8.3, with -march=armv6 and -O1, the above function compiles to essentially ldr r0, [r0]; str r0, [sp, #4]; ldr r0, [sp, #4]. Shame it can't elide the use of the local variable entirely, but there's your unaligned word load right there; no multiple byte loads or out-of-line call to memcpy. – Whall
For instance godbolt gives real output and an example with main. – Seeing
Thanks for these insightful elements. I've updated the question with new information thanks to godbolt. – Schumacher
Frankly I think the useless stack touching is probably just a bug/missing optimisation in ARM GCC. With AArch64 GCC, the unoptimised code looks like that ARM code; at -O1 it compiles the whole thing to ldr w0, [x0]; ret. – Whall
I can understand and accept the argument that GCC is simply missing a trivial optimisation. That being said, GCC is the compiler used out there, and I can't just default to "hey, the performance is crap, but it's the fault of GCC". I have to find a solution, and it needs to work today, with current compilers. – Schumacher
Yup, Clang agrees that GCC is just being rubbish. – Whall
@Cyan: yes, of course you need to get decent performance with the compilers out there today. But you still haven't shown any real code causing suboptimal behaviour - and if what you're working on (which the example you've posted suggests) is getting illegal C code working by inserting code shims to deal with incorrectly specified alignment of variables, instead of fixing the incorrect alignment specifications, you will not get there regardless of compiler. Any chance you could ... I dunno, post a new question with a specific real code sequence that generates suboptimal code? – Culicid
@Cyan: to clarify - I would like to write an answer, but don't see quite how I can with the way this question is written. Short of risking it turning into a generic "get off my lawn"-style semi-rant about the evils of C code that could only ever work on x86. – Culicid
@Whall: it's not that simple. If you select the x86 target, even with GCC you get the same compact assembly output. The problem seems specific to the GCC/ARM combination; and godbolt doesn't provide a Clang/ARM combination to compare to. – Schumacher
@Culicid Here is an example code if you wish. – Schumacher
@Cyan: (commenting on your response to Notlikethat) To be precise, the problem seems specific to the not-x86/GCC combination. As for the code you linked to, it's basically doing its damnedest to fight the compiler. If ptr actually points to a u32 stored unaligned, why not pass it in as a pointer to an __attribute__((packed)) struct holding only a u32 value, and let the compiler sort the alignment fiddling out? Like so: goo.gl/PDevc2 – Culicid
@Culicid: this is also my current conclusion, but I was hoping to find another one. The problem I've got with the packed attribute is that it is not standard C: it is a compiler-specific extension. As a consequence, expressing this characteristic is different with each compiler, and sometimes it also differs depending on compiler version. For portable code, this is a nightmare to manage. That's why I initially switched to the memcpy version, which is much cleaner to read and is fully standard (well, at least as long as standard libs are available ...). – Schumacher
Let us continue this discussion in chat. – Culicid
@Cyan Huh? The example I linked is using the ARM LLVM backend (hence the "-target arm" option to Clang) - since when was bx lr an x86 instruction? Yes, this is a GCC/ARM problem, that's rather my point - other ARM-targeted compilers optimise the memcpy to a single unaligned load (at least I checked Clang and armcc, I don't have others like IAR or the TI one to hand to test). GCC 5.2 is still stupid. Realistically, I rather doubt that there exists a simple solution which is clear, correct, portable, optimal everywhere and works around a GCC performance bug... – Whall
@Whall: yes, sorry. The compiler says "x86 clang" and I read it as "x86 gcc". But these compilers have in fact different ways of dealing with multiple targets: gcc needs a different binary for each platform, while clang gets it from a command-line parameter. – Schumacher
I assume this is related to xxhash and others of your projects. Since we know that memcpy is probably the fastest way to read unaligned memory, why don't you memcpy to a buffer of 128 ints or something like that and then use them, instead of memcpy-ing them one by one? – Frazier
You mean, for CPUs which do not support unaligned access? That's probably a good idea for read performance on such targets. However, it also consumes memory (128x4 = 512 bytes). While that doesn't seem much, some environments, like kernel space for example, will not like it. There are also micro-thread environments, where each thread has a tiny amount of stack space available. Anyway, what I mean is: this trade-off is accessible from user space, but seems out of scope for a library like xxhash, as it would restrain the number of compatible environments. – Schumacher

OK, the situation is more confusing than one would like. So, in an effort to clarify, here are the findings from this journey:

accessing unaligned memory

  1. The only portable, C-standard solution to access unaligned memory is the memcpy one. I was hoping to get another one through this question, but apparently it's the only one found so far.

Example code :

u32 read32(const void* ptr)  { 
    u32 value; 
    memcpy(&value, ptr, sizeof(value)); 
    return value;  }

This solution is safe in all circumstances. It also compiles into a trivial load register operation on x86 target using GCC.

However, on ARM targets using GCC, it translates into a needlessly large assembly sequence, which bogs down performance.

Using Clang on an ARM target, memcpy works fine (see @Notlikethat's comments above). It would be easy to blame GCC at large, but it's not that simple: the memcpy solution works fine with GCC on x86/x64, PPC and ARM64 targets. Lastly, trying another compiler, icc 13, the memcpy version is surprisingly heavier on x86/x64 (4 instructions, where one should be enough). And that's just the combinations I could test so far.

I have to thank the godbolt project for making such statements easy to observe.

  2. The second solution is to use __packed structures. This solution is not standard C and depends entirely on compiler extensions. As a consequence, the way to write it depends on the compiler, and sometimes on its version. This is a mess for the maintenance of portable code.

That being said, in most circumstances, it leads to better code generation than memcpy. In most circumstances only ...

For example, regarding the above cases where the memcpy solution does not work, here are the findings:

  • on x86 with ICC: the __packed solution works
  • on ARMv7 with GCC: the __packed solution works
  • on ARMv6 with GCC: it does not work; the assembly looks even uglier than with memcpy.

  3. The last solution is to use direct u32 access to unaligned memory positions. This solution worked for decades on x86 CPUs, but is not recommended, as it violates the C standard: the compiler is authorized to treat such a statement as a guarantee that the data is properly aligned, leading to buggy code generation.

Unfortunately, in at least one case, it is the only solution able to extract full performance from the target: namely, GCC on ARMv6.

Do not use this solution for ARMv7 though: GCC can generate instructions which are reserved for aligned memory accesses, namely LDM (Load Multiple), leading to crashes.

Even on x86/x64, it has become dangerous to write code this way nowadays, as newer compilers may try to auto-vectorize compatible loops, generating SSE/AVX code based on the assumption that these memory positions are properly aligned, crashing the program.
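(To illustrate with a hypothetical loop, not code from the question: with direct u32 access, the compiler is entitled to assume the pointer is aligned and may vectorize with alignment-sensitive instructions.)

typedef unsigned int u32;

/* At -O3 the compiler may assume 'p' is 4-byte aligned and vectorize
   this loop with LDM on ARM or aligned SSE/AVX loads on x86. If 'p'
   is actually unaligned, the program can crash. */
u32 sum32_direct(const u32* p, int n)
{
    u32 sum = 0;
    for (int i = 0; i < n; i++)
        sum += p[i];   /* direct access: undefined behavior if p is unaligned */
    return sum;
}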

As a recap, here are the results summarized as a table, listing the best working method for each combination, with the preference order memcpy > packed > direct:

| compiler  | x86/x64 | ARMv7  | ARMv6  | ARM64  |  PPC   |
|-----------|---------|--------|--------|--------|--------|
| GCC 4.8   | memcpy  | packed | direct | memcpy | memcpy |
| clang 3.6 | memcpy  | memcpy | memcpy | memcpy |   ?    |
| icc 13    | packed  | N/A    | N/A    | N/A    | N/A    |
Schumacher answered 19/8, 2015 at 12:8 Comment(2)
This chart is handy, but it seems that since about gcc 5, an -march=armv7-a build will be fine with the memcpy() variant. The issue is the way that older ARM CPUs would handle unaligned reads/writes. So anyone reading this post now (2019) should be aware that -march values will affect things significantly. It is possible that the GCC ARM back end (and infrastructure) was updated to grok that newer ARM CPUs are OK with unaligned access. See: Linux trapping un-aligned access for some more on the topic. – Seeing
Any updates 9 years later? I assume the compiler writers have added a lot more optimization in the meantime. – Stockade

Part of the issue is likely that you are not allowing for easy inlining and further optimization. Having a specialized function for the load means that a function call may be emitted upon each call, which can reduce performance.

One thing you might do is use static inline, which will allow the compiler to inline the function read32(), thus increasing performance. However, at higher levels of optimization, the compiler should already be inlining this for you.
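For instance, a minimal sketch based on the question's read32() helper:

#include <string.h>

typedef unsigned int u32;

/* 'static inline' lets the compiler fold the helper into each caller
   instead of emitting a function call. */
static inline u32 read32(const void* ptr)
{
    u32 result;
    memcpy(&result, ptr, sizeof(result));
    return result;
}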

If the compiler inlines a 4-byte memcpy, it will likely transform it into the most efficient series of loads or stores that still works on unaligned boundaries. Therefore, if you are still seeing low performance even with compiler optimizations enabled, it may be that this is the maximum performance for unaligned reads and writes on the processors you are using. Since you said that "__packed instructions" yield performance identical to memcpy(), this would seem to be the case.


At this point, there is very little that you can do except align your data. However, if you are dealing with a contiguous array of unaligned u32s, there is one thing you could do:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// make an aligned copy of n unaligned u32s
uint32_t *align32 (const void *p, size_t n) {
    uint32_t *r = malloc (n * sizeof (uint32_t));

    if (r)
        memcpy (r, p, n * sizeof (uint32_t));   // copy n values, not just n bytes

    return r;
}

This just allocates a new array using malloc(), because malloc() and friends return memory with correct alignment for everything:

The malloc() and calloc() functions return a pointer to the allocated memory that is suitably aligned for any kind of variable.

- malloc(3), Linux Programmer's Manual

This should be relatively fast, as you only have to do this once per set of data. Also, while copying, memcpy() only needs to adjust for the initial lack of alignment and can then use the fastest aligned load and store instructions available, after which you will be able to deal with your data using normal aligned reads and writes at full performance.
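Hypothetical usage (src and count are assumed inputs, not names from the question):

/* Copy the unaligned data once, then process the aligned copy. */
uint32_t *aligned = align32 (src, count);
if (aligned) {
    /* ... read aligned[0] .. aligned[count - 1] normally ... */
    free (aligned);
}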

Symphonic answered 18/8, 2015 at 3:58 Comment(0)
