Fastest way to work with unaligned data on a word-aligned processor?
I'm doing a project on an ARM Cortex-M0, which does not support unaligned (non-4-byte-aligned) access, and I'm trying to optimize the speed of operations on unaligned data.

I'm storing Bluetooth Low Energy access addresses (48-bit) as 6-byte arrays in some packed structs acting as packet buffers. Because of the packing, the BLE addresses don't necessarily start at a word-aligned address, and I'm running into some complications when optimizing my access functions for these addresses.

The first, and most obvious approach is a for loop operating on each byte in the array individually. Checking if two addresses are the same could for instance be done like this:

uint8_t ble_adv_addr_is_equal(uint8_t* addr1, uint8_t* addr2)
{
  for (uint32_t i = 0; i < 6; ++i)
  {
    if (addr1[i] != addr2[i])
      return 0;
  }
  return 1;
}

I'm doing a lot of comparisons in my project, and I wanted to see if I could squeeze some more speed out of this function. I realised that for aligned addresses, I could read them as uint64_t and compare with 48-bit masks applied, i.e.

(*(uint64_t*)&addr1[0] & 0xFFFFFFFFFFFF) == (*(uint64_t*)&addr2[0] & 0xFFFFFFFFFFFF)

Similar operations could be done for writing, and it works well for aligned versions. However, since my addresses aren't always word-aligned (or even half-word), I would have to do some extra tricks to make this work.
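For reference, the aligned fast path can be written as a small helper. This is only a sketch (the function name and the XOR-then-mask formulation are mine); it assumes both pointers are word-aligned and that 8 bytes are readable at each address, e.g. because the buffers carry two bytes of padding:

```c
#include <stdint.h>

/* Sketch: 48-bit equality via one 64-bit load per side.
   Assumes both pointers are word-aligned and that 8 bytes are
   readable at each address (e.g. padded buffers). */
static uint8_t ble_adv_addr_is_equal_aligned(const uint8_t *addr1,
                                             const uint8_t *addr2)
{
    uint64_t a = *(const uint64_t *)(const void *)addr1;
    uint64_t b = *(const uint64_t *)(const void *)addr2;
    /* XOR first, then mask: any difference in the low 48 bits survives */
    return ((a ^ b) & 0xFFFFFFFFFFFFULL) == 0;
}
```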

First off, I came up with this unoptimized nightmare of a compiler macro:

#define ADDR_ALIGNED(_addr) (uint64_t)(((*((uint64_t*)(((uint32_t)_addr) & ~0x03)) >> (8*(((uint32_t)_addr) & 0x03))) & 0x000000FFFFFFFF)\
                                    | (((*((uint64_t*)(((uint32_t)_addr+4) & ~0x03))) << (32-8*(((uint32_t)_addr) & 0x03)))) & 0x00FFFF00000000)

It basically shifts the entire address so that it starts at the previous word-aligned memory position, regardless of offset. For instance:

    0       1       2       3
|-------|-------|-------|-------|
|.......|.......|.......|<ADDR0>|
|<ADDR1>|<ADDR2>|<ADDR3>|<ADDR4>|
|<ADDR5>|.......|.......|.......|

becomes

    0       1       2       3
|-------|-------|-------|-------|
|<ADDR0>|<ADDR1>|<ADDR2>|<ADDR3>|
|<ADDR4>|<ADDR5>|.......|.......|
|.......|.......|.......|.......|

and I can safely do a 64-bit comparison of two addresses, regardless of their actual alignment:

ADDR_ALIGNED(addr1) == ADDR_ALIGNED(addr2)

Neat! But this operation takes 71 lines of assembly when compiled with the ARM-MDK, compared to 53 when doing the comparison in a simple for loop (I'm just going to disregard the additional time spent in the branch instructions here), and ~30 when unrolled. Also, it doesn't work for writes, as the alignment only happens in the registers, not in memory. Unaligning it again would require a similar operation, and the whole approach generally seems to suck.

Is an unrolled for loop working on each byte individually really the fastest solution for cases like this? Does anyone have any experience with similar scenarios, and feel like sharing some of their wizardry here?

Prickly answered 24/4, 2015 at 20:5
Is your data guaranteed to begin (the 1st byte of source data) on a 4-byte boundary, and is your final destination also on a 4-byte boundary? If not, can you at least guarantee that the whole thing is at least 2-byte aligned? If so, then you can do some wizardry.Doublecross
Why not store your bluetooth addresses directly using an array of uint64_t?Rajiv
@Rajiv -- the OP's on a Cortex-M0, so there's not exactly a ton of RAM to burn; also, wire formats sometimes mandate packed data.Federalese
He's reading a 48-bit value from the middle of a stream that is at best 16-bit aligned. ARM won't do that unaligned read for you by default.Doublecross
@MichaelDorgan: That's a no on all points, I'm afraid. The addresses can come directly from on-air packet contents, where they may have any given byte offset. However, about 60% of my comparisons are against "neighbor lists", where I may align as I choose. For these cases, I might be better off making an aligned copy of my unaligned address, and create an optimized aligned version of the function. This would save me some time for long lists, but I'm not sure if the overhead would be worth it.Prickly
I suggest __attribute__((aligned(2))) struct BLEA { uint8_t data[6]; }; to start with, and making a copy into it if needed.Womble
I believe it is wrong to concentrate solely on an instruction count. Your memory bandwidth is also a concern. Loading multiple bytes takes several cycles on the bus. Your TCM memory is fast (and you haven't said where the buffers are). You should look at the memcpy() and memset() implementations for the Cortex-M to see how the head/tail casing is done. I think a switch(unaligned&3) will be efficient, but the code will not be small.Petua
Close-voters: I'm voting to leave this open, because it is not "a matter of opinion" whether a particular approach is faster: in this specific case it's measurable. This question is in the spirit of SO, even if it's hard to reword it so it doesn't fall foul of the guidelines. It can be expected to draw useful, enduring and factual answers.Haney

UPDATE

Ok, because your data has no alignment whatsoever, you need to either read all the data in, byte by byte, into properly aligned buffers and then do really fast 64-bit compares, or, if you won't be using the data after the compares, just read the data in as bytes and do 6 compares; in that case, calling memcmp() might be a better option.


For at least 16-bit aligned:


 uint16_t *src1 = (uint16_t *)addr1;
 uint16_t *src2 = (uint16_t *)addr2;

 for (int i = 0; i < 3; ++i)
 {
    if (src1[i] != src2[i])
      return 0;
 }

 return 1;

Will be twice as fast as byte comparisons and might be the best you can reasonably do as long as your data is at least 2-byte aligned. I'd also expect the compiler to remove the for loop completely and just use conditionally executed if statements instead.

Trying for 32-bit aligned reads will not be faster unless you can guarantee that source1 and source2 are similarly aligned ((addr1 & 0x03) == (addr2 & 0x03)). If this is the case, then you can read in a 32-bit value and then a 16-bit one (or vice versa, depending on starting alignment) and remove 1 more compare.

As 16-bit is your shared base, you can start there and the compiler should generate nice ldrh type opcodes.
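When the two pointers do share the same offset within a word, the mixed 32-bit/16-bit compare described above might look like this. This is only a sketch (the function name is mine); it assumes (addr1 & 3) == (addr2 & 3) and at least 2-byte alignment:

```c
#include <stdint.h>

/* Sketch: 6-byte compare as one 32-bit and one 16-bit read.
   Requires (addr1 & 3) == (addr2 & 3) and 2-byte alignment. */
static uint8_t addr_cmp_32_16(const uint8_t *addr1, const uint8_t *addr2)
{
    if (((uintptr_t)addr1 & 3) == 0) {
        /* word-aligned start: 32-bit read, then 16-bit read */
        return *(const uint32_t *)(const void *)addr1 ==
               *(const uint32_t *)(const void *)addr2 &&
               *(const uint16_t *)(const void *)(addr1 + 4) ==
               *(const uint16_t *)(const void *)(addr2 + 4);
    }
    /* halfword-aligned start: 16-bit read, then 32-bit read */
    return *(const uint16_t *)(const void *)addr1 ==
           *(const uint16_t *)(const void *)addr2 &&
           *(const uint32_t *)(const void *)(addr1 + 2) ==
           *(const uint32_t *)(const void *)(addr2 + 2);
}
```

Either branch does two compares instead of three 16-bit ones.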

Doublecross answered 24/4, 2015 at 20:22
For all of your disagreements with me, I'd like to point out that memcmp() is not inlined by any of the compilers on gcc.godbolt.org.Gringo
Lol - I'll take your word for it. I dunno.Doublecross
Your best bet might be to memcpy 2 bytes into a uint16_t and 4 bytes into a uint32_t, and compare those as a way to express the unaligned load.Quartile
And BTW, you might think that uint64_t foo = 0; and then loading the first 6 bytes with memcpy(&foo, buf, 6) would be good too, but gcc doesn't detect that special case and doesn't just keep foo in a register. It does inline memcpy, but with load / store to the stack and then a reload when it wants to actually use foo.Quartile
IIRC gcc has a specific optimization flag to inline string ops (including mem*()) and a way to specify the algorithms used (byte, word, etc...)Renoir
This issue is less and less so with more modern ARM processors being barely phased by misaligned reads. Makes things so much simpler.Doublecross

You might get your compiler to choose the fastest way for you:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

uint64_t unalignedload(char const *packed)
{
  uint64_t buffer;
  memcpy(&buffer, packed, 8);
  return buffer;
}

This is not quite what you want, as loading 8 bytes might segfault if you're unaligned and run off the page, but it's a start. If you can add two bytes of padding to the end of the array, you can easily avoid that problem.
gcc and clang seem to optimize this well.
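To avoid the over-read entirely, the 6-byte variant suggested in the comments on this answer can be expressed the same way. A sketch (the function name and the zero-initialization are mine; the upper 16 bits stay zero, so two results compare equal exactly when the 6 address bytes match):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: unaligned 48-bit load without reading past the array.
   Copies exactly 6 bytes; the upper 16 bits of the result stay zero. */
static uint64_t unalignedload48(const char *packed)
{
    uint64_t buffer = 0;
    memcpy(&buffer, packed, 6);
    return buffer;
}
```

As noted in the comments, gcc may route this through a stack spill instead of keeping buffer in a register, so the generated assembly is worth checking.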

Gringo answered 24/4, 2015 at 20:21
memcpy will just do a byte copy here if packed is unaligned. This doesn't really solve his issue at all and probably slows things down.Doublecross
@MichaelDorgan: Good compilers (gcc, clang, icc,...) will optimize and inline this constant-sized memcpy() on high optimization levels.Gringo
And if unaligned, which it will be probably 75% of the time here? The compiler then must add a compare for alignment, which we already know and it does not, then branch to an appropriate copy.Doublecross
@MichaelDorgan: I've looked at the assembly generated by the compilers available on gcc.godbolt.org, and for all supported architectures it compiles to one or two loads. No branches.Gringo
Ok, just gave it a shot. It generates 4-byte reads, even for data that is not 4-byte aligned. That will fault on ARM. I took your function and passed in an address that was forced to be non-4-byte aligned, and it still read the data via ldr instructions, which won't work. Non-aligned data must be read as its alignment allows. In my case, passing in &addr[1] of a char array, byte reads are required.Doublecross
@MichaelDorgan: Huh. If the load actually sigbuses on ARM, that seems like a compiler bug to me. Can you confirm that it crashes?Gringo
On the (alas very old) ARM systems I work on, it will; on other systems, the MMU will "help" you by munging your data - repeating the bottom byte, adding in 0s, fun stuff like that. I've come to accept that ARM doesn't do unaligned memory access, so I help it as much as possible. Turns out this is a moot point though, as the OP's data has no alignment, so my code won't help either :)Doublecross
@MichaelDorgan: I stand by my solution. It is, to the best of my knowledge, valid C. Any compiler that produces crashing code for it is broken. If your old ARM systems are no longer supported by modern compilers, that's a different problem entirely.Gringo
Check this out for a bit more fun: #20926886 This is why when i saw the compiler you gave producing ldr instructions regardless of alignment, I became concerned.Doublecross
@MichaelDorgan: What does this have to do with anything? I'm not denying that some systems (for good reasons) only allow naturally aligned loads. I'm saying that compilers know about it, and should not generate crashing code from valid C.Gringo
You are correct - your code is 100% valid and I do not deny that at all. The output from that site, though, is flawed and gives the impression of more efficiency than is possible. It produces illegal opcodes for non-aligned addresses on memcpy, and that was the reason for the link. Bah. We have big sticks and have beat this dead horse in this now dying thread. I'd love to chat if needed, but I need to get back to work. I wish you good luck on all your future endeavours!Doublecross
@EOF: That's a pretty solid idea, but I can't get gcc or armcc to inline the memcmp call for my function, I think it can smell the problems coming:) Since I'm also staying away from all library modules (except those juicy stdint-defines), I think I'll have to go with the hybrid solution I described in my comment to the OP.Prickly
@dromtrund: Err, I'm not suggesting a memcmp(). You appear to have confused answers.Gringo
@EOF: err, yes. I've been using memcmp() as my reference use case when working with this, got it mixed up.Prickly
@dromtrund: So, did you actually try my solution, before accepting the answer that suggests using memcmp(), which isn't inlined?Gringo
@EOF: Yes, but gcc supplied with parameters -O3 -mcpu=cortex-m0 -mthumb -mabi=aapcs -mfloat-abi=soft only inlined it when working on word aligned memory. I guess that your solution does lead to the correct answer, which basically is "nothing can be done", but I marked @MichaelDorgan's answer as correct, since he tells me the same thing, but expands on how I could improve on the situation by handling corner cases specially.Prickly

While reading this SIMD vector class documentation, I found how to allocate variables, both statically and dynamically, with proper alignment: http://www.agner.org/optimize/vectorclass.pdf

Page 101

On Windows, write:

__declspec(align(16)) int mydata[1000];

In Unix-like systems, write:

int mydata[1000] __attribute__((aligned(16)));

Page 16

If you need an array of a size that is determined at runtime, then you will have a problem with alignment. Each vector needs to be stored at an address divisible by 16, 32 or 64 bytes, according to its size. The compiler can do this when defining a fixed-size array, as in the above example, but not necessarily with dynamic memory allocation. If you create an array of dynamic size by using new, malloc or an STL container, or any other method, then you may not get the proper alignment for vectors and the program will very likely crash when accessing a misaligned vector. The C++ standard says "It is implementation defined if new-expression, [...] support over-aligned types". Possible solutions are to use posix_memalign, _aligned_malloc, std::aligned_storage, std::align, etc. depending on what is supported by your compiler, but the method may not be portable to all platforms.
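Both cases can also be covered portably in C11 terms, as one of the comments notes. A sketch (the helper name alloc_aligned_ints is mine; aligned_alloc requires the size to be a multiple of the alignment, hence the rounding):

```c
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

/* Static case: C11 alignas guarantees 16-byte alignment. */
static alignas(16) int mydata[1000];

/* Dynamic case: aligned_alloc requires the size to be a multiple
   of the alignment, so round the byte count up to 16 first. */
static int *alloc_aligned_ints(size_t n)
{
    size_t bytes = n * sizeof(int);
    size_t rounded = (bytes + 15) & ~(size_t)15;
    return aligned_alloc(16, rounded);
}
```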

Inactivate answered 24/4, 2015 at 20:58
The problem is reading efficiently from the middle of an existing buffer, not allocating a new buffer of random alignment.Doublecross
Oh sorry, I misunderstood. But please don't down-vote my answer. It is still somewhat relevant to the topic.Inactivate
In C11, #include <stdalign.h> and alignas(16) int mydata[1000]; to do this portably.Quartile
