I'm doing a project on an ARM Cortex-M0, which does not support unaligned (i.e. not 4-byte-aligned) memory access, and I'm trying to optimize the speed of operations on unaligned data.
I'm storing Bluetooth Low Energy advertising addresses (48-bit) as 6-byte arrays in some packed structs acting as packet buffers. Because of the packing, the BLE addresses do not necessarily start at a word-aligned address, and I'm running into some complications when optimizing my access functions for these addresses.
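For context, the buffers look something like this (a sketch with made-up field names; the point is that the packed attribute lets addr land on any byte offset):

#include <stdint.h>

typedef struct __attribute__((packed))
{
    uint8_t header;
    uint8_t length;
    uint8_t addr[6];     /* 48-bit address at offset 2: not word aligned */
    uint8_t payload[31];
} packet_buffer_t;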
The first and most obvious approach is a for loop operating on each byte of the array individually. Checking whether two addresses are equal could, for instance, be done like this:
uint8_t ble_adv_addr_is_equal(uint8_t* addr1, uint8_t* addr2)
{
    for (uint32_t i = 0; i < 6; ++i)
    {
        if (addr1[i] != addr2[i])
            return 0;
    }
    return 1;
}
I'm doing a lot of comparisons in my project, and I wanted to see if I could squeeze some more speed out of this function. I realised that for aligned addresses, I could cast the pointers to uint64_t*, dereference them, and compare with 48-bit masks applied, i.e.

(*((uint64_t*)&addr1[0]) & 0xFFFFFFFFFFFF) == (*((uint64_t*)&addr2[0]) & 0xFFFFFFFFFFFF)

Similar operations could be done for writing, and this works well for the aligned case. However, since my addresses aren't always word aligned (or even half-word aligned), I would have to do some extra tricks to make it work in general.
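Wrapped up as a function, the aligned-only version would look something like this (a sketch: the function name is mine, it assumes both pointers really are word aligned, and note that it reads two bytes past the end of each 6-byte array):

static uint8_t ble_adv_addr_is_equal_aligned(const uint8_t* addr1, const uint8_t* addr2)
{
    /* 64-bit loads on the M0 need only word alignment (two LDRs) */
    uint64_t a = *((const uint64_t*)addr1);
    uint64_t b = *((const uint64_t*)addr2);
    /* Mask off the 16 bits beyond the 48-bit address */
    return ((a ^ b) & 0xFFFFFFFFFFFFull) == 0;
}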
First off, I came up with this unoptimized nightmare of a preprocessor macro:
#define ADDR_ALIGNED(_addr) (uint64_t)(((*((uint64_t*)(((uint32_t)_addr) & ~0x03)) >> (8*(((uint32_t)_addr) & 0x03))) & 0x000000FFFFFFFF)\
                                     | (((*((uint64_t*)(((uint32_t)_addr+4) & ~0x03))) << (32-8*(((uint32_t)_addr) & 0x03)))) & 0x00FFFF00000000)
It basically shifts the entire address so that it starts at the previous word-aligned memory position, regardless of offset. For instance:
0 1 2 3
|-------|-------|-------|-------|
|.......|.......|.......|<ADDR0>|
|<ADDR1>|<ADDR2>|<ADDR3>|<ADDR4>|
|<ADDR5>|.......|.......|.......|
becomes
0 1 2 3
|-------|-------|-------|-------|
|<ADDR0>|<ADDR1>|<ADDR2>|<ADDR3>|
|<ADDR4>|<ADDR5>|.......|.......|
|.......|.......|.......|.......|
and I can safely do a 64-bit comparison of two addresses, regardless of their actual alignment:
ADDR_ALIGNED(addr1) == ADDR_ALIGNED(addr2)
Neat! But this operation compiles to 71 lines of assembly with the ARM MDK, compared to 53 for the comparison in a simple for loop (I'm just going to disregard the additional time spent on branch instructions here), and ~30 when the loop is unrolled. Also, it doesn't work for writes, as the alignment only happens in the registers, not in memory. Unaligning again would require a similar operation, and the whole approach generally seems to suck.
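For reference, the other obvious approach is to memcpy() both addresses into aligned temporaries and compare those; a sketch:

#include <stdint.h>
#include <string.h>

static uint8_t ble_adv_addr_is_equal_copy(const uint8_t* addr1, const uint8_t* addr2)
{
    uint64_t a = 0;          /* upper two bytes stay zero, so no masking needed */
    uint64_t b = 0;
    memcpy(&a, addr1, 6);    /* bytewise copy, alignment doesn't matter */
    memcpy(&b, addr2, 6);
    return a == b;
}

This also works for writes (memcpy back in the other direction), but it pays for the copies on every call, so it doesn't obviously beat the plain loop either.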
Is an unrolled for loop working on each byte individually really the fastest solution for cases like this? Does anyone have experience with similar scenarios, and feel like sharing some of their wizardry here?
Comments:

uint64_t? – Rajiv

Consider using __attribute__((aligned(2))) struct BLEA { uint8_t data[6]; }; to start with, and making a copy into it if needed. – Womble

Look at the memcpy() and memset() implementation for the Cortex-M to see how the head/tail casing is done. I think a switch (unaligned & 3) will be efficient, but the code will not be small. – Petua