ARM Cortex M7 unaligned access and memcpy

I am compiling this code for a Cortex M7 using GCC:

#include <stdint.h>
#include <string.h>

// copy manually
void write_test_plain(uint8_t * ptr, uint32_t value)
{
    *ptr++ = (uint8_t)(value);
    *ptr++ = (uint8_t)(value >> 8);
    *ptr++ = (uint8_t)(value >> 16);
    *ptr++ = (uint8_t)(value >> 24);
}

// copy using memcpy
void write_test_memcpy(uint8_t * ptr, uint32_t value)
{
    void *px = (void*)&value;
    memcpy(ptr, px, 4);
}

int main(void) 
{
    extern uint8_t data[];
    extern uint32_t value;

    // I added some offsets to data to
    // make sure the compiler cannot
    // assume it's aligned in memory

    write_test_plain(data + 2, value);
    __asm volatile("": : :"memory"); // just to split inlined calls
    write_test_memcpy(data + 5, value);

    ... do something with data ...
}

And I get the following Thumb-2 assembly with -O2:

// write_test_plain(data + 2, value);
800031c:    2478        movs    r4, #120 ; 0x78
800031e:    2056        movs    r0, #86  ; 0x56
8000320:    2134        movs    r1, #52  ; 0x34
8000322:    2212        movs    r2, #18  ; 0x12
8000324:    759c        strb    r4, [r3, #22]
8000326:    75d8        strb    r0, [r3, #23]
8000328:    7619        strb    r1, [r3, #24]
800032a:    765a        strb    r2, [r3, #25]

// write_test_memcpy(data + 5, value);
800032c:    4ac4        ldr r2, [pc, #784]  ; (8000640 <main+0x3a0>)
800032e:    923b        str r2, [sp, #236]  ; 0xec
8000330:    983b        ldr r0, [sp, #236]  ; 0xec
8000332:    f8c3 0019   str.w   r0, [r3, #25]

Can someone explain how the memcpy version works? This looks like an inlined 32-bit store to the destination address, but isn't that a problem, since data + 5 is almost certainly not aligned to a 4-byte boundary?

Is this perhaps some optimization which happens due to some undefined behavior in my source?

Eller answered 14/6, 2018 at 22:35 Comment(9)
what C library are you using?Regorge
Cortex-M7 should support unaligned writes with the STR instruction by default. This can be changed at runtime (I forgot what the flag is called). You can also try using uint64_t, as STRD should trigger a fault when misaligned.Leven
@TurboJ: thanks, Johan mentioned the flag below, but do you know of a reason why someone would use this flag if the controller supports unaligned access?Eller
To be able to catch the (slightly slower) unaligned accesses during development. Also for compatibility with ARMv6-M, i.e. Cortex-M0.Leven
The flag is SCB.CCR.UNALIGN_TRP; note also that strongly ordered and device memories forbid unaligned accesses regardless of this flag, and that is usually where peripheral registers are mapped.Olimpia
Unaligned accesses to strongly ordered/device memories will trigger a hard fault even if SCB.CCR.UNALIGN_TRP is unset. I found that information in a Keil article.Erudition
@RobertSexton: thanks, can you provide a link for future reference? I don't remember this anymore, I know that usually it wasn't failing, so this depends on the type of memory, right?Eller
Yes, it depends on the type of memory, as described by @Olimpia above.Erudition
One thing to be aware of is that a modern compiler will try to emit aligned load/stores if it can tell that the source and destination are aligned. Otherwise you'll get the code in the example where the compiler is emitting 'safe' code. There are optimized memcpys out there that do this automatically, where they read/write the unaligned bytes before reading/writing in the native machine size. Aligned accesses are much faster. Try to make them possible.Erudition
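
The "copy the unaligned bytes first, then copy in the native machine size" idea from the last comment can be sketched roughly as below. This is not the implementation of any particular C library: copy_align_dst is a made-up name, and __builtin_assume_aligned is a GCC/Clang builtin, so the sketch assumes one of those compilers. Only the destination is aligned here; the source may still be unaligned, which is why the load goes through a fixed-size memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes, aligning the destination first. */
void *copy_align_dst(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: single bytes until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: one 32-bit word per iteration. d is aligned at this point,
     * s may not be, so the load is left for the compiler to choose. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(__builtin_assume_aligned(d, 4), &w, sizeof w);
        d += 4;
        s += 4;
        n -= 4;
    }

    /* Tail: whatever is left after the last full word. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Whether this beats the library memcpy depends on the toolchain and the memory involved, so it is worth measuring before relying on it.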

For Cortex-M processors unaligned loads and stores of bytes, half-words, and words are usually allowed and most compilers use this when generating code unless they are instructed not to. If you want to prevent gcc from assuming the unaligned accesses are OK, you can use the -mno-unaligned-access compiler flag.

If you specify this flag, gcc will no longer inline the call to memcpy, and write_test_memcpy looks like this:

write_test_memcpy(unsigned char*, unsigned long):
  push {lr}
  sub sp, sp, #12
  movs r2, #4
  add r3, sp, #8
  str r1, [r3, #-4]!
  mov r1, r3
  bl memcpy
  add sp, sp, #12
  ldr pc, [sp], #4
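
On the undefined-behavior worry from the question: as far as I can tell there is none in the memcpy version, because memcpy places no alignment requirement on its arguments. The compiler is only exploiting the fact that the target tolerates an unaligned STR. A minimal sketch of the usual helper functions follows; the names store_u32_native, store_u32_le and load_u32_native are my own, not from any library.

```c
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value at any address in the CPU's native byte order.
 * The compiler may turn this into one STR or into byte stores,
 * depending on what the target allows. */
static inline void store_u32_native(uint8_t *dst, uint32_t value)
{
    memcpy(dst, &value, sizeof value);
}

/* Load a 32-bit value from any address in the CPU's native byte order. */
static inline uint32_t load_u32_native(const uint8_t *src)
{
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Store explicitly little-endian, one byte at a time. This matches
 * store_u32_native only on a little-endian target. */
static inline void store_u32_le(uint8_t *dst, uint32_t value)
{
    dst[0] = (uint8_t)(value);
    dst[1] = (uint8_t)(value >> 8);
    dst[2] = (uint8_t)(value >> 16);
    dst[3] = (uint8_t)(value >> 24);
}
```

What would be undefined is casting the unaligned pointer to uint32_t * and dereferencing it, which is exactly what these helpers avoid.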
Detrain answered 15/6, 2018 at 12:20 Comment(4)
Thanks! I found this chapter in the ARM Cortex-M7 User Guide, and it does state that STR and LDR can perform unaligned accesses. However, I don't understand the remark that "Unaligned accesses are usually slower than aligned accesses": nothing in the manual indicates that LDR and STR can have cycle counts other than 1 and 2 cycles, respectively. Do you know what is meant by that, or where I could find information about the cycle counts in those cases?Eller
"For Cortex-M processors unaligned loads and stores of bytes, half-words, and words are usually allowed" that's blatantly wrong. It depends on the configuration and is strongly deprecated, especially for data on the stack. It is also much less performant where enabled.Pacific
@Olaf: thanks, but I couldn't find the information on the performance impact anywhere in the ARM docs, these instructions (LDR, STR) have their exact cycle counts stated in the manual and there are no mentions of any different cycle counts that I could find.Eller
The LDR/STR descriptions clearly state that the cycles for the memory accesses have to be added. Plus, non-aligned reads/writes can't use LDRD/STRD/LDM/STM, likely use a cache line only partially, and take up to 3 accesses per word. And that's only if unaligned accesses are enabled at all! Which is often not the case, simply because if you follow the strict aliasing rule, clean code does not often have to perform cross-alignment copies. Finally: in embedded code you should not copy much, simply for overall performance reasons.Pacific

Cortex-M3, M4, M7 and M33 support unaligned access. Cortex-M0, M0+ and M23 (the ARMv6-M and ARMv8-M Baseline cores) do not.

However, you can disable unaligned access support on the Cortex-M7 by setting the UNALIGN_TRP bit in the Configuration and Control Register (CCR); any unaligned access will then generate a UsageFault.

From the compiler's perspective, the default is that the generated assembly code may perform unaligned accesses, unless you disable this with the compiler flag -mno-unaligned-access.
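
Setting the trap bit could look roughly like this. It is a sketch, not vendor code: set_unaligned_trap, SCB_CCR_ADDR and CCR_UNALIGN_TRP are names I made up, while the register address 0xE000ED14 and bit position 3 are the architectural ones for ARMv7-M. The register pointer is passed in as a parameter so the function can also be tried on a host machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCB_CCR_ADDR     0xE000ED14u  /* Configuration and Control Register */
#define CCR_UNALIGN_TRP  (1u << 3)    /* trap on unaligned word/halfword access */

/* Hypothetical helper: turn the unaligned-access trap on or off. */
static void set_unaligned_trap(volatile uint32_t *ccr, int enable)
{
    assert(ccr != NULL);
    if (enable) {
        *ccr |= CCR_UNALIGN_TRP;             /* unaligned LDR/STR now fault */
    } else {
        *ccr &= ~(uint32_t)CCR_UNALIGN_TRP;  /* back to the reset behavior */
    }
}
```

On the target this would be called from privileged code as set_unaligned_trap((volatile uint32_t *)SCB_CCR_ADDR, 1). With the CMSIS headers the equivalent one-liner is SCB->CCR |= SCB_CCR_UNALIGN_TRP_Msk; following it with __DSB() and __ISB() is probably a sensible precaution.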

Choline answered 17/2, 2021 at 15:24 Comment(3)
Thanks. Actually we had a problem once even with unaligned access allowed, when the read was at an address boundary between some parts of memory. I wasn't directly involved so I am not sure about specifics, but IIRC the last 16 bits in the 32-bit word were zero unless we forced the compiler to read as individual bytes.Eller
@lou was that on an ST MCU? They often have two banks of SRAM, where this behavior would make sense, I think.Alfredalfreda
@Alfredalfreda sorry, I haven't visited SO for a while. I don't recall the details anymore, but there was an issue where a 32-bit value was unaligned and sat exactly on the boundary between two address ranges. So the MCU would read the first half from the first memory module, and the rest were zeros.Eller
