RGBA to ABGR: Inline ARM NEON asm for iOS/Xcode

This code (very similar code; I haven't tried exactly this code) compiles with the Android NDK, but not with Xcode/armv7+arm64/iOS

Errors in comments:

uint32_t *src;
uint32_t *dst;

#ifdef __ARM_NEON
__asm__ volatile(
    "vld1.32 {d0, d1}, [%[src]] \n" // error: Vector register expected
    "vrev32.8 q0, q0            \n" // error: Unrecognized instruction mnemonic
    "vst1.32 {d0, d1}, [%[dst]] \n" // error: Vector register expected
    :
    : [src]"r"(src), [dst]"r"(dst)
    : "d0", "d1"
    );
#endif

What's wrong with this code?

EDIT1:

I rewrote the code using intrinsics:

uint8x16_t x = vreinterpretq_u8_u32(vld1q_u32(src));
uint8x16_t y = vrev32q_u8(x);
vst1q_u32(dst, vreinterpretq_u32_u8(y));

After disassembling, I get the following, which is a variation I have already tried:

vld1.32 {d16, d17}, [r0]!
vrev32.8    q8, q8
vst1.32 {d16, d17}, [r1]!

So my code looks like this now, but gives the exact same errors:

__asm__ volatile("vld1.32 {d0, d1}, [%0]! \n"
                 "vrev32.8 q0, q0         \n"
                 "vst1.32 {d0, d1}, [%1]! \n"
                 :
                 : "r"(src), "r"(dst)
                 : "d0", "d1"
                 );

EDIT2:

Reading through the disassembly, I actually found a second version of the function. It turns out that arm64 uses a different instruction set. For example, the arm64 assembly uses rev32.16b v0, v0 instead. The whole function listing (which I can't make heads or tails of) is below:

_My_Function:
cmp     w2, #0
add w9, w2, #3
csel    w8, w9, w2, lt
cmp     w9, #7
b.lo    0x3f4
asr w9, w8, #2
ldr     x8, [x0]
mov  w9, w9
lsl x9, x9, #2
ldr q0, [x8], #16
rev32.16b   v0, v0
str q0, [x1], #16
sub x9, x9, #16
cbnz    x9, 0x3e0
ret
Whyalla answered 26/6, 2016 at 5:0 Comment(10)
Searching Xcode's build settings for the iOS target, I don't see anything NEON related. According to the following question, NEON is enabled by default: stackoverflow.com/questions/2376317/…Whyalla
Also, your code has a nasty bug: dst is an input operand, not an output operand, so the compiler assumes that register still holds the old value. So even when it compiles, you shouldn't expect it to work. See the inline assembly tag wiki for some links. IDK NEON asm syntax very well, but why don't you do the loads with intrinsics or uint64_t, and leave the load / store to the compiler? Or if possible, write it in a way that gives the compiler a choice of memory or register destination. (an "=rm" output constraint)Kept
Also, I think there are C intrinsics for NEON. That's usually the best option, because it lets the compiler know what's happening, so all the usual optimizations can apply. If you fix that src/dest bug, you could at least remove the volatile, but even then constant-propagation can't happen through inline asm after inlining, and there are various other ways that you'll get worse code. See gcc.gnu.org/wiki/DontUseInlineAsmKept
Having dst as an input operand is correct. The value of the pointer is not modified, only the data that it points to.Whyalla
Oh right. But it's still not correct without a "memory" clobber. A better and also correct way to do it is: [dst] "=m" (*dst). (Cast the pointer to something of the correct size, so the compiler knows exactly which bytes of memory are modified, and can assume everything else is unmodified.) Of course, if you ever use this function in a context where it would be useful to have the output in a register, you're shooting yourself in the foot by forcing an extra store-forwarding round trip (probably several cycles of latency).Kept
Reversing is just one trivial instruction, and is something you should be doing at the same time as other things unless you have multiple passes over the same image so you can save one reverse in a lot more work. Or I guess a copy-and-flip is fine if you need a copy anyway.Kept
You also probably don't need to force it to use q0. You could use a dummy output constraint to let the compiler pick scratch regs for you. (Although IDK how you'd go about deriving the right d0 / d1 for a given q register; There might not be any GNU C inline-asm syntax for that on ARM.)Kept
As for the actual compile error, a search for vld1.32 turns up this: vld1.32 {d0-d3}, [r2]. Note that the destination elements are delimited with a hyphen, not a comma. Perhaps this is how vector elements are specified? (Note: I don't speak NEON).Briard
@DavidWohlferd tried that too. I think that when it's two consecutive registers, it can be either one, but that "-" can additionally be used for ranges of registers.Whyalla
BTW, using memory operands for source / dest gives the compiler the option of using whatever addressing mode it wants, so definitely do that unless you want to constrain it to using the [reg] addressing mode as opposed to a post-increment addressing mode. That's still worse than intrinsics, but this is still a valid question for cases where inline asm is a good choice.Kept

As stated in the edits to the original question, it turned out that I needed a different assembly implementation for arm64 and armv7.

#ifdef __ARM_NEON
  #if __LP64__
asm volatile("ldr q0, [%0], #16  \n"
             "rev32.16b v0, v0   \n"
             "str q0, [%1], #16  \n"
             : "=r"(src), "=r"(dst)
             : "r"(src), "r"(dst)
             : "d0", "d1"
             );
  #else
asm volatile("vld1.32 {d0, d1}, [%0]! \n"
             "vrev32.8 q0, q0         \n"
             "vst1.32 {d0, d1}, [%1]! \n"
             : "=r"(src), "=r"(dst)
             : "r"(src), "r"(dst)
             : "d0", "d1"
             );
  #endif
#else

The intrinsics code that I posted in the original post generated surprisingly good assembly though, and also generated the arm64 version for me, so it may be a better idea to use intrinsics instead in the future.

Whyalla answered 26/6, 2016 at 16:29 Comment(14)
Comments are not for extended discussion; this conversation has been moved to chat.Lustrate
This code introduces a new bug that the code in the question didn't have. (These constraints don't require the src output operand to be in the same reg as the src input operand. See the last section of my answer). Or better: loop over arrays with "<>m" memory operands to let the compiler use auto-increment/decrement addressing modes. (Or even better, use intrinsics).Kept
There is no bug in this code. The assembly increments the value of src and dst. That's why they're in the list of output operands.Whyalla
Your constraints as written tell the compiler that your asm can produce the output values in a different register than the input values. You're assuming that the compiler will choose to keep the values in the same registers, instead of requiring it by using "+r"(src), "+r"(dst) (without separate input operands). The other way to write it is : "=r"(src), "=r"(dst) : "0"(src), "1"(dst) (the "0" and "1" matching constraints refer to operand numbers, not register numbers). I explained all this in my answer.Kept
That's a potential optimization problem, not a bug. The fact is, the inline asm takes the pointers as inputs, increments them, and then outputs them. There is nothing logically wrong with the code. So again, I will refer you to codereview.stackexchange.com.Whyalla
No, that's backwards. Your inline asm uses the output operand (%0) as an input, ignoring the src input operand (%2) based on the assumption that %0 and %2 are in the same register. See my previous comment for how to use constraints to express that requirement.Kept
When I finally got around to testing this code on a 64bit device, I noticed some strange behaviour, at which point I realized that I was using the wrong arrangement specifier on rev32 (8b instead of 16b). I just think it's ironic that after all this, the only real bug in my code is the one you didn't find ;)Whyalla
Note that as a loop this looks pretty horrible for performance - on many cores it may well spend more time stalled on register dependencies than actually processing anything - but unlike the intrinsic version (which you only need write once) has no chance of getting any better. If you want it to be fast, you'll want the compiler to be able to unroll the loop, use all the available registers, pipeline the processing instructions, schedule the loads effectively, avoid writeback addressing modes, and so on. If you don't need it to be fast, there's no need to go beyond plain C operators...Aalborg
@Aalborg I did some testing on my final code. The run times for 200 iterations on a 2048x2048 image were 2376ms, 1573ms, and 1468ms for C code, NEON intrinsics and inline NEON asm respectively. Compiled with loop unrolling on, and -Ofast. - pastie.org/10893450Whyalla
I added a test-case to my answer that demonstrates how to get wrong code from the asm statement in this answer. Your code happens to work in your program, but this SO answer claims to be correct as a stand-alone fragment, not dependent on any surrounding context. Also, I think @Aalborg was talking about in-order cores. You're probably testing on an out-of-order core, where it's not a problem to use the result of a load right away.Kept
Also, from what I can find, I thought rev32.16b would mean a swap of 16bit halves of each 32bit element. I don't have a convenient ARM setup to test on, but assuming your test harness works, I guess that's not the case? I only answered to correct your use of constraints, not to debug the NEON asm. But I got curious and tried to research that, too.Kept
@PeterCordes For 32bit arm, the '8' in vrev32.8 is the size of each chunk in bits, but in 64bit arm asm, the '16' in rev32.16b is the number of chunks, and 'b' is the size. So the options are, 8B, 16B, 4H, 8H, 2S, 4S and 2D, where 'B' ='bytes', 'H'='half-words', 'S'='single-words' and 'D'='double-words'. This information was surprisingly annoying to find.Whyalla
@bitwise: Ahh, that makes sense, thanks for clarifying. Apparently AArch64 introduced narrowing/widening instructions, so they need a syntax where the vector width isn't implicit. The page I linked described putting the suffixes on the operands. Is putting it on the mnemonic always allowed when it's the same for both operands?Kept
@PeterCordes Not quite sure yet. I found an SO post saying it was compiler dependent, but Xcode seems to accept both. I haven't run the code yet, but "ld1.16b {v0, v1, v2, v3}, [%0], #64" and "ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%0], #64" both compile.Whyalla

I have successfully published several iOS apps which make use of ARM assembly language, and inline code is the most frustrating way to do it. Apple still requires apps to support both ARM32 and ARM64 devices. Since the code will be built as both ARM32 and ARM64 by default (unless you changed the compile options), you need to design code which will successfully compile in both modes. As you noticed, ARM64 uses a completely different mnemonic format and register model. There are two simple ways around this:

1) Write your code using NEON intrinsics. ARM specified that the original ARM32 intrinsics would remain mostly unchanged for ARMv8 targets and therefore can be compiled to both ARM32 and ARM64 code. This is the safest/easiest option.

2) Write inline code or a separate '.S' module for your assembly language code. To deal with the 2 compile modes, use "#ifdef __arm64__" and "#ifdef __arm__" to distinguish between the two instruction sets.
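
A bare skeleton of option 2 might look something like this (an untested sketch, reusing the byte swap from the question; note the "memory" clobber, which is needed here because the pointers are plain register inputs):

#include <stdint.h>

// Sketch only: one source file, two instruction sets, selected per compile slice.
// __arm64__ and __arm__ are the macros Apple's clang predefines for each architecture.
static inline void reverse16bytes(uint32_t *dst, const uint32_t *src)
{
#if defined(__arm64__)
    // A64 syntax; Apple's assembler also accepts the "rev32.16b v0, v0" spelling.
    asm ("ldr    q0, [%1]       \n\t"
         "rev32  v0.16b, v0.16b \n\t"
         "str    q0, [%0]       \n\t"
         : : "r"(dst), "r"(src) : "v0", "memory");
#elif defined(__arm__)
    // A32/Thumb-2 NEON syntax.
    asm ("vld1.32  {d0, d1}, [%1] \n\t"
         "vrev32.8 q0, q0         \n\t"
         "vst1.32  {d0, d1}, [%0] \n\t"
         : : "r"(dst), "r"(src) : "d0", "d1", "memory");
#else
    // Plain C fallback (e.g. for the simulator slice): bswap each 32-bit word.
    for (int i = 0; i < 4; i++) dst[i] = __builtin_bswap32(src[i]);
#endif
}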

Realpolitik answered 26/6, 2016 at 10:16 Comment(0)

Intrinsics are apparently the only way to use the same code for NEON between ARM (32-bit) and AArch64.

There are many reasons not to use inline asm in the first place: https://gcc.gnu.org/wiki/DontUseInlineAsm

Unfortunately, current compilers often do a very poor job with ARM / AArch64 intrinsics, which is surprising because they do an excellent job optimizing x86 SSE/AVX intrinsics and PowerPC Altivec. They often do ok in simple cases, but can easily introduce extra store/reloads.

In theory with intrinsics, you should get good asm output, and it lets the compiler schedule instructions between the vector load and store, which will help most on an in-order core. (Or you could write a whole loop in inline asm that you schedule by hand.)
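
For reference, the whole operation with intrinsics is just the three-line snippet from the question wrapped in a loop; a minimal sketch (assuming len, in uint32_t elements, is a multiple of 4) looks like this:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

// Reverse the bytes within each 32-bit word (RGBA <-> ABGR), 16 bytes per iteration.
// Assumes len is a multiple of 4; the same source compiles for ARMv7 NEON and AArch64.
void bytereverse32_intrin(uint32_t *dst, const uint32_t *src, size_t len)
{
    for (size_t i = 0; i < len; i += 4) {
        uint8x16_t v = vreinterpretq_u8_u32(vld1q_u32(src + i));
        vst1q_u32(dst + i, vreinterpretq_u32_u8(vrev32q_u8(v)));
    }
}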

ARM's official documentation:

Although it is technically possible to optimize NEON assembly by hand, this can be very difficult because the pipeline and memory access timings have complex inter-dependencies. Instead of hand assembly, ARM strongly recommends the use of intrinsics


If you do use inline asm anyway, avoid future pain by getting it right.

It's easy to write inline asm that happens to work, but isn't safe wrt. future source changes (and sometimes to future compiler optimizations), because the constraints don't accurately describe what the asm does. The symptoms will be weird, and this kind of context-sensitive bug could even lead to unit tests passing but wrong code in the main program (or vice versa).

A latent bug that doesn't cause any defects in the current build is still a bug, and is a really Bad Thing in a Stack Overflow answer that can be copied as an example into other contexts. @bitwise's code in the question and self-answer both have bugs like this.

The inline asm in the question isn't safe, because it modifies memory without telling the compiler about it. This probably only manifests in a loop that reads from dst in C both before and after the inline asm. However, it's easy to fix, and doing so lets us drop the volatile and avoids the need for the "memory" clobber (which the original is missing anyway), so the compiler can optimize better (but still with significant limitations compared to intrinsics).

volatile should prevent reordering relative to memory accesses, so it may not happen outside of fairly contrived circumstances. But that's hard to prove.


The following compiles for ARM and AArch64 (it might fail if compiling for ILP32 on AArch64, though, I forgot about that possibility). Using -funroll-loops leads to gcc choosing different addressing modes, and not forcing the dst++; src++; to happen between every inline asm statement. (This maybe wouldn't be possible with asm volatile).

I used memory operands so the compiler knows that memory is an input and an output, and to give the compiler the option of using auto-increment / decrement addressing modes. This is better than anything you can do with a pointer in a register as an input operand, because it allows loop unrolling to work.

This still doesn't let the compiler schedule the store many instructions after the corresponding load to software pipeline the loop for in-order cores, so it's probably only going to perform decently on out-of-order ARM chips.

void bytereverse32(uint32_t *dst32, const uint32_t *src32, size_t len)
{
    typedef struct { uint64_t low, high; } vec128_t;
    const vec128_t *src = (const vec128_t*) src32;
    vec128_t *dst = (vec128_t*) dst32;

    // with old gcc, this gets gcc to use a pointer compare as the loop condition
    // instead of incrementing a loop counter
    const vec128_t *src_endp = src + len/(sizeof(vec128_t)/sizeof(uint32_t));
    // len is in units of 4-byte chunks

    while (src < src_endp) {

        #if defined(__ARM_NEON__) || defined(__ARM_NEON)
          #if __LP64__   // FIXME: doesn't account for ILP32 in 64-bit mode
        // aarch64 registers: s0 and d0 are subsets of q0 (128bit), synonym for v0
        asm ("ldr        q0, %[src] \n\t"
             "rev32.16b  v0, v0 \n\t"
             "str        q0, %[dst]  \n\t"
                     : [dst] "=<>m"(*dst)  // auto-increment/decrement or "normal" memory operand
                     : [src] "<>m" (*src)
                     : "q0", "v0"
                     );
          #else
        // arm32 registers: 128bit q0 is made of d0:d1, or s0:s3
        asm ("vld1.32   {d0, d1}, %[src] \n\t"
             "vrev32.8   q0, q0          \n\t"  // reverse 8 bit elements inside 32bit words
             "vst1.32   {d0, d1}, %[dst] \n"
                     : [dst] "=<>m"(*dst)
                     : [src] "<>m"(*src)
                     : "d0", "d1"
                     );
          #endif
        #else
         #error "no NEON"
        #endif

      // increment pointers by 16 bytes
        src++;   // The inline asm doesn't modify the pointers.
        dst++;   // of course, these increments may compile to a post-increment addressing mode
                 // this way has the advantage of letting the compiler unroll or whatever

     }
}

This compiles (on the Godbolt compiler explorer with gcc 4.8), but I don't know if it assembles, let alone works correctly. Still, I'm confident these operand constraints are correct. Constraints are basically the same across all architectures, and I understand them much better than I know NEON.

Anyway, the inner loop on ARM (32bit) with gcc 4.8 -O3, without -funroll-loops is:

.L4:
    vld1.32   {d0, d1}, [r1], #16   @ MEM[(const struct vec128_t *)src32_17]
    vrev32.8   q0, q0          
    vst1.32   {d0, d1}, [r0], #16   @ MEM[(struct vec128_t *)dst32_18]

    cmp     r3, r1    @ src_endp, src32
    bhi     .L4       @,

The register constraint bug

The code in the OP's self-answer has another bug: the input pointer operands use separate "r" constraints. This leads to breakage if the compiler wants to keep the old value around, and chooses an input register for src that isn't the same as the output register.

If you want to take pointer inputs in registers and choose your own addressing modes, you can use "0" matching-constraints, or you can use "+r" read-write output operands.

You will also need a "memory" clobber or dummy memory input/output operands (i.e. that tell the compiler which bytes of memory are read and written, even if you don't use that operand number in the inline asm).

See Looping over arrays with inline assembly for a discussion of the advantages and disadvantages of using r constraints for looping over an array on x86. ARM has auto-increment addressing modes, which appear to produce better code than anything you could get with manual choice of addressing modes. It lets gcc use different addressing modes in different copies of the block when loop-unrolling. "r" (pointer) constraints appear to have no advantage, so I won't go into detail about how to use a dummy input / output constraint to avoid needing a "memory" clobber.


Test-case that generates wrong code with @bitwise's asm statement:

// return a value as a way to tell the compiler it's needed after
uint32_t* unsafe_asm(uint32_t *dst, const uint32_t *src)
{
  uint32_t *orig_dst = dst;

  uint32_t initial_dst0val = orig_dst[0];
#ifdef __ARM_NEON
  #if __LP64__
asm volatile("ldr q0, [%0], #16   # unused src input was %2\n\t"
             "rev32.16b v0, v0   \n\t"
             "str q0, [%1], #16   # unused dst input was %3\n"
             : "=r"(src), "=r"(dst)
             : "r"(src), "r"(dst)
             : "d0", "d1"  // ,"memory"
               // clobbers don't include v0?
            );
  #else
asm volatile("vld1.32 {d0, d1}, [%0]!  # unused src input was %2\n\t"
             "vrev32.8 q0, q0         \n\t"
             "vst1.32 {d0, d1}, [%1]!  # unused dst input was %3\n"
             : "=r"(src), "=r"(dst)
             : "r"(src), "r"(dst)
             : "d0", "d1" // ,"memory"
             );
  #endif
#else
    #error "No NEON/AdvSIMD"
#endif

  uint32_t final_dst0val = orig_dst[0];
  // gcc assumes the asm doesn't change orig_dst[0], so it only does one load (after the asm)
  // and uses it for final and initial
  // uncomment the memory clobber, or use a dummy output operand, to avoid this.
  // pointer + initial+final compiles to LSL 3 to multiply by 8 = 2 * sizeof(uint32_t)


  // using orig_dst after the inline asm makes the compiler choose different registers for the
  // "=r"(dst) output operand and the "r"(dst) input operand, since the asm constraints
  // advertise this non-destructive capability.
  return orig_dst + final_dst0val + initial_dst0val;
}

This compiles to (AArch64 gcc4.8 -O3):

    ldr q0, [x1], #16   # unused src input was x1   // src, src
    rev32.16b v0, v0   
    str q0, [x2], #16   # unused dst input was x0   // dst, dst

    ldr     w1, [x0]  // D.2576, *dst_1(D)
    add     x0, x0, x1, lsl 3 //, dst, D.2576,
    ret

The store uses x2 (an uninitialized register, since this function only takes 2 args). The "=r"(dst) output (%1) picked x2, while the "r"(dst) input (%3 which is used only in a comment) picked x0.

final_dst0val + initial_dst0val compiles to 2x final_dst0val, because we lied to the compiler and told it that memory wasn't modified. So instead of reading the same memory before and after the inline asm statement, it just reads after and left-shifts by one extra position when adding to the pointer. (The return value exists only to use the values so they're not optimized away).

We can fix both problems by correcting the constraints: using "+r" for the pointers and adding a "memory" clobber. (A dummy output would also work, and might hurt optimization less.) I didn't bother since this appears to have no advantage over the memory-operand version above.
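
That fix looks something like this for the AArch64 statement (a sketch with the same asm body; the register clobbers are left as in the test case above):

asm volatile("ldr q0, [%0], #16  \n\t"
             "rev32.16b v0, v0   \n\t"
             "str q0, [%1], #16  \n"
             : "+r"(src), "+r"(dst)   // read-write: the asm consumes and updates both pointers
             :                        // no separate input operands needed
             : "d0", "d1", "memory"   // "memory": the asm reads *src and writes *dst
             );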

With those changes, we get

safe_register_pointer_asm:
    ldr     w3, [x0]  //, *dst_1(D)
    mov     x2, x0    // dst, dst    ### These 2 insns are new

    ldr q0, [x1], #16       // src
    rev32.16b v0, v0   
    str q0, [x2], #16       // dst

    ldr     w1, [x0]  // D.2597, *dst_1(D)
    add     x3, x1, x3, uxtw  // D.2597, D.2597, initial_dst0val   ## And this is new, to add the before and after loads
    add     x0, x0, x3, lsl 2 //, dst, D.2597,
    ret
Kept answered 27/6, 2016 at 6:43 Comment(0)