Intrinsics are apparently the only way to use the same code for NEON between ARM (32-bit) and AArch64.
There are many reasons not to use inline-assembly: https://gcc.gnu.org/wiki/DontUseInlineAsm
Unfortunately, current compilers often do a very poor job with ARM / AArch64 intrinsics, which is surprising because they do an excellent job optimizing x86 SSE/AVX intrinsics and PowerPC Altivec. They often do ok in simple cases, but can easily introduce extra store/reloads.
In theory with intrinsics, you should get good asm output, and it lets the compiler schedule instructions between the vector load and store, which will help most on an in-order core. (Or you could write a whole loop in inline asm that you schedule by hand.)
ARM's official documentation:
Although it is technically possible to optimize NEON assembly by hand, this can be very difficult because the pipeline and memory access timings have complex inter-dependencies. Instead of hand assembly, ARM strongly recommends the use of intrinsics
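For comparison with the inline-asm versions discussed below, a minimal intrinsics sketch of the same byte-reversal loop could look like this. The name bytereverse32_intrin is made up for illustration, and the scalar tail / fallback loop is an addition so that the same source also builds without NEON. vld1q_u8, vrev32q_u8 and vst1q_u8 are standard arm_neon.h intrinsics available on both 32-bit ARM and AArch64. How good the generated asm is depends on the compiler, as noted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the question): byte-reverse each 32-bit word.
// len is in units of uint32_t, as in the inline-asm version.
void bytereverse32_intrin(uint32_t *dst, const uint32_t *src, size_t len)
{
    assert(len == 0 || (dst != NULL && src != NULL));
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // 16 bytes (four words) per iteration; the compiler chooses registers,
    // addressing modes and instruction scheduling.
    for (; i + 4 <= len; i += 4) {
        uint8x16_t v = vld1q_u8((const uint8_t *)(src + i));
        v = vrev32q_u8(v);                 // reverse the bytes inside each 32-bit word
        vst1q_u8((uint8_t *)(dst + i), v);
    }
#endif
    // Scalar tail; this is also the whole loop when NEON is not available.
    for (; i < len; i++) {
        uint32_t x = src[i];
        dst[i] = (x >> 24) | ((x >> 8) & 0x0000ff00u)
               | ((x << 8) & 0x00ff0000u) | (x << 24);
    }
}
```

Unlike the inline-asm loop, this sketch also handles a len that isn't a multiple of 4.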
If you do use inline asm anyway, avoid future pain by getting it right.
It's easy to write inline asm that happens to work, but isn't safe with respect to future source changes (and sometimes future compiler optimizations), because the constraints don't accurately describe what the asm does. The symptoms will be weird, and this kind of context-sensitive bug could even lead to unit tests passing but wrong code in the main program (or vice versa).
A latent bug that doesn't cause any defects in the current build is still a bug, and is a really Bad Thing in a Stackoverflow answer that can be copied as an example into other contexts. @bitwise's code in the question and self-answer both have bugs like this.
The inline asm in the question isn't safe, because it modifies memory without telling the compiler about it. This probably only manifests in a loop that reads from dst in C both before and after the inline asm. However, it's easy to fix, and doing so lets us drop the volatile (and the "memory" clobber, which it's missing) so the compiler can optimize better (but still with significant limitations compared to intrinsics).
volatile should prevent reordering relative to memory accesses, so it may not happen outside of fairly contrived circumstances. But that's hard to prove.
The following compiles for ARM and AArch64 (it might fail if compiling for ILP32 on AArch64, though, I forgot about that possibility). Using -funroll-loops leads to gcc choosing different addressing modes, and not forcing the dst++; src++; to happen between every inline asm statement. (This maybe wouldn't be possible with asm volatile.)
I used memory operands so the compiler knows that memory is both an input and an output, and to give the compiler the option of using auto-increment / decrement addressing modes. This is better than anything you can do with a pointer in a register as an input operand, because it allows loop unrolling to work.
This still doesn't let the compiler schedule the store many instructions after the corresponding load to software pipeline the loop for in-order cores, so it's probably only going to perform decently on out-of-order ARM chips.
void bytereverse32(uint32_t *dst32, const uint32_t *src32, size_t len)
{
    typedef struct { uint64_t low, high; } vec128_t;
    const vec128_t *src = (const vec128_t*) src32;
    vec128_t *dst = (vec128_t*) dst32;
    // with old gcc, this gets gcc to use a pointer compare as the loop condition
    // instead of incrementing a loop counter
    const vec128_t *src_endp = src + len/(sizeof(vec128_t)/sizeof(uint32_t));
    // len is in units of 4-byte chunks
    while (src < src_endp) {
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#if __LP64__ // FIXME: doesn't account for ILP32 in 64-bit mode
        // aarch64 registers: s0 and d0 are subsets of q0 (128bit), synonym for v0
        asm ("ldr q0, %[src] \n\t"
             "rev32.16b v0, v0 \n\t"
             "str q0, %[dst] \n\t"
             : [dst] "=<>m"(*dst) // auto-increment/decrement or "normal" memory operand
             : [src] "<>m" (*src)
             : "q0", "v0"
             );
#else
        // arm32 registers: 128bit q0 is made of d0:d1, or s0:s3
        asm ("vld1.32 {d0, d1}, %[src] \n\t"
             "vrev32.8 q0, q0 \n\t" // reverse 8 bit elements inside 32bit words
             "vst1.32 {d0, d1}, %[dst] \n"
             : [dst] "=<>m"(*dst)
             : [src] "<>m"(*src)
             : "d0", "d1"
             );
#endif
#else
#error "no NEON"
#endif
        // increment pointers by 16 bytes
        src++; // The inline asm doesn't modify the pointers.
        dst++; // of course, these increments may compile to a post-increment addressing mode
               // this way has the advantage of letting the compiler unroll or whatever
    }
}
This compiles (on the Godbolt compiler explorer with gcc 4.8), but I don't know if it assembles, let alone works correctly. Still, I'm confident these operand constraints are correct. Constraints are basically the same across all architectures, and I understand them much better than I know NEON.
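Since the asm above was only checked for compiling, one way to check behaviour on real hardware is to compare it against a plain-C reference. This is a minimal sketch, and the names bytereverse32_fn, bytereverse32_ref and check_bytereverse32 are made up for illustration. The buffer length is a multiple of 4 words because the asm loop ignores any tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical function-pointer type matching bytereverse32()'s signature.
typedef void (*bytereverse32_fn)(uint32_t *dst, const uint32_t *src, size_t len);

// Plain-C reference: byte-reverse each 32-bit word.
static void bytereverse32_ref(uint32_t *dst, const uint32_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint32_t x = src[i];
        dst[i] = (x >> 24) | ((x >> 8) & 0x0000ff00u)
               | ((x << 8) & 0x00ff0000u) | (x << 24);
    }
}

// Returns 1 if fn matches the reference on a fixed 16-word pattern, else 0.
int check_bytereverse32(bytereverse32_fn fn)
{
    enum { N = 16 };                  // multiple of 4: whole 16-byte blocks only
    uint32_t src[N], got[N], want[N];
    assert(fn != NULL);
    for (size_t i = 0; i < N; i++) {
        src[i] = 0x01020304u * (uint32_t)(i + 1);   // distinct bytes in each word
        got[i] = 0;
        want[i] = 0;
    }
    bytereverse32_ref(want, src, N);
    fn(got, src, N);
    for (size_t i = 0; i < N; i++) {
        if (got[i] != want[i])
            return 0;
    }
    return 1;
}
```

Calling check_bytereverse32(bytereverse32) on an ARM or AArch64 target would exercise the asm. A pass only shows the code works in this one context; as noted earlier, wrong constraints can still break after inlining into other code.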
Anyway, the inner loop on ARM (32bit) with gcc 4.8 -O3, without -funroll-loops, is:
.L4:
vld1.32 {d0, d1}, [r1], #16 @ MEM[(const struct vec128_t *)src32_17]
vrev32.8 q0, q0
vst1.32 {d0, d1}, [r0], #16 @ MEM[(struct vec128_t *)dst32_18]
cmp r3, r1 @ src_endp, src32
bhi .L4 @,
The register constraint bug
The code in the OP's self-answer has another bug: the input pointer operands use separate "r" constraints. This leads to breakage if the compiler wants to keep the old value around, and chooses an input register for src that isn't the same as the output register.
If you want to take pointer inputs in registers and choose your own addressing modes, you can use "0" matching-constraints, or you can use "+r" read-write output operands.
You will also need a "memory" clobber or dummy memory input/output operands (i.e. operands that tell the compiler which bytes of memory are read and written, even if you don't use that operand number in the inline asm).
See Looping over arrays with inline assembly for a discussion of the advantages and disadvantages of using "r" constraints for looping over an array on x86. ARM has auto-increment addressing modes, which appear to produce better code than anything you could get with manual choice of addressing modes. It lets gcc use different addressing modes in different copies of the block when loop-unrolling. "r" (pointer) constraints appear to have no advantage, so I won't go into detail about how to use a dummy input / output constraint to avoid needing a "memory" clobber.
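For readers who do want pointers in registers, here is a rough sketch of what dummy memory operands can look like for a single 16-byte block. It is untested on ARM hardware, and the function name rev32_block_dummy_mem is made up. The __aarch64__ check, the rev32 v0.16b, v0.16b spelling of the mnemonic and the scalar fallback for non-NEON targets are assumptions added for this sketch. The array-pointer casts follow the idiom shown in the GCC manual for telling the compiler exactly which bytes an asm reads or writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: byte-reverse four 32-bit words (one 16-byte block).
void rev32_block_dummy_mem(uint32_t *dst, const uint32_t *src)
{
    assert(dst != NULL && src != NULL);
#if defined(__aarch64__)
    asm ("ldr   q0, [%[src]]          \n\t"
         "rev32 v0.16b, v0.16b        \n\t"
         "str   q0, [%[dst]]          \n\t"
         : "=m" (*(uint32_t (*)[4]) dst)         // dummy output: these 16 bytes are written
         : [src] "r" (src), [dst] "r" (dst),     // pointers are read, not modified
           "m" (*(const uint32_t (*)[4]) src)    // dummy input: these 16 bytes are read
         : "v0");
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    asm ("vld1.32  {d0, d1}, [%[src]] \n\t"
         "vrev32.8 q0, q0             \n\t"
         "vst1.32  {d0, d1}, [%[dst]] \n\t"
         : "=m" (*(uint32_t (*)[4]) dst)
         : [src] "r" (src), [dst] "r" (dst),
           "m" (*(const uint32_t (*)[4]) src)
         : "d0", "d1");
#else
    // Assumed fallback so the sketch builds on targets without NEON.
    for (size_t i = 0; i < 4; i++) {
        uint32_t x = src[i];
        dst[i] = (x >> 24) | ((x >> 8) & 0x0000ff00u)
               | ((x << 8) & 0x00ff0000u) | (x << 24);
    }
#endif
}
```

The dummy operands are never referenced in the template. They only describe the memory effect, so neither volatile nor a "memory" clobber is needed and the compiler may still keep unrelated values in registers across the asm.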
Test-case that generates wrong code with @bitwise's asm statement:
// return a value as a way to tell the compiler it's needed after
uint32_t* unsafe_asm(uint32_t *dst, const uint32_t *src)
{
    uint32_t *orig_dst = dst;
    uint32_t initial_dst0val = orig_dst[0];
#ifdef __ARM_NEON
#if __LP64__
    asm volatile("ldr q0, [%0], #16 # unused src input was %2\n\t"
                 "rev32.16b v0, v0 \n\t"
                 "str q0, [%1], #16 # unused dst input was %3\n"
                 : "=r"(src), "=r"(dst)
                 : "r"(src), "r"(dst)
                 : "d0", "d1" // ,"memory"
                 // clobbers don't include v0?
                 );
#else
    asm volatile("vld1.32 {d0, d1}, [%0]! # unused src input was %2\n\t"
                 "vrev32.8 q0, q0 \n\t"
                 "vst1.32 {d0, d1}, [%1]! # unused dst input was %3\n"
                 : "=r"(src), "=r"(dst)
                 : "r"(src), "r"(dst)
                 : "d0", "d1" // ,"memory"
                 );
#endif
#else
#error "No NEON/AdvSIMD"
#endif
    uint32_t final_dst0val = orig_dst[0];
    // gcc assumes the asm doesn't change orig_dst[0], so it only does one load (after the asm)
    // and uses it for final and initial
    // uncomment the memory clobber, or use a dummy output operand, to avoid this.
    // pointer + initial+final compiles to LSL 3 to multiply by 8 = 2 * sizeof(uint32_t)
    // using orig_dst after the inline asm makes the compiler choose different registers for the
    // "=r"(dst) output operand and the "r"(dst) input operand, since the asm constraints
    // advertise this non-destructive capability.
    return orig_dst + final_dst0val + initial_dst0val;
}
This compiles to (AArch64 gcc4.8 -O3):
ldr q0, [x1], #16 # unused src input was x1 // src, src
rev32.16b v0, v0
str q0, [x2], #16 # unused dst input was x0 // dst, dst
ldr w1, [x0] // D.2576, *dst_1(D)
add x0, x0, x1, lsl 3 //, dst, D.2576,
ret
The store uses x2 (an uninitialized register, since this function only takes 2 args). The "=r"(dst) output (%1) picked x2, while the "r"(dst) input (%3, which is used only in a comment) picked x0.
final_dst0val + initial_dst0val compiles to 2x final_dst0val, because we lied to the compiler and told it that memory wasn't modified. So instead of reading the same memory before and after the inline asm statement, it just reads after and left-shifts by one extra position when adding to the pointer. (The return value exists only to use the values so they're not optimized away.)
We can fix both problems by correcting the constraints: using "+r" for the pointers and adding a "memory" clobber. (A dummy output would also work, and might hurt optimization less.) I didn't bother since this appears to have no advantage over the memory-operand version above.
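The corrected source isn't shown here, so the following is only a reconstruction of what the fix described above could look like. It uses "+r" pointer operands and a "memory" clobber, and is untested on ARM hardware. The __aarch64__ check, the "v0" clobber, the rev32 v0.16b, v0.16b spelling and the scalar fallback for non-NEON targets are assumptions added for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Reconstruction (an assumption, not the original source) of the fixed test case.
uint32_t *safe_register_pointer_asm(uint32_t *dst, const uint32_t *src)
{
    assert(dst != NULL && src != NULL);
    uint32_t *orig_dst = dst;
    uint32_t initial_dst0val = orig_dst[0];
#if defined(__aarch64__)
    asm volatile("ldr   q0, [%0], #16   \n\t"
                 "rev32 v0.16b, v0.16b  \n\t"
                 "str   q0, [%1], #16   \n"
                 : "+r"(src), "+r"(dst)      // read-write: the asm post-increments both
                 :
                 : "v0", "memory");          // the store changes memory the compiler may have cached
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    asm volatile("vld1.32  {d0, d1}, [%0]! \n\t"
                 "vrev32.8 q0, q0          \n\t"
                 "vst1.32  {d0, d1}, [%1]! \n"
                 : "+r"(src), "+r"(dst)
                 :
                 : "d0", "d1", "memory");
#else
    // Assumed fallback so the sketch builds on targets without NEON.
    for (size_t i = 0; i < 4; i++) {
        uint32_t x = src[i];
        dst[i] = (x >> 24) | ((x >> 8) & 0x0000ff00u)
               | ((x << 8) & 0x00ff0000u) | (x << 24);
    }
    dst += 4;   // mirror the post-increment done by the asm versions
    src += 4;
#endif
    uint32_t final_dst0val = orig_dst[0];    // must be reloaded after the asm
    return orig_dst + final_dst0val + initial_dst0val;
}
```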
With those changes, we get
safe_register_pointer_asm:
ldr w3, [x0] //, *dst_1(D)
mov x2, x0 // dst, dst ### These 2 insns are new
ldr q0, [x1], #16 // src
rev32.16b v0, v0
str q0, [x2], #16 // dst
ldr w1, [x0] // D.2597, *dst_1(D)
add x3, x1, x3, uxtw // D.2597, D.2597, initial_dst0val ## And this is new, to add the before and after loads
add x0, x0, x3, lsl 2 //, dst, D.2597,
ret
dst is an input operand, not an output operand, so the compiler assumes that register still holds the old value. So even when it compiles, you shouldn't expect it to work. See the inline assembly tag wiki for some links. IDK NEON asm syntax very well, but why don't you do the loads with intrinsics or uint64_t, and leave the load / store to the compiler? Or if possible, write it in a way that gives the compiler a choice of memory or register destination. (an "=rm" output constraint) – Kept
… volatile, but even then constant-propagation can't happen through inline asm after inlining, and there are various other ways that you'll get worse code. See gcc.gnu.org/wiki/DontUseInlineAsm – Kept
… "memory" clobber. A better and also correct way to do it is: [dst] "=m" (*dst). (Cast the pointer to something of the correct size, so the compiler knows exactly which bytes of memory are modified, and can assume everything else is unmodified.) Of course, if you ever use this function in a context where it would be useful to have the output in a register, you're shooting yourself in the foot by forcing an extra store-forwarding round trip (probably several cycles of latency). – Kept
… q0. You could use a dummy output constraint to let the compiler pick scratch regs for you. (Although IDK how you'd go about deriving the right d0/d1 for a given q register; there might not be any GNU C inline-asm syntax for that on ARM.) – Kept
… vld1.32 {d0-d3}, [r2]. Note that the destination elements are delimited with a hyphen, not a comma. Perhaps this is how vector elements are specified? (Note: I don't speak NEON). – Briard
… [reg] addressing mode as opposed to a post-increment addressing mode. That's still worse than intrinsics, but this is still a valid question for cases where inline asm is a good choice. – Kept