Looping over arrays with inline assembly

When looping over an array with inline assembly, should I use the register modifier "r" or the memory modifier "m"?

Let's consider an example that adds two float arrays x and y and writes the results to z. Normally I would use intrinsics, like this:

for(int i=0; i<n/4; i++) {
    __m128 x4 = _mm_load_ps(&x[4*i]);
    __m128 y4 = _mm_load_ps(&y[4*i]);
    __m128 s = _mm_add_ps(x4,y4);
    _mm_store_ps(&z[4*i], s);
}

Here is the inline assembly solution I have come up with using the register modifier "r":

void add_asm1(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%1,%%rax,4), %%xmm0\n"
            "addps    (%2,%%rax,4), %%xmm0\n"
            "movaps   %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            :
        );
    }
}

This generates assembly similar to what GCC produces for the intrinsics version. The main difference is that GCC adds 16 to the index register and uses a scale of 1, whereas the inline-assembly solution adds 4 to the index register and uses a scale of 4.

I was not able to use a general register for the iterator; I had to specify one, which in this case was rax. Is there a reason for this?

Here is the solution I came up with using the memory modifier "m":

void add_asm2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   %1, %%xmm0\n"
            "addps    %2, %%xmm0\n"
            "movaps   %%xmm0, %0\n"
            : "=m" (z[i])
            : "m" (y[i]), "m" (x[i])
            :
            );
    }
}

This is less efficient, as it does not use an index register and instead has to add 16 to the base register of each array. The generated assembly is (gcc (Ubuntu 5.2.1-22ubuntu2) with gcc -O3 -S asmtest.c):

.L22:
    movaps   (%rsi), %xmm0
    addps    (%rdi), %xmm0
    movaps   %xmm0, (%rdx)
    addl    $4, %eax
    addq    $16, %rdx
    addq    $16, %rsi
    addq    $16, %rdi
    cmpl    %eax, %ecx
    ja      .L22

Is there a better solution using the memory modifier "m"? Is there some way to get it to use an index register? The reason I ask is that it seemed more logical to me to use the memory modifier "m", since I am reading and writing memory. Additionally, with the register modifier "r" I never use an output operand list, which seemed odd to me at first.

Maybe there is a better solution than using "r" or "m"?

Here is the full code I used to test this:

#include <stdio.h>
#include <x86intrin.h>

#define N 64

void add_intrin(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __m128 x4 = _mm_load_ps(&x[i]);
        __m128 y4 = _mm_load_ps(&y[i]);
        __m128 s = _mm_add_ps(x4,y4);
        _mm_store_ps(&z[i], s);
    }
}

void add_intrin2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n/4; i++) {
        __m128 x4 = _mm_load_ps(&x[4*i]);
        __m128 y4 = _mm_load_ps(&y[4*i]);
        __m128 s = _mm_add_ps(x4,y4);
        _mm_store_ps(&z[4*i], s);
    }
}

void add_asm1(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%1,%%rax,4), %%xmm0\n"
            "addps    (%2,%%rax,4), %%xmm0\n"
            "movaps   %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            :
        );
    }
}

void add_asm2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   %1, %%xmm0\n"
            "addps    %2, %%xmm0\n"
            "movaps   %%xmm0, %0\n"
            : "=m" (z[i])
            : "m" (y[i]), "m" (x[i])
            :
            );
    }
}

int main(void) {
    float x[N], y[N], z1[N], z2[N], z3[N];
    for(int i=0; i<N; i++) x[i] = 1.0f, y[i] = 2.0f;
    add_intrin2(x,y,z1,N);
    add_asm1(x,y,z2,N);
    add_asm2(x,y,z3,N);
    for(int i=0; i<N; i++) printf("%.0f ", z1[i]); puts("");
    for(int i=0; i<N; i++) printf("%.0f ", z2[i]); puts("");
    for(int i=0; i<N; i++) printf("%.0f ", z3[i]); puts("");
}
Episcopal answered 12/12, 2015 at 19:46 Comment(9)
As to why you need to use "a" instead of "r": It's because 'i' is an int, so gcc generates eax (the correct size of an int) instead of rax (needed when computing 64bit offsets). You can change i to long long or use %q3 to force the full register. BTW, since add_asm1 modifies memory, it should use the "memory" clobber. – Woods
@DavidWohlferd, thank you for your comments, especially the one about "memory". Maybe I was not clear, though. What I mean is I want to do (%1,%4,4) instead of (%1,%%rax,4), where %4 is whatever register gcc decides rather than forcing it to be rax. – Episcopal
I don't think you want %4, you want %3 (zero based). And if you change int i=0 to long long i=0, then you can use "r" along with %3. Alternately, you can leave i an int, and use %q3 (also changing from "a" to "r"). – Woods
@DavidWohlferd, you're right, I want %3. I tried this and it works; I did not even have to switch to long long i=0. Looking at the assembly I see that gcc uses %eax. That's a better solution anyway, as there is no reason to use %rax for the index. If you want to write up an answer I will upvote you. – Episcopal
@DavidWohlferd, as to "memory": this link says "If our instruction modifies memory in an unpredictable fashion, add "memory" to the list of clobbered registers". I am not sure how this is unpredictable. However, GCC's documentation says "The "memory" clobber tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters)." – Episcopal
So based on the GCC documentation this applies to my case. – Episcopal
It is "unpredictable" in that without parsing the assembler template (which gcc does not do except to replace tokens), there is no way to know whether you are reading and writing memory based solely on the inputs and outputs you provide. – Woods
re: memory clobber: They mean use "memory" if you can't tell the compiler which memory was clobbered. In this case it is predictable, so you can use the statement-expression trick suggested at the end of the Clobbers section: {"m"( ({ struct { char x[16]; } *p = (void *)(z+i*4) ; *p; }) )}. I modified the example to fit your code: clobber 16 bytes at &z[i*4]. Also note that using a memory output operand would mean you don't need __volatile__ on your asm, since it knows it can't hoist a store to z[i]. – Golightly
@PeterCordes if you give an answer with your suggestion using {"m"( ({ struct { char x[16]; } *p = (void *)(z+i*4) ; *p; }) )} I will upvote it. – Episcopal

Avoid inline asm whenever possible: https://gcc.gnu.org/wiki/DontUseInlineAsm. It blocks many optimizations. But if you really can't hand-hold the compiler into making the asm you want, you should probably write your whole loop in asm so you can unroll and tweak it manually, instead of doing stuff like this.


You can use an r constraint for the index. Use the q modifier to get the name of the 64bit register, so you can use it in an addressing mode. When compiled for 32bit targets, the q modifier selects the name of the 32bit register, so the same code still works.
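
For example, here's a minimal sketch of the question's loop written that way (my adaptation, not code from the question; it assumes 16-byte-aligned pointers and declares the clobbers the original was missing):

void add_asm1_q(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%1,%q3,4), %%xmm0\n\t"    // %q3: 64-bit name of operand 3
            "addps    (%2,%q3,4), %%xmm0\n\t"
            "movaps   %%xmm0, (%0,%q3,4)\n\t"
            :
            : "r" (z), "r" (y), "r" (x), "r" (i)   // plain "r" for the index; no fixed rax
            : "xmm0", "memory"   // xmm0 is written, and the stores aren't visible to gcc otherwise
        );
    }
}

With operand 3 under an "r" constraint, gcc picks whatever register it likes for the index, and %q3 prints that register's 64-bit name inside the addressing mode.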

If you want to choose what kind of addressing mode is used, you'll need to do it yourself, using pointer operands with r constraints.

GNU C inline asm syntax doesn't assume that you read or write the memory pointed to by pointer operands (e.g. maybe you're using an inline-asm and instruction on the pointer value). So you need to do something with either a "memory" clobber or memory input/output operands to let it know what memory you modify. A "memory" clobber is easy, but forces everything except locals to be spilled/reloaded. See the Clobbers section in the docs for an example of using a dummy input operand.

Specifically, a "m" (*(const float (*)[]) fptr) will tell the compiler that the entire array object is an input, arbitrary-length. i.e. the asm can't reorder with any stores that use fptr as part of the address (or that use the array it's known to point into). Also works with an "=m" or "+m" constraint (without the const, obviously).

Using a specific size like "m" (*(const float (*)[4]) fptr) lets you tell the compiler what you do or don't read (or write). Then it can (if otherwise permitted) sink a store to a later element past the asm statement, and combine or dead-store-eliminate any stores that your inline asm doesn't read.

(See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for a whole Q&A about this.)
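
As a concrete sketch of the sized form (the helper and its names are mine, not from the linked Q&A; movaps assumes 16-byte alignment):

#include <xmmintrin.h>

static inline void copy4f(float *dst, const float *src) {
    __m128 tmp;   // scratch register; the compiler picks which one
    __asm__ (
        "movaps   %2, %1\n\t"
        "movaps   %1, %0"
        : "=m" (*(float (*)[4]) dst),       // exactly these 4 floats are written
          "=x" (tmp)
        : "m" (*(const float (*)[4]) src)   // exactly these 4 floats are read
    );
}

Because each constraint covers exactly 16 bytes, stores to other parts of the arrays can still be reordered or combined around the asm statement.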


Another huge benefit of an m constraint is that -funroll-loops can work by generating addresses with constant offsets. Doing the addressing ourselves prevents the compiler from doing a single increment every 4 iterations or something, because every source-level value of i needs to appear in a register.


Here's my version, with some tweaks as noted in comments. This is not optimal, e.g. it can't be unrolled efficiently by the compiler.

#include <immintrin.h>
void add_asm1_memclobber(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
            : "memory"
          // you can avoid a "memory" clobber with dummy input/output operands
        );
    }
}

Godbolt compiler explorer asm output for this and a couple versions below.

Your version needs to declare %xmm0 as clobbered, or you will have a bad time when this is inlined. My version uses a temporary variable as an output-only operand that's never used. This gives the compiler full freedom for register allocation.

If you want to avoid the "memory" clobber, you can use dummy memory input/output operands like "m" (*(const __m128*)&x[i]) to tell the compiler which memory is read and written by your function. This is necessary to ensure correct code-generation if you did something like x[4] = 1.0; right before running that loop. (And even if you didn't write something that simple, inlining and constant propagation can boil it down to that.) And also to make sure the compiler doesn't read from z[] before the loop runs.
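
Here's roughly what that variant looks like (a reconstruction to match the commented output below, not the exact code that produced it):

#include <immintrin.h>
void add_asm1_dummy_ops(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp),
              "=m" (*(__m128*)&z[i])            // dummy output: the 16 bytes written at &z[i]
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i),
              "m" (*(const __m128*)&x[i]),      // dummy inputs: the 16 bytes read from each array
              "m" (*(const __m128*)&y[i])
        );
    }
}

(No __volatile__ is needed here: the memory output operand already keeps the compiler from deleting the asm statement or reordering it relative to the store it describes.)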

In this case, we get horrible results: gcc5.x actually increments 3 extra pointers because it decides to use [reg] addressing modes instead of indexed. It doesn't know that the inline asm never actually references those memory operands using the addressing mode created by the constraint!

# gcc5.4 with dummy constraints like "=m" (*(__m128*)&z[i]) instead of "memory" clobber
.L11:
    movaps   (%rsi,%rax,4), %xmm0   # y, i, vectmp
    addps    (%rdi,%rax,4), %xmm0   # x, i, vectmp
    movaps   %xmm0, (%rdx,%rax,4)   # vectmp, z, i

    addl    $4, %eax        #, i
    addq    $16, %r10       #, ivtmp.19
    addq    $16, %r9        #, ivtmp.21
    addq    $16, %r8        #, ivtmp.22
    cmpl    %eax, %ecx      # i, n
    ja      .L11        #,

r8, r9, and r10 are the extra pointers that the inline asm block doesn't use.

You can use a constraint that tells gcc an entire array of arbitrary length is an input or an output: "m" (*(const char (*)[]) pStr). This casts the pointer to a pointer-to-array (of unspecified size). See How can I indicate that the memory *pointed* to by an inline ASM argument may be used?

If we want to use indexed addressing modes, we will have the base address of all three arrays in registers, and this form of constraint asks for the base address (of the whole array) as an operand, rather than a pointer to the current memory being operated on.

This actually works without any extra pointer or counter increments inside the loop: (avoiding a "memory" clobber, but still not easily unrollable by the compiler).

void add_asm1_dummy_whole_array(const float *restrict x, const float *restrict y,
                             float *restrict z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)
             , "=m" (*(float (*)[]) z)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
              , "m" (*(const float (*)[]) x),
                "m" (*(const float (*)[]) y)  // pointer to unsized array = all memory from this pointer
        );
    }
}

This gives us the same inner loop we got with a "memory" clobber:

.L19:   # with clobbers like "m" (*(const struct {float a; float x[];} *) y)
    movaps   (%rsi,%rax,4), %xmm0   # y, i, vectmp
    addps    (%rdi,%rax,4), %xmm0   # x, i, vectmp
    movaps   %xmm0, (%rdx,%rax,4)   # vectmp, z, i

    addl    $4, %eax        #, i
    cmpl    %eax, %ecx      # i, n
    ja      .L19        #,

It tells the compiler that each asm block reads or writes the entire arrays, which may unnecessarily stop it from interleaving the asm with other code (e.g. after fully unrolling with a low iteration count). It doesn't stop unrolling, but the requirement to have each index value in a register does make it less effective. There's no way for this to end up with a 16(%rsi,%rax,4) addressing mode in a 2nd copy of this block in the same loop, because we're hiding the addressing from the compiler.


A version with m constraints that gcc can unroll:

#include <immintrin.h>
void add_asm1(float *x, float *y, float *z, unsigned n) {
    // x, y, z are assumed to be aligned
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
           // "movaps   %[yi], %[vectmp]\n\t"   // get the compiler to do this load instead
            "addps    %[xi], %[vectmp]\n\t"
            "movaps   %[vectmp], %[zi]\n\t"
          // __m128 is a may_alias type so these casts are safe.
            : [vectmp] "=x" (vectmp)         // let compiler pick a scratch reg
              ,[zi] "=m" (*(__m128*)&z[i])   // actual memory output for the movaps store
            : [yi] "0"  (*(__m128*)&y[i])  // or [yi] "xm" (*(__m128*)&y[i]), and uncomment the movaps load
             ,[xi] "xm" (*(__m128*)&x[i])
              //, [idx] "r" (i) // unrolling with this would need an insn for every increment by 4
        );
    }
}

Using [yi] as a "+x" input/output operand would be simpler, but writing it this way makes it a smaller change to uncomment the load in the inline asm, instead of letting the compiler get one value into a register for us.
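
For reference, a sketch of that simpler "+x" form (my adaptation, with the same alignment assumptions as above):

#include <immintrin.h>
void add_asm1_rw(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __m128 acc = *(__m128*)&y[i];   // the compiler emits this load itself
        __asm__ (
            "addps    %[xi], %[acc]\n\t"
            "movaps   %[acc], %[zi]\n\t"
            : [acc] "+x" (acc),                 // read-write: y[i] in, the sum out
              [zi] "=m" (*(__m128*)&z[i])
            : [xi] "xm" (*(__m128*)&x[i])
        );
    }
}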

Golightly answered 24/12, 2015 at 0:36 Comment(8)
This is the kind of answer I was looking for. Thank you. – Episcopal
@Zboson: check the update. I wrote a couple other inline-asm answers recently (stackoverflow.com/questions/34449407/… and https://mcmap.net/q/14654/-llvm-reports-unsupported-inline-asm-input-with-type-39-void-39-matching-output-with-type-39-int-39/224132), and using a C temporary as an output-only operand is a great way to give the compiler control of register allocation for scratch registers. More importantly, an m constraint allows unrolling. I think that's another major downside of choosing the addressing mode yourself. – Golightly
@Zboson: glad that helped. Your question kind of wandered around a bit, so it wasn't until you complained to Jester and I re-read the beginning of your question that I really realized which part hadn't been answered. It would be a better question if you edited it down to more clearly state your goals, and just showed one intrinsics attempt that leads to code with more uops as the reason why you don't want that. And I guess it's a 2-part question, since you didn't find the q modifier to let you use a 32bit loop counter operand as an index register. – Golightly
I did not realize my question was ambiguous. I should not have included intrinsics and instead only have used inline assembly. My question is only about inline assembly. The very first sentence of my question is in bold: "When looping over an array with inline assembly should I use the register modifier "r" or the memory modifier "m"?" The intrinsics are mostly there for unit testing with the assembly. – Episcopal
@Zboson: I have the same tendency, to go overboard with related stuff I'm thinking about and alternatives I've already considered. I did notice that the actual question was the stuff in bold the 2nd time around, but it wasn't a very specific question, so I think I was still looking for more "question" in the rest of it. That's always the problem when you're asking about something you don't really understand. (And if you did, you wouldn't be asking in the first place!) But anyway, hope this helps for future questions. Maybe summarize at the end, too? Or make it clear what the summary is. – Golightly
I often set the summary apart with a tl;dr heading and a horizontal line, to make sure it's clear that the part inside that section is the important part, and everything outside is just that writ large, or with more background detail. To summarize my point about your question, the biggest problem was you didn't define your exact goal. There are reasons either way for using m or r, depending on what you're aiming for. – Golightly
@PeterCordes in your unrolled godbolt loop you added a "memory" clobber. Why was "=m" (*(__m128*)&z[i]) not enough? It seems to be exactly the same as the example gcc provides, asm ("vecmul %0, %1, %2" : "+r" (z), "+r" (x), "+r" (y), "=m" (*z) : "m" (*x), "m" (*y));, in the extended assembly manpage. – Felisha
@Noah: Thanks, that looks like leftover lines I missed taking out when modifying the first version to create that separate example. There are a few things I'd been meaning to edit in this answer. I could probably rip out large chunks now that I updated the syntax of the example code blocks to use "m" (*(const float (*)[]) x) instead of the pointer-to-struct with arbitrary-size array member, and there's another Q&A about how to tell the compiler about pointed-to memory that I could just link. But that's a project for another day. – Golightly

When I compile your add_asm2 code with gcc (4.9.2) I get:

add_asm2:
.LFB0:
        .cfi_startproc
        xorl        %eax, %eax
        xorl        %r8d, %r8d
        testl       %ecx, %ecx
        je  .L1
        .p2align 4,,10
        .p2align 3
.L5:
#APP
# 3 "add_asm2.c" 1
        movaps   (%rsi,%rax), %xmm0
addps    (%rdi,%rax), %xmm0
movaps   %xmm0, (%rdx,%rax)

# 0 "" 2
#NO_APP
        addl        $4, %r8d
        addq        $16, %rax
        cmpl        %r8d, %ecx
        ja  .L5
.L1:
        rep; ret
        .cfi_endproc

so it is not perfect (it uses a redundant register), but it does use indexed loads...

Officiate answered 12/12, 2015 at 20:33 Comment(9)
Interesting, gcc (Ubuntu 5.2.1-22ubuntu2) does not do this (I added the assembly output to my question if you want to see it). Your result is the same assembly as my add_intrin function. That's why I used add_intrin2. It does not use a redundant register. – Episcopal
Why is GCC 5.2.1 less efficient than 4.9.2 in this case? – Episcopal
@Zboson: I assume they changed something in their function that evaluates the "cost" of using various addressing modes. This isn't the first time I've seen gcc5 do address calculations itself instead of using a reg+reg*scale addressing mode. Notice that gcc 5.3 doesn't use a scaled addressing mode for either intrinsic function either, even keeping two separate loop counters for add_intrin. IDK why it doesn't do the same thing for the asm memory operands and use 2-reg addressing modes. Maybe it thinks it can't use the same regs for multiple operands? – Golightly
Remember that gcc does most of its work on an intermediate representation of the code, independent of the target architecture. I tried to google about gcc avoiding scaled addressing modes, but only found this mail message where it was discussed a couple years ago: patchwork.ozlabs.org/patch/278187 – Golightly
@PeterCordes, I checked add_asm2 on godbolt for GCC 5.3 and GCC 4.9.2, and 4.9.2 gives the result in this answer, so that confirms this again. This has convinced me that using the memory modifier "m" is not the right solution. It's disturbing to have such different results with different versions of GCC. – Episcopal
@Zboson: In an ideal world, it's the "right answer", but gcc does a poor job with it. That looks like a bug which should be reported. If you want to choose the addressing mode yourself, then sure, go for it, esp. because of the gcc5 bug or whatever it is that makes it choose poorly. – Golightly
@PeterCordes, why is it the right answer in an ideal world? I mean, why should the "m" modifier be better than "r"? For that matter, I don't even see the point of "m" since I can use "r". The only point I see to "m" is when you want to access memory without using a register, and the only cases I can think of for that are absolute addressing and RIP-relative. Maybe that's the point of "m"? – Episcopal
@Zboson: It leaves the addressing mode decision-making up to the compiler. Maybe after inlining, the compiler can use a simpler addressing mode (like [disp32 + index] if run on a static array). Maybe the caller did add_asm1(buf1+64, buf2+128, outbuf), so gcc can use a [disp8 + buf1 + index] addressing mode instead of using an extra instruction to get buf1+64 into a register. – Golightly
@PeterCordes, thanks, I think I get it now. I don't want to give more freedom to the compiler when I use inline assembly. I want it to do what I think is best, not what it thinks is best. – Episcopal

gcc also has built-in vector extensions, which are even cross-platform:

typedef float v4sf __attribute__((vector_size(16)));
void add_vector(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n/4; i+=1) {
        *(v4sf*)(z + 4*i) = *(v4sf*)(x + 4*i) + *(v4sf*)(y + 4*i);
    }
}

On my gcc version 4.7.2 the generated assembly is:

.L28:
        movaps  (%rdi,%rax), %xmm0
        addps   (%rsi,%rax), %xmm0
        movaps  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %rcx, %rax
        jne     .L28
Alceste answered 23/12, 2015 at 14:57 Comment(6)
I am aware of built-in vector extensions as well as intrinsics. My question is about inline assembly. My question is not about why inline assembly is not necessary. – Episcopal
Well, you didn't mention that in your post, so I thought it was a good idea to mention vector extensions. Also, we are not just trying to help you, but possible future visitors, who might not know this. – Alceste
Did you answer a different question? My question is titled "Looping over arrays with inline assembly". Everything about my question is about inline assembly. I put in bold "Is there a better solution using the memory modifier "m"? Is there some way to get it to use an index register?". I also wrote "Maybe there is a better solution than using "r" or "m"?" I think it's quite clear that's referring to other methods using inline assembly. – Episcopal
Nevertheless, it has 2 versions with intrinsics for comparison and doesn't have the vector version. Maybe it will help somebody else. – Alceste
Since you are one of the few people with the gold assembly badge, I would really like your help with inline assembly. Inline assembly is a dying art due to things like intrinsics and vector extensions, and I don't have much experience with it because of this. – Episcopal
@Zboson: What exactly do you still want to know on this question that didn't get covered in comments? I thought we established that if you want to choose what addressing mode the code uses, you should use constraints to pass addresses in registers, then tell the compiler which 16 or 32B of memory is clobbered. I don't think there is a constraint that requires an indexed addressing mode on x86. – Golightly
