GCC code that seems to break inline assembly rules but an expert believes otherwise
Asked Answered
C

1

12

I was engaged with an expert who allegedly has vastly superior coding skills than myself who understands inline assembly far better than I ever could.

One of the claims is that as long as an operand appears as an input constraint, you don't need to list it as a clobber or specify that the register has been potentially modified by the inline assembly. The conversation came about when someone else was trying to get assistance on a memset implementation that was effectively coded this way:

void *memset(void *dest, int value, size_t count)
{
    asm volatile  ("cld; rep stosb" :: "D"(dest), "c"(count), "a"(value));
    return dest;
}

The expert's claim when I commented about the issue with clobbering registers without telling the compiler, was to tell us that:

"c"(count) already tells the compiler c is clobbered

I found an example in the expert's own operating system where they write similar code with the same design pattern. They use Intel syntax for their inline assembly. This hobby operating system code operates in a kernel (ring0) context. An example is this buffer swap function1:

void swap_vbufs(void) {
    asm volatile (
        "1: "
        "lodsd;"
        "cmp eax, dword ptr ds:[rbx];"
        "jne 2f;"
        "add rdi, 4;"
        "jmp 3f;"
        "2: "
        "stosd;"
        "3: "
        "add rbx, 4;"
        "dec rcx;"
        "jnz 1b;"
        :
        : "S" (antibuffer0),
          "D" (framebuffer),
          "b" (antibuffer1),
          "c" ((vbe_pitch / sizeof(uint32_t)) * vbe_height)
        : "rax"
    );

    return;
}

antibuffer0, antibuffer1, and framebuffer are all buffers in memory treated as arrays of uint32_t. framebuffer is actual video memory (MMIO) and antibuffer0, antibuffer1 are buffers allocated in memory.

The global variables are properly set up before this function is called. They are declared as:

volatile uint32_t *framebuffer;
volatile uint32_t *antibuffer0;
volatile uint32_t *antibuffer1;

int vbe_width = 1024;
int vbe_height = 768;
int vbe_pitch;

My Questions and Concerns about this Kind of Code

As an apparent neophyte to inline assembly having an apparent naive understanding of the subject, I'm wondering whether my apparent uneducated belief this code is potentially very buggy is correct. I want to know if these concerns have any merit:

  1. RDI, RSI, RBX, and RCX are all modified by this code. RDI and RSI are incremented by LODSD and STOSD implicitly. The rest are modified explicitly with

        "add rbx, 4;"
        "dec rcx;"
    

    None of these registers are listed as input/output nor are they listed as output operands. I believe these constraints need to be modified to inform the compiler that these registers may have been modified/clobbered. The only register that is listed as clobbered which I believe is correct is RAX. Is my understanding correct? My feeling is that RDI, RSI, RBX, and RCX should be input/output constraints (Using the + modifier). Even if one tries to argue that the 64-bit System V ABI calling convention will save them (assumptions that a poor way IMHO to write such code) RBX is a non-volatile register that will change in this code.

  2. Since the addresses are passed via registers (and not memory constraints), I believe it is a potential bug that the compiler hasn't been told that memory that these pointers are pointing at has been read and/or modified. Is my understanding correct?

  3. RBX, and RCX are hard coded registers. Wouldn't it make sense to allow the compiler to choose these registers automatically via the constraints?

  4. If one assumes that inline assembly has to be used here (hypothetically) what would bug free GCC inline assembly code look like for this function? Is this function fine as is, and I just don't understand the basics of GCC's extended inline assembly like the expert does?


Footnotes

  • 1The swap_vbufs function and associated variable declarations have been reproduced verbatim without the copyright holder's permission under fair use for purposes of commentary about a larger body of work.
Clitoris answered 15/5, 2019 at 5:48 Comment(12)
From the docs: Do not modify the contents of input-only operands (except for inputs tied to outputs). The compiler assumes that on exit from the asm statement these operands contain the same values as they had before executing the statement. It is not possible to use clobbers to inform the compiler that the values in these inputs are changing. Pretty unambiguous.Pyriform
@DavidWohlferd : Oh I don't think it is fair to quote the docs. I have to assume the author of this code is beyond reproach and knows something the GCC writer's don't lol. I watches this individual try to explain how similar memset code should be handled with a nearly identical type of solution. I was informed that I was retarded because I don't have a basic grasp of inline assembly. It must be assumed I (and the documentation) are wrong in the matter.Clitoris
@DavidWohlferd : You might be happy to know that just before being punted off the server where this kind of code was being encouraged I happened to post a link to your article: gcc.gnu.org/wiki/DontUseInlineAsm ;-)Clitoris
I may have understated my understanding of writing code or my understanding of inline assembly when posting my question. I felt it prudent in the face of an author with alleged superior knowledge and understanding - to believe that I must be a neophyte when looking at this kind of thing.Clitoris
@MichaelPetch: yeah very much gcc.gnu.org/wiki/DontUseInlineAsm here; swap_vbufs doesn't look optimal. I think it's just a copy-if-different. I guess if diffs are rare, the branch misses here might be worth avoiding MMIO stores, but a rep movsd or movnti unconditional streaming store to video RAM might be fine. Either way, it's not the most efficient implementation of the loop; using stosd makes it worse because it requires an else add rdi,4 branch with an extra jmp on the (hopefully) fast path.Rhodolite
I get that some people might think they're smarter than the docs (some are). But even if that were true, it's important to remember that the docs don't just constrain users, they also enable the compiler writers. If they know that X must be true, they can perform additional optimizations at any time. So even if it were safe today (and per Peter, clearly it isn't) that doesn't mean it's safe forever. Will he still be maintaining this code then? Or does some other poor schmuck get stuck tracking down the problems violating this rule engenders?Pyriform
@DavidWohlferd : I'm in complete agreement with you and Peter on this. Right from the outset the obvious flaws (even to my apparent neophyte understanding) in the memset function I pointed out were dismissed and I was the retard because I am the one who apparently doesn't understand. There is my belief, yours, peter's and the expert. When it comes to inline assembly I'm pretty anal about it being done correctly because it can be the source of hard to track bugs.Clitoris
If the code gets shared the situation gets worse. Just because something is observed to work by luck alone, doesn't mean it is behaviour that should be relied upon.Clitoris
"work by luck alone" - In this case, it's clear that's what's happening. But the problem is that ANY asm might be working only due to luck. Unless/until it fails, how would you know that you missed an early clobber or some such? Examining the output just isn't sufficient to ensure correctness. And yeah, there's a lot of badly written asm out there which people who don't understand it just copy. Even if the original author cleans it up, everyone else is still screwed. It sounds like you've got a good understand of the how and when though.Pyriform
As for DontUseInlineAsm, I don't know that there's any great wisdom there, but there might be things that people who don't use asm often haven't fully considered. I fell in love with asm when I first discovered it. But over time, I came to understand that despite its power, it's almost always a bad idea. It gives interesting insight into how the compiler works, but I'd think long and hard before using it in production code. That said, I mentally add 1 to my karma every time it gets referenced...Pyriform
@DavidWohlferd Well I mention it a lot on SO and off so you have a pile of positive karma from me alone ;-)Clitoris
@MichaelPetch I'm glad it's useful. If you think of any additional points that belong there or have any corrections, let me know.Pyriform
R
13

You are correct on all counts, this code is full of lies to the compiler that could bite you. e.g. with different surrounding code, or different compiler versions / options (especially link-time optimization to enable cross-file inlining).

swap_vbufs doesn't even look very efficient, I suspect gcc would do equal or better with a pure C version. https://gcc.gnu.org/wiki/DontUseInlineAsm. stosd is 3 uops on Intel, worse than a regular mov-store + add rdi,4. And making add rdi,4 unconditional would avoid the need for that else block which puts an extra jmp on the (hopefully) fast path where there's no MMIO store to video RAM because the buffers were equal.

(lodsd is only 2 uops on Haswell and newer so that's ok if you don't care about IvyBridge or older).

In kernel code I guess they're avoiding SSE2, even though it's baseline for x86-64, otherwise you'd probably want to use that. For a normal memory destination, you'd just memcpy with rep movsd or ERMSB rep movsb, but I guess the point here is to avoid MMIO stores when possible by checking against a cached copy of video RAM. Still, unconditional streaming stores with movnti might be efficient, unless video RAM is mapped UC (uncacheable) instead of WC.


It's easy to construct examples where this really does break in practice, by e.g. using the relevant C variable again after the inline asm statement in the same function. (Or in a parent function which inlined the asm).

An input you want to destroy has to be handled usually with a matching dummy output or a RMW output with a C tmp var, not just "r". or "a".

"r" or any specific-register constraint like "D" means this is a read-only input, and the compiler can expect to find the value undisturbed afterwards. There is no "input I want to destroy" constraint; you have to synthesize that with a dummy output or variable.

This all applies to other compilers (clang and ICC) that support GNU C inline asm syntax.

From the GCC manual: Extended asm Input Operands:

Do not modify the contents of input-only operands (except for inputs tied to outputs). The compiler assumes that on exit from the asm statement these operands contain the same values as they had before executing the statement. It is not possible to use clobbers to inform the compiler that the values in these inputs are changing.

(An rax clobber makes it an error to use "a" as an input; clobbers and operands can't overlap.)


Example for 1: register input operands

int plain_C(int in) {   return (in+1) + in;  }

// buggy: modifies an input read-only operand
int bad_asm(int in) {
    int out;
    asm ("inc %%edi;\n\t mov %%edi, %0" : "=a"(out) : [in]"D"(in) );
    return out + in;
}

Compiled on the Godbolt compiler explorer

Notice that gcc's addl uses edi for in, even though inline asm used that register as an input. (And thus breaks because this buggy inline asm modifies the register). It happens to hold in+1 in this case. I used gcc9.1, but this is not new behaviour.

## gcc9.1 -O3 -fverbose-asm
bad(int):
        inc %edi;
         mov %edi, %eax         # out  (comment mentions out because I used %0)

        addl    %edi, %eax      # in, tmp86
        ret     

We fix that by telling the compiler that the same input register is also an output, so it can no longer count on that . (Or by using auto tmp = in; asm("..." : "+r"(tmp));)

int safe(int in) {
    int out;
    int dummy;
    asm ("inc %%edi;\n\t mov %%edi, %%eax"
     : "=a"(out),
       "=&D"(dummy)
     : [in]"1"(in)  // matching constraint, or "D" works.
    );
    return out + in;
}
# gcc9.1 again.
safe_asm(int):
        movl    %edi, %edx      # tmp89, in    compiler-generated save of in
          # start inline asm
        inc %edi;
         mov %edi, %eax
          # end inline asm
        addl    %edx, %eax      # in, tmp88
        ret

Obviously "lea 1(%%rdi), %0" would avoid the problems by not modifying the input in the first place, and so would mov/inc. This is an artificial example that intentionally destroys an input.


If the function does not inline and doesn't use the input variable after the asm statement, you typically get away with lying to the compiler, as long as it's a call-clobbered register.

It's not rare to find people that have written unsafe code that happens to work in the context they're using it in. It's also not rare for them to be convinced that simply testing it in that context with one compiler version/options is sufficient to verify its safety or correctness.

But that's not how asm works; the compiler trusts you to accurately describe the asm's behaviour, and simply does text substitution on the template part.

It would be a crappy missed optimization if gcc assumed that asm statements always destroyed their inputs. In fact, the same constraints that inline asm uses are (I think) used in the internal machine-description files that teach gcc about an ISA. (So destroyed inputs would be terrible for code-gen).

The whole design of GNU C inline asm is based around wrapping a single instruction, that's why even early-clobber for outputs isn't the default. You have to do that manually if necessary, if writing multiple instructions or a loop inside inline asm.


a potential bug that the compiler hasn't been told that memory that these pointers are pointing at has been read and or modified.

That's also correct. A register input operand does not imply that the pointed-to memory is also an input operand. In a function that can't inline, this can't actually cause problems, but as soon as you enable link-time optimization, cross-file inlining and inter-procedural optimization becomes possible.

There's an existing Informing clang that inline assembly reads a particular region of memory unanswered question. This Godbolt link shows some of the ways you can reveal this problem, e.g.

   arr[2] = 1;
   asm(...);
   arr[2] = 0;

If gcc assumes arr[2] isn't an input to the asm, only the arr address itself, it will do dead-store elimination and remove the =1 assignment. (Or look at it as reordering the store with the asm statement, then collapsing 2 stores to the same location).

An array is good because it shows that even "m"(*arr) doesn't work for a pointer, only an actual array. That input operand would only tell the compiler that arr[0] is an input, still not arr[2]. That's a good thing if that's all your asm reads, because it doesn't block optimization of other parts.

For that memset example, to properly declare that the pointed-to memory is an output operand, cast the pointer to a pointer-to-array and dereference it, to tell gcc that an entire range of memory is the operand. *(char (*)[count])pointer. (You can leave the [] empty to specify an arbitrary-length region of memory accessed via this pointer.)

// correct version written by @MichaelPetch.  
void *memset(void *dest, int value, size_t count)
{
  void *tmp = dest;
  asm ("rep stosb    # mem output is %2"
     : "+D"(tmp), "+c"(count),       // tell the compiler we modify the regs
       "=m"(*(char (*)[count])tmp)   // dummy memory output
     : "a"(value)                    // EAX actually is read-only
     : // no clobbers
  );
  return dest;
}

Including an asm comment using the dummy operand lets us see how the compiler allocated it. We can see the compiler picks (%rdi) with AT&T syntax, so it is willing to use a register that's also an input/output operand.

With an early-clobber on the output it might have wanted to use another register, but without that it doesn't cost us anything to gain correctness.

With a void function that doesn't return the pointer (or after inlining into a function that doesn't use the return value), it doesn't have to copy the pointer arg anywhere before letting rep stosb destroy it.

Rhodolite answered 15/5, 2019 at 8:9 Comment(10)
I'm the upvote. I agree with you! In the absence of any other answer in the next day or so I will accept this as an answer. Are you sure about this though? I was told I was retarded and don't understand inline assembly when the author tried to explain memset could be handled in a similar fashion although at some point they believed a "memory" clobber should be added lol.Clitoris
@MichaelPetch: yes, I'm 100% sure. I was only 99.999% after someone cast some doubt on it, so this answer was a placeholder until I got around to creating a test-case to restore my confidence to absolute. :PRhodolite
Thanks, I wish I could upvote again, but I am more than happy to accept this as an answer. DavidWohlferd quoted the GCC docs about clobbering input only operands. Could you maybe quote the docs about that? Just to make that part complete? I appreciate the time you took. The original convo that got this started was about memset that someone wrote in a similar style. I was shot down when I tried to explain some of the very issues and that I lacked inline assembly understanding and that I was a retard lol.Clitoris
@MichaelPetch: yeah that's probably a good idea to make this answer even more authoritative. Done.Rhodolite
PS: I was damn sure I wasn't wrong, but I had to downplay my knowledge on the matter.Clitoris
As always, a thorough and thoughtful explanation from you. It effectively relays the sentiment I had about some similar memset code a third developer asked about. I was shot down about the clobber issue. I updated my question with the memset code that started this. The third party who was intent on using inline assembly for this, so I suggested this at the end of the night: void *memset(void *dest, int value, size_t count) { void *tmp = dest; asm ("rep stosb" : "+D"(tmp), "+c"(count), "=m"(*(char (*)[count])tmp) : "a"(value)); return dest; }Clitoris
I believe you and I are in agreement on all this. I felt like I may have been in an alternate reality yesterday where inline assembly didn't work the way I have learned to expect over the past number of years. The interesting thing about this question/answer is it's about the breaking of a variety of rules that seem to catch people off guard (in SO questions) and then they end up wondering why their code acts strangely in the future for no apparent reason. They may have got lucky it appeared to work (or that the side effects weren't noticeable)Clitoris
@MichaelPetch: yeah, as non-inline functions all of the problems would be hidden. Even destroying RBX, because save/restore of the caller's value doesn't depend on how it's used as a temporary. Like I said in this answer, it's not rare to see people get wrong ideas because their code happens to work. (And it's also not rare for broken code like this to work fine without cross-file inlining enabled; LTO is unfortunately under-used.)Rhodolite
in 64-bit Windows kernel you can use SSE registers without problem. In other cases you'll need to call KeSaveExtendedProcessorState and KeRestoreExtendedProcessorState to save and restore the context. The equivalent on Linux is kernel_fpu_begin()/ kernel_fpu_end(). If the cost of using SIMD registers outweighs the cost of context switch then you can definitely use itEngle
"In fact, the same constraints that inline asm uses are (I think) used in the internal machine-description files that teach gcc about an ISA." True. In fact, the same file is #included in both the user docs and the internals. I added some #ifs to exclude things that don't apply in user space, but it's easy to see the shared text.Pyriform

© 2022 - 2024 — McMap. All rights reserved.