Missing optimization: mov al, [mem] to bitfield-insert a new low byte into an integer
I want to replace the lowest byte in an integer. On x86 this is exactly mov al, [mem] but I can't seem to get compilers to output this. Am I missing an obvious code pattern that is recognized, am I misunderstanding something, or is this simply a missed optimization?

unsigned insert_1(const unsigned* a, const unsigned char* b)
{
    return (*a & ~255) | *b;
}
unsigned insert_2(const unsigned* a, const unsigned char* b)
{
    return *a >> 8 << 8 | *b;
}

GCC actually uses al, but only for zeroing:

        mov     eax, DWORD PTR [rdi]
        movzx   edx, BYTE PTR [rsi]
        xor     al, al
        or      eax, edx
        ret

Clang compiles both practically verbatim:

        mov     ecx, -256
        and     ecx, dword ptr [rdi]
        movzx   eax, byte ptr [rsi]
        or      eax, ecx
        ret
Patten answered 22/6, 2023 at 10:38 Comment(10)
@500-InternalServerError That doesn't make sense to me. I want to replace the lowest byte in a 32-bit integer, so the return value is 32 bits. — Patten
First, do you compile with full optimization? If so, mov al, [mem] may seem shorter than 3 instructions, but this instruction has low throughput (eax has 2 dependencies for its value) and I think it is equivalent in time to the 3 others. — Kronstadt
Integer promotion is the law. — Neilson
@HansPassant Can you explain it, please? That is, how does integer promotion affect the assembly code? — Truth
Why doesn't GCC use partial registers? - mov al, [rdi] is indeed optimal on Sandybridge-family and non-Intel CPUs. It would cause a partial-register stall on P6-family. Perhaps nobody's taught GCC/clang to look for that as a peephole optimization for merging a byte because historically it was slow? But GCC actually is doing partial-register shenanigans with xor al,al; that's quite silly. At least on Sandybridge-family that doesn't have any extra partial-reg penalty, but or eax, edx after it would stall on P6. — Corazoncorban
@HansPassant I think the promotion rules are irrelevant here, since we are talking about compiler optimizations, so the as-if rule applies. — Patten
@PeterCordes Thanks for confirming what I thought, though I would have expected gcc to find this optimization at least at -Os. Compilers also don't seem to find the similar BFI instruction on ARM for the same purpose, instead using BIC + ORR. — Patten
@PeterCordes: mov al, [rdi] does not cause a partial register stall on P6 if you are writing to AL. On P6 the stall is write-small-read-large, e.g. if you then read EAX after writing AL. However, other P6 family members made the trade-off in slightly different ways, stalling not at consumption but doing the merge at every write to a partial register. In any case, your point is valid: partial register stalls are problematic. // Writing AL to memory does not cause a partial register stall; it might cause a partial store-buffer stall. — Fell
@KrazyGlew: I meant it would lead to a partial-reg stall in the caller, when they use the unsigned return value, on PPro through Nehalem. On Sandybridge-family, mov al, [rdi] would have a true dependency on RAX and do the merge then, as a micro-fused load+ALU uop. (Except maybe on first-gen Sandybridge itself, which does still rename AL separately from RAX when the access isn't an RMW.) — Corazoncorban
@KrazyGlew: Wait, what's a partial store-buffer stall? Is it not the case that byte stores on x86 (at least Intel P6 / SnB families) are just as efficient as 32-bit stores, in cost per store instruction? (Per my testing on Are there any modern CPUs where a cached byte store is actually slower than a word store?, modern CPUs like Skylake don't show any sign of needing to RMW a whole word to commit a byte store to a line of L1d, unlike most non-x86 CPUs.) — Corazoncorban
On x86 this is exactly mov al, [mem] but I can't seem to get compilers to output this.

Try this one, arithmetic-free:

unsigned insert_4(const unsigned* a, const unsigned char* b)
{
    unsigned int t = *a;
    unsigned char *tcp = (unsigned char *) & t;
    tcp[0] = *b;
    return t;
}


insert_4(unsigned int const*, unsigned char const*):
        mov     eax, DWORD PTR [rdi]
        mov     al, BYTE PTR [rsi]
        ret

A bit screwy, I know, but compilers are good at removing the indirection and the address-taken status of local variables (took a couple of tries, though).

godbolt x86-64 gcc 13.1 -O3


An alternative using union:

unsigned insert_5(const unsigned* a, const unsigned char* b)
{
    union {
        unsigned int ui;
        unsigned char uc;
    } u;
    u.ui = *a;
    u.uc = *b;
    return u.ui;
}

godbolt x86-64 gcc 13.1 -O3


Note that these solutions are endian-specific, though that seems to be what you're looking for; they can be adjusted for big-endian targets as needed.

Pinero answered 22/6, 2023 at 18:26 Comment(2)
Huh, that even compiles to BFI on ARM. It sure is screwy, but I can't argue with results. — Patten
@Homer512, added another solution using union. — Pinero

© 2022 - 2024 — McMap. All rights reserved.