I want to replace the lowest byte in an integer. On x86 this is exactly `mov al, [mem]`, but I can't seem to get compilers to output this. Am I missing an obvious code pattern that is recognized, am I misunderstanding something, or is this simply a missed optimization?
unsigned insert_1(const unsigned* a, const unsigned char* b)
{
    return (*a & ~255) | *b;
}

unsigned insert_2(const unsigned* a, const unsigned char* b)
{
    return *a >> 8 << 8 | *b;
}
GCC actually uses `al`, but just for zeroing:
mov eax, DWORD PTR [rdi]
movzx edx, BYTE PTR [rsi]
xor al, al
or eax, edx
ret
Clang compiles both practically verbatim
mov ecx, -256
and ecx, dword ptr [rdi]
movzx eax, byte ptr [rsi]
or eax, ecx
ret
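One more pattern that could be fed to the compilers alongside the two above is a byte-sized `memcpy` into a copy of the integer. This is only a sketch: `insert_3` is a hypothetical name, and it assumes a little-endian target such as x86, where the lowest-addressed byte is the least significant one. It shows an equivalent way to express the operation, not a form that any particular compiler is known to turn into a single byte load.

```c
#include <string.h>

/* Hypothetical third variant (not from the original question).
   Assumes a little-endian target, so byte 0 of `result` is its low byte. */
unsigned insert_3(const unsigned* a, const unsigned char* b)
{
    unsigned result = *a;      /* start from the full integer */
    memcpy(&result, b, 1);     /* overwrite only the lowest-addressed byte */
    return result;
}
```

On a little-endian machine this returns the same value as `insert_1` and `insert_2` for any inputs, so the generated assembly for all three can be compared directly.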
`mov al, [mem]` seems shorter than 3 instructions, but this instruction has low throughput (eax has 2 dependencies for its value) and I think is equivalent in time to the 3 others. – Kronstadt
`mov al, [rdi]` is indeed optimal on Sandybridge-family and non-Intel CPUs. It would cause a partial-register stall on P6-family. Perhaps nobody's taught GCC/clang to look for that as a peephole optimization for merging a byte because historically it was slow? But GCC actually is doing partial-register shenanigans with `xor al,al`; that's quite silly. But at least on Sandybridge-family, that doesn't have any extra partial-reg penalty. `or eax, edx` after that would stall on P6. – Corazoncorban
`-Os`. Compilers also don't seem to find a similar `BFI` instruction on ARM for the same purpose, instead using `BIC` + `ORR` – Patten
`unsigned` return value on PPro through Nehalem. On Sandybridge-family `mov al, [rdi]` would have a true dependency on RAX and do the merge then, as a micro-fused load+ALU uop. (Except maybe on first-gen Sandybridge itself which does still rename AL separately from RAX when the access isn't an RMW.) – Corazoncorban
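
To experiment with the byte-merging load discussed in these comments, one option is to force it with GNU extended inline asm. This is a rough sketch rather than a recommendation: `insert_asm` is a hypothetical name, the asm path assumes GCC or Clang on x86/x86-64 with the default AT&T asm dialect, and other targets fall back to the mask-and-OR form from the question.

```c
/* Hypothetical helper (not from the original thread): force a byte load
   that merges into the low byte of a 32-bit register. */
unsigned insert_asm(const unsigned* a, const unsigned char* b)
{
    unsigned x = *a;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* AT&T syntax: movb <mem>, <low byte of x>.
       "+q" asks for a register with an addressable low byte and marks x
       as both read and written; %b0 prints that register's byte name. */
    __asm__("movb %1, %b0" : "+q"(x) : "m"(*b));
#else
    /* Portable fallback for other compilers and architectures. */
    x = (x & ~255u) | *b;
#endif
    return x;
}
```

The asm statement is opaque to the optimizer, so this is mainly useful for benchmarking the instruction on a given microarchitecture, not as a replacement for the plain C versions.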