The following is a minimal reproducible example of code that I had to generate an 'array' (whose 1-byte elements are packed into the resulting uint_fast64_t
) of 3D coordinates within octree branches given x
z
and y
positions:
#include <stdint.h>
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
static const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
*coord = (x * m & a) | (z * m & a) << 1 | (y * m & a) << 2;
}
Looking at the assembly, GCC appears to only generate one 'variant' of the m
constant, but three variants
of the a
constant, including 0x404040404040404
and 0x202020202020202
.
test:
movabs rax, 567382630219905 ; 0x2040810204081
movzx edx, dl
movzx esi, sil
movzx ecx, cl
movabs r8, 144680345676153346 ; 0x202020202020202
imul rdx, rax
imul rsi, rax
imul rcx, rax
movabs rax, 289360691352306692 ; 0x404040404040404
add rdx, rdx
and rdx, r8
movabs r8, 72340172838076673 ; 0x101010101010101
and rsi, r8
sal rcx, 2
or rdx, rsi
and rcx, rax
or rdx, rcx
mov QWORD PTR [rdi], rdx
ret
For whatever reason, GCC appears to be 'constant propagating' the << 1
and << 2
to these masks, and storing them separately, when the same mask could just be used by doing the and
first and then the bitshift. This is what is confusing.
Clang, on the other hand, fully propagates the bitshifts to the constants, so the assembly contains 6 of the 64 bit constants, but no shift operations corresponding to the << 1
and << 2
. This seems to be a speed optimization at the cost of size.
But I am perplexed by the GCC handling. Some of the constants are 'folded' but some are not, and the manner in which they are not providing any perceptible benefit.
My question is:
- Is there, for some obscure reason, some advantage to performing the shift first and the
and
mask later, even at the expense of storing extra constants in the code? - If not, is there some hack or compiler flag I can use to circumvent this, and force GCC to simply
and
first and shift afterwards, to avoid storing those constants?
This is one of those cases when 'The compiler will optimize the code, just forget about it.' does not really work, since this 'optimization' itself is what I find questionable.
I know that 16 bytes of opcode is 'not much' but I am still curious as to why GCC performs this 'optimization' despite seeming like a lose to an untrained eye. And this even happens with aggressive size optimizations.
clang
reports a warning:&' within '|' [-Wbitwise-op-parentheses]
andplace parentheses around the '&' expression to silence this
– Edgebone-Oz
to override that. – Marshya
asvolatile
instead ofstatic const
? – FurlongQWORD PTR
accesses instead of register accesses. Not sure if that is desirable from a performance perspective. (Demo) – Marshyshl
by a constant has a latency of 1 cycle on every relevant and halfway-relevant x64 processor (andand
doesn't take less than a cycle) – Archiepiscopalshl
as 1 cycle on all of those (they're not included in uops.info/table.html sadly) – Archiepiscopallatency
and notupos' cycle
in my really first comment? – Furlonglatency
of an instruction for a specific CPU and hence optimizes the operation on that chip. – Furlongshl
on K10 there's a 1 there (so 1 cycle). And you were talking about latency. In terms of throughput, it is true thatshl
is a little bit more expensive thanand
(on some processors, not K10 specifically) – Archiepiscopal-Os
vs.-O3
doesn't control every optimization choice; I think it just biases some specific choices / heuristics that choose to check on that. Some optimization passes don't change their behaviour for-Os
vs.-O3
, including apparently whatever did this. – Priddy