For what purpose does GCC create separate bitmasks applied to shifted items?

The following is a minimal reproducible example of code I use to generate an 'array' (whose 1-byte elements are packed into the resulting uint_fast64_t) of 3D coordinates within octree branches, given the x, z, and y positions:

#include <stdint.h>

/* m places a copy of the 8-bit input at bit offsets 0, 7, 14, ..., 49;
   a = 0x0101... then keeps only the low bit of each byte. */
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    static const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
    *coord = (x * m & a) | (z * m & a) << 1 | (y * m & a) << 2;
}
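
As a quick sanity check of what the multiply-and-mask does, here is a standalone demo (my own, not part of the original code): multiplying the 8-bit input by m places copies of it at bit offsets 0, 7, 14, ..., 49, so bit k of the input lands at bit 8k, and masking with a keeps exactly the low bit of each byte (for inputs like this one, where the overlapping shifted copies generate no carries):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
    const uint_fast8_t x = 0x35;                     /* 0b00110101 */
    /* bit k of x ends up in the low bit of byte k of the result */
    const uint_fast64_t spread = (uint_fast64_t)x * m & a;
    printf("%016llx\n", (unsigned long long)spread); /* prints 0000010100010001 */
    return 0;
}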

Looking at the assembly, GCC generates only one 'variant' of the m constant, but three variants of the a constant, including the shifted masks 0x202020202020202 and 0x404040404040404.

test:
        movabs  rax, 567382630219905 ; 0x2040810204081
        movzx   edx, dl
        movzx   esi, sil
        movzx   ecx, cl
        movabs  r8, 144680345676153346 ; 0x202020202020202
        imul    rdx, rax
        imul    rsi, rax
        imul    rcx, rax
        movabs  rax, 289360691352306692 ; 0x404040404040404
        add     rdx, rdx
        and     rdx, r8
        movabs  r8, 72340172838076673 ; 0x101010101010101
        and     rsi, r8
        sal     rcx, 2
        or      rdx, rsi
        and     rcx, rax
        or      rdx, rcx
        mov     QWORD PTR [rdi], rdx
        ret

For whatever reason, GCC appears to 'constant-propagate' the << 1 and << 2 into these masks and store the shifted variants separately, when a single mask would suffice by doing the and first and the shift afterwards, as in the hand-written sketch below. This is what is confusing.
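
To illustrate, here is a hand-written sketch (not actual compiler output) of the and-first sequence I would have expected after the three imul instructions, reusing a single mask register:

        movabs  r8, 72340172838076673 ; 0x101010101010101, the only mask needed
        and     rsi, r8               ; x*m & a
        and     rdx, r8               ; z*m & a
        and     rcx, r8               ; y*m & a
        add     rdx, rdx              ; (z*m & a) << 1
        sal     rcx, 2                ; (y*m & a) << 2
        or      rsi, rdx
        or      rsi, rcx
        mov     QWORD PTR [rdi], rsi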

Clang, on the other hand, fully propagates the bitshifts into the constants, so the assembly contains six 64-bit constants but no shift instructions corresponding to the << 1 and << 2. This seems to be a speed optimization at the cost of size.
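
In source terms, clang's version corresponds roughly to the rewrite below (my reconstruction, not actual clang output): each shift is absorbed into both the multiplier and the mask, yielding six distinct 64-bit constants and no residual shifts.

    *coord = (x * m & a)
           | (z * (m << 1) & (a << 1))
           | (y * (m << 2) & (a << 2));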

But I am perplexed by GCC's handling: some of the constants are 'folded' and some are not, and the way it splits them provides no perceptible benefit.

My question is:

  • Is there, for some obscure reason, some advantage to performing the shift first and the and mask later, even at the expense of storing extra constants in the code?
  • If not, is there some hack or compiler flag I can use to circumvent this, and force GCC to simply and first and shift afterwards, to avoid storing those constants?

This is one of those cases where 'the compiler will optimize the code, just forget about it' does not really work, since this 'optimization' itself is what I find questionable.

I know that 16 bytes of opcode is 'not much', but I am still curious why GCC performs this 'optimization' despite it looking like a loss to an untrained eye. And this happens even with aggressive size optimizations.

Marshy asked 17/1 at 3:38 · Comments (14)
You may want to refactor, as clang reports a warning ('&' within '|' [-Wbitwise-op-parentheses]) and suggests placing parentheses around the '&' expression to silence it. – Edgebone
On most CPUs, a bitwise shift has higher latency than an AND. Maybe GCC optimizes it for this reason. – Furlong
@ImanAbdollahzadeh But I'd expect aggressive size optimization (-Oz) to override that. – Marshy
Does it help to declare a as volatile instead of static const? – Furlong
@ImanAbdollahzadeh That does prevent more than two constants from being stored, but I also notice, unsurprisingly, that there are then a lot of QWORD PTR accesses instead of register accesses. Not sure whether that is desirable from a performance perspective. (Demo) – Marshy
@ImanAbdollahzadeh Which processors are included under "most CPUs"? Not anything I know of; e.g. shl by a constant has a latency of 1 cycle on every relevant and halfway-relevant x64 processor (and and doesn't take less than a cycle). – Archiepiscopal
@harold AMD K8, AMD K10, AMD Bulldozer, AMD Piledriver, Steamroller, ... – Furlong
@ImanAbdollahzadeh Why might that be? It's just shifting some bits around. – Marshy
@ImanAbdollahzadeh Agner Fog lists shl as 1 cycle on all of those (sadly they're not included in uops.info/table.html). – Archiepiscopal
@harold Didn't I talk about latency, not uops per cycle, in my very first comment? – Furlong
@SupportUkraine The point is that I had a naive guess that GCC might check the latency of an instruction for a specific CPU and hence optimize the operation for that chip. – Furlong
@ImanAbdollahzadeh I don't know what you're saying now, but presumably you know this document, and checking the latency column for shl on K10 there's a 1 there (so 1 cycle). And you were talking about latency. In terms of throughput it is true that shl is a little more expensive than and (on some processors, not K10 specifically). – Archiepiscopal
Compilers are complex pieces of machinery that do great things in the big picture, but it's not rare for them to make bad choices on the small scale. It's also not rare that -Os vs. -O3 doesn't control every optimization choice; it just biases some specific heuristics that choose to check it. Some optimization passes don't change their behaviour between -Os and -O3, including, apparently, whatever did this. – Priddy
Note: I also noticed that older versions of GCC produce the 'expected' code: godbolt.org/z/PjqM5xTE4 – Marshy

I can only speculate that the GCC code generator is simply programmed to always compute bit masks relative to the final bit positions, even when that increases the number of distinct masks.

GCC has other heuristics like this as well, such as reducing an immediate by 1 when an inequality comparison allows it: if (a < 2) is transformed into if (a <= 1), which does not make sense if one also needs to compute if (a == 2) for some other use, as sketched below.
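
A minimal sketch of that situation (a made-up function, purely for illustration):

#include <stdint.h>

int classify(uint32_t a) {
    if (a < 2)      /* GCC tends to rewrite this comparison as a <= 1 */
        return 0;
    if (a == 2)     /* ...even though the constant 2 is needed here anyway */
        return 1;
    return 2;
}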


One can in any case prevent GCC and clang from making such transformations with an optimisation barrier, asm("" : "+r"(a)), combined with keeping the constants in non-const variables.

The barrier tells the compiler that the register containing a may be modified by the asm statement, so a no longer holds a known constant; consequently a << 1 and a << 2 are no longer constants derivable from a.

#include <stdint.h>

void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
    asm("" : "+r"(a)); /* pretend a is modified, so its shifted forms cannot be precomputed */
    uint_fast64_t xm = x * m & a;
    uint_fast64_t ym = y * m & a;
    uint_fast64_t zm = z * m & a;
    *coord = xm | (zm << 1) | (ym << 2);
}

In this particular case one can apparently also use

#include <stdint.h>

void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
    static const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
    /* + with *2 and *4 instead of | with << 1 and << 2; the masked bit
       positions are disjoint, so addition and OR give the same result */
    *coord = (x * m & a) + (z * m & a) * 2 + (y * m & a) * 4;
}

which compiles to

test:
        movabs  r8, 567382630219905
        movzx   ecx, cl
        movzx   edx, dl
        movabs  rax, 72340172838076673
        imul    rcx, r8
        movzx   esi, sil
        imul    rdx, r8
        imul    rsi, r8
        and     rcx, rax
        add     rcx, rcx
        and     rdx, rax
        add     rcx, rdx
        and     rsi, rax
        add     rcx, rcx
        add     rcx, rsi
        mov     QWORD PTR [rdi], rcx
        ret

In this case I would actually have expected the lea rax, [rax+rbx*4] form to be used instead of two separate add rcx, rcx instructions for the left shifts, as the latter accumulates a longer dependency chain than necessary.
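
Something along these lines (a hand-written sketch, not compiler output) would fold the shifts into lea and shorten the dependency chain:

        and     rcx, rax          ; y*m & a
        and     rdx, rax          ; z*m & a
        and     rsi, rax          ; x*m & a
        lea     rcx, [rdx+rcx*2]  ; (z*m & a) + 2*(y*m & a)
        lea     rcx, [rsi+rcx*2]  ; (x*m & a) + 2*(z*m & a) + 4*(y*m & a)
        mov     QWORD PTR [rdi], rcx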

Tsui answered 17/1 at 13:53 · Comments (3)
I would just guess that clang/GCC automatically do constant folding whenever possible, as a strategy or heuristic to create more opportunities for instruction-level parallelism and shorter dependency chains. Clang just does it "better". We can see that (z * m & a) * 2 == z * (2*m) & (2*a): three operations are reduced to two. What seems to be missing is a proper cost model for the generation of the constants -- on ARM64, for instance, 0x01010101... can be embedded into many instructions, but 0x804020... requires four mov/movk instructions. – Tsui
Is there a way that having these separate bitmasks enables instruction-level parallelism in a way that reusing the same one does not? – Marshy
I can't see it immediately -- other than that constant generation can overlap with everything. If there are long dependency chains with long latencies and "empty bubbles" in the pipeline, an out-of-order scheduler can always slot a constant generation into them. Personally I'd like to see the constants being reused as intended/written by the programmer. – Tsui
