Array initialization optimization
Asked Answered
L

1

33

When compiling the following code snippet (clang x86-64 -O3)

std::array<int, 5> test()
{
    std::array<int, 5> values {{0, 1, 2, 3, 4}};
    return values;
}

It produced the typical assembly that I would expect

test():                               # @test()
        mov     rax, rdi
        mov     ecx, dword ptr [rip + .L__const.test().values+16]
        mov     dword ptr [rdi + 16], ecx
        movups  xmm0, xmmword ptr [rip + .L__const.test().values]
        movups  xmmword ptr [rdi], xmm0
        ret
.L__const.test().values:
        .long   0                       # 0x0
        .long   1                       # 0x1
        .long   2                       # 0x2
        .long   3                       # 0x3
        .long   4                       # 0x4

However for small arrays, it seems to have figured out a trick?

std::array<int, 3> test()
{
    std::array<int, 3> values {{0, 1, 2}};
    return values;
}

This was the corresponding assembly

test():                               # @test()
        movabs  rax, 4294967296
        mov     edx, 2
        ret

Where did that magic number (4294967296) come from? Is that essentially a value that can be reinterpret_cast back into an array of int somehow?

Lassiter answered 19/6, 2019 at 12:56 Comment(1)
On Godbolt you can mouseover a number to see it in Hex. If you look at asm often, you're going to get used to 4294967... being about 2^32 and thus a huge clue that you should look at the hex to see the upper/lower 32 bits. (Or for numbers just below 2^32, that it's actually a negative 32-bit integer.)Unwitting
P
40

A std::array<int, 3> is 96 bits wide on your implementation. As such the ABI declares that it should be returned in RAX + the low 32 bits of RDX (aka EDX).

4294967296 is 232, in hex it is $1'0000'0000. So the movabs stores 0 in the low order 32 bits of RAX, and 1 in the high order bits of RAX. The mov stores 2 in EDX (which is exactly what you wanted).

Prairial answered 19/6, 2019 at 13:5 Comment(5)
D'oh! Thanks sasha for proposing the edit, and caf for doing it.Prairial
Fun fact: with BMI2 a probably more optimal way to compile this would be mov edx, 2 and then rorx rax, rdx, 33 to create 1<<32 from the 1<<1. movabs is so big (10 bytes) that it's slow to decode and can take extra space in the uop cache. rorx is 6 bytes (3-byte VEX + opcode + modrm + imm8) but its immediate is small. It does have a dependency on the mov-immediate and can only run on the shift ports on Intel CPUs, but normally that's not a bottleneck (especially in a code path that leads to a ret, not contained in a tight loop)Unwitting
@PeterCordes I think the optimizer can be forgiven for not spotting this particular possibility - it's very specific to the particular data pattern!Prairial
clang does look for things like lea r64, [reg + disp8] to create a 2nd large constant instead of 2x movabs. e.g. see Signed saturated add of 64-bit ints? where clang uses movabs rcx, 9223372036854775807 ; lea rax, [rcx + 1] to materialize INT64_MAX and INT64_MIN in RCX and RAX respectively, in versions of the function that need both. godbolt.org/z/iD6Ml8 GCC just uses 2x movabs even with -OsUnwitting
GCC and clang both look for optimizations like that in some cases: godbolt.org/z/wctons like using a sub rax, 120 to create a 2nd nearby constant instead of using another movabs. They're not very good at it, though, and yeah probably aren't looking for the possibility of rotating :PUnwitting

© 2022 - 2024 — McMap. All rights reserved.