Why does p1007r0 std::assume_aligned remove the need for epilogue?

My understanding is that vectorization of code works something like this:

For data in the array below the first address that is a multiple of 128 (or 256, or whatever the SIMD instructions require), do slow element-by-element processing. Let's call this the prologue.

For data in the array between the first address that is a multiple of 128 and the last address that is a multiple of 128, use SIMD instructions.

For the data between the last address that is a multiple of 128 and the end of the array, use slow element-by-element processing. Let's call this the epilogue.
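In code, the three phases described above might be sketched like this (a hand-rolled illustration, not what a compiler literally emits; the function name `add_one` and the 4-int chunk standing in for one 128-bit SIMD vector are made up for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of prologue / main loop / epilogue for an
// "add 1 to every element" loop; a 4-int chunk stands in for one
// 128-bit SIMD vector.
void add_one(int* p, std::size_t n) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    std::size_t i = 0;
    // Prologue: scalar elements until p + i reaches a 16-byte boundary.
    while (i < n && (addr + i * sizeof(int)) % 16 != 0) { p[i] += 1; ++i; }
    // Main loop: whole 4-int chunks (a compiler would emit SIMD here).
    for (; i + 4 <= n; i += 4) {
        p[i] += 1; p[i + 1] += 1; p[i + 2] += 1; p[i + 3] += 1;
    }
    // Epilogue: scalar processing of the 0..3 leftover elements.
    for (; i < n; ++i) p[i] += 1;
}
```

If the compiler can prove the pointer is aligned, the prologue disappears; if it can also prove the count is a multiple of the chunk size, the epilogue disappears too.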

Now I understand why std::assume_aligned helps with the prologue, but I do not get why it also enables the compiler to remove the epilogue.

Quote from proposal:

If we could make this property visible to the compiler, it could skip the loop prologue and epilogue

Magnification answered 17/5, 2018 at 22:21 Comment(2)
It allows removing the epilogue when you also have size information (and the size is generally chosen to work well with SIMD).Genro
Even if you know the size is a multiple of 128 (say), if you didn't know the alignment was good you'd always need an epilogue. If you know both the alignment and the size are good, you can skip both. The size may already be known, so this opens the possibility that both are good.Udo

You can see the effect on code-gen from using GNU C / C++ __builtin_assume_aligned.

gcc 7 and earlier targeting x86 (and ICC18) prefer to use a scalar prologue to reach an alignment boundary, then an aligned vector loop, then a scalar epilogue to clean up any leftover elements that weren't a multiple of a full vector.

Consider the case where the total number of elements is known at compile time to be a multiple of the vector width, but the alignment isn't known. If you know the alignment, you need neither a prologue nor an epilogue. But if not, you need both: the number of left-over elements after the last aligned vector is not known.

This Godbolt compiler explorer link shows these functions compiled for x86-64 with ICC18, gcc7.3, and clang6.0. clang unrolls very aggressively, but still uses unaligned stores. This seems like a weird way to spend that much code-size for a loop that just stores.

// aligned, and size a multiple of vector width
void set42_aligned(int *p) {
    p = (int*)__builtin_assume_aligned(p, 64);
    for (int i=0 ; i<1024 ; i++ ) {
        *p++ = 0x42;
    }
}

 # gcc7.3 -O3   (arch=tune=generic for x86-64 System V: p in RDI)

    lea     rax, [rdi+4096]              # end pointer
    movdqa  xmm0, XMMWORD PTR .LC0[rip]  # set1_epi32(0x42)
.L2:                                     # do {
    add     rdi, 16
    movaps  XMMWORD PTR [rdi-16], xmm0
    cmp     rax, rdi
    jne     .L2                          # }while(p != endp);
    rep ret

This is pretty much exactly what I'd do by hand, except maybe unrolling by 2 so OoO exec could discover the loop-exit branch being not-taken while still chewing on the stores.
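For illustration, that unroll-by-2 idea might be sketched in plain C++ like this (a hypothetical hand-written variant; each group of four stores stands in for one 16-byte vector store, so the loop-exit branch executes half as often relative to the stores):

```cpp
#include <cassert>

// Hypothetical unroll-by-2 of the set42_aligned loop: two 4-int
// "vectors" (eight ints) per iteration, one branch check per pair.
void set42_unroll2(int* p) {
    for (int i = 0; i < 1024; i += 8) {
        p[i + 0] = 0x42; p[i + 1] = 0x42; p[i + 2] = 0x42; p[i + 3] = 0x42; // vector 1
        p[i + 4] = 0x42; p[i + 5] = 0x42; p[i + 6] = 0x42; p[i + 7] = 0x42; // vector 2
    }
}
```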

Thus the unaligned version includes a prologue and epilogue:

// without any alignment guarantee
void set42(int *p) {
    for (int i=0 ; i<1024 ; i++ ) {
        *p++ = 0x42;
    }
}

~26 instructions of setup, vs. 2 for the aligned version:

.L8:            # then a bloated loop with 4 uops instead of 3
    add     eax, 1
    add     rdx, 16
    movaps  XMMWORD PTR [rdx-16], xmm0
    cmp     ecx, eax
    ja      .L8               # end of main vector loop

 # epilogue:
    mov     eax, esi    # then destroy the counter we spent an extra uop on inside the loop.  /facepalm
    and     eax, -4
    mov     edx, eax
    sub     r8d, eax
    cmp     esi, eax
    lea     rdx, [r9+rdx*4]   # recalc a pointer to the last element, maybe to avoid a data dependency on the pointer from the loop.
    je      .L5
    cmp     r8d, 1
    mov     DWORD PTR [rdx], 66      # fully-unrolled final up-to-3 stores
    je      .L5
    cmp     r8d, 2
    mov     DWORD PTR [rdx+4], 66
    je      .L5
    mov     DWORD PTR [rdx+8], 66
.L5:
    rep ret

Even for a more complex loop which would benefit from a little bit of unrolling, gcc leaves the main vectorized loop not unrolled at all, but spends boatloads of code-size on fully-unrolled scalar prologue/epilogue. It's really bad for AVX2 256-bit vectorization with uint16_t elements (up to 15 elements in the prologue/epilogue, rather than 3). This is not a smart tradeoff, so it helps gcc7 and earlier significantly to tell it when pointers are aligned. (The execution speed doesn't change much, but it makes a big difference for reducing code-bloat.)


BTW, gcc8 favours using unaligned loads/stores, on the assumption that data often is aligned. Modern hardware has cheap unaligned 16 and 32-byte loads/stores, so letting the hardware handle the cost of loads/stores that are split across a cache-line boundary is often good. (AVX512 64-byte stores are often worth aligning, because any misalignment means a cache-line split on every access, not every other or every 4th.)

Another factor is that earlier gcc's fully-unrolled scalar prologues/epilogues are crap compared to smart handling where you do one unaligned potentially-overlapping vector at the start/end. (See the epilogue in this hand-written version of set42). If gcc knew how to do that, it would be worth aligning more often.
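The potentially-overlapping-vector trick mentioned above might be sketched like this (a hypothetical `set42_overlap`, with `memcpy` of 16 bytes standing in for one unaligned vector store; it works because the stores are idempotent, so overlapping the final vector with the previous one is harmless):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sketch: handle the tail with one unaligned, possibly
// overlapping "vector" store instead of a fully-unrolled scalar epilogue.
// Requires n >= 4.
void set42_overlap(int* p, std::size_t n) {
    const int four[4] = {0x42, 0x42, 0x42, 0x42};
    std::size_t i = 0;
    // Main loop: whole 4-int "vectors" (memcpy stands in for a 16-byte store).
    for (; i + 4 <= n; i += 4)
        std::memcpy(p + i, four, sizeof four);
    // Tail: one final vector ending exactly at p + n; it may overlap
    // elements the main loop already wrote, which is fine for stores.
    if (i < n)
        std::memcpy(p + (n - 4), four, sizeof four);
}
```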

Tithonus answered 18/5, 2018 at 1:25 Comment(2)
1 huge problem: you store hex 42, not decimal :P Other than that, good answer, but a bit of overkill (the simple answer to my Q seems to be that often the length of the array is known to be a multiple of 64/128/256...).Magnification
@NoSenseEtAl: If your question had said which part of the complete answer you were missing, I could have made it simpler :P Also, even if you do need an epilogue, it can often be significantly simpler, especially for a fixed iteration count. e.g. i<1026 would need just one more movq or movsd store to do the last 2 elements, instead of needing to handle 0..3 elements. (And BTW, this isn't the real answer to life, the universe, and everything, just a related example :)Tithonus

This is discussed in the document itself in Section 5:

A function that returns a pointer T*, and guarantees that it will point to over-aligned memory, could return like this:

T* get_overaligned_ptr()
{
    // code...
    return std::assume_aligned<N>(_data);
}

This technique can be used e.g. in the begin() and end() implementations of a class wrapping an over-aligned range of data. As long as such functions are inline, the over-alignment will be transparent to the compiler at the call-site, enabling it to perform the appropriate optimisations without any extra work by the caller.

The begin() and end() methods are data accessors for the over-aligned buffer _data. That is, begin() returns a pointer to the first byte of the buffer and end() returns a pointer to one byte past the last byte of the buffer.

Suppose they are defined as follows:

T* begin()
{
    // code...
    return std::assume_aligned<N>(_data);
}
T* end()
{
    // code...
    return _data + size; // No alignment hint!
}

In this case, the compiler may not be able to eliminate the epilogue. But if they were defined as follows:

T* begin()
{
    // code...
    return std::assume_aligned<N>(_data);
}
T* end()
{
    // code...
    return std::assume_aligned<N>(_data + size);
}

Then the compiler would be able to eliminate the epilogue. For example, if N is 128 bits (16 bytes), then every single 128-bit chunk of the buffer is guaranteed to be 128-bit aligned. Note that this is only possible when the size of the buffer is a multiple of the alignment.
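To illustrate what the compiler can then do: when both endpoints are 16-byte aligned, the distance between them is a whole number of 4-int "vectors", so the whole range can be walked in full vector steps with no scalar tail. A minimal sketch (the function name `zero_range` is made up; its precondition is that both pointers are 16-byte aligned):

```cpp
#include <cassert>

// Sketch: both begin and end are assumed 16-byte aligned, so
// (end - begin) is a multiple of 4 ints and no epilogue is needed.
void zero_range(int* begin, int* end) {
    for (int* p = begin; p != end; p += 4) {      // whole vectors only
        p[0] = 0; p[1] = 0; p[2] = 0; p[3] = 0;   // would be one SIMD store
    }
}
```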

Enamor answered 17/5, 2018 at 23:19 Comment(0)