Why is using AVX ymm (m256) instructions ~4 times slower than xmm (m128)?

I wrote a program that multiplies arr1 by arr2 element-wise and saves the result to arr3.

Pseudocode:
arr3[i]=arr1[i]*arr2[i]

I want to use AVX instructions, so I have unrolled assembly code for both the m128 (xmm) and m256 (ymm) instruction widths. The results show that the ymm version is roughly 4 times slower than the xmm version. But why, if the latency is the same?

Mul_ASM_AVX proc ; (float* RCX=arr1, float* RDX=arr2, float* R8=arr3, int R9 = arraySize)

    push rbx

    vpxor xmm0, xmm0, xmm0 ; zero xmm0-xmm3 (they are overwritten by the loads below)
    vpxor xmm1, xmm1, xmm1
    vpxor xmm2, xmm2, xmm2
    vpxor xmm3, xmm3, xmm3

    mov rbx, r9
    sar r9, 4       ; divide the count by 16 (4 xmm registers * 4 floats each)
    jz MulResiduals ; if that's 0, we only have scalar multiplies left to perform

LoopHead:
    ;multiply 16 floats per iteration

    vmovaps xmm0    , xmmword ptr[rcx]
    vmovaps xmm1    , xmmword ptr[rcx+16]
    vmovaps xmm2    , xmmword ptr[rcx+32]
    vmovaps xmm3    , xmmword ptr[rcx+48]

    vmulps  xmm0, xmm0, xmmword ptr[rdx]
    vmulps  xmm1, xmm1, xmmword ptr[rdx+16]
    vmulps  xmm2, xmm2, xmmword ptr[rdx+32]
    vmulps  xmm3, xmm3, xmmword ptr[rdx+48]

    vmovaps xmmword ptr[R8],    xmm0
    vmovaps xmmword ptr[R8+16], xmm1
    vmovaps xmmword ptr[R8+32], xmm2
    vmovaps xmmword ptr[R8+48], xmm3

    add rcx, 64 ; move on to the next 16 floats (4*16=64)
    add rdx, 64
    add r8,  64

    dec r9
    jnz LoopHead

MulResiduals:
    and ebx, 15 ; do we have residuals?
    jz Finished ; If not, we're done

ResidualsLoopHead:
    vmovss xmm0, real4 ptr[rcx]
    vmulss xmm0, xmm0, real4 ptr[rdx]
    vmovss real4 ptr[r8], xmm0
    add rcx, 4
    add rdx, 4
    add r8,  4
    dec rbx
    jnz ResidualsLoopHead

Finished:
    pop rbx ; restore caller's rbx
    ret
Mul_ASM_AVX endp

And the version with m256 (ymm) instructions:

Mul_ASM_AVX_YMM proc ; UNROLLED AVX

    push rbx

    vzeroupper
    mov rbx, r9
    sar r9, 5       ; divide the count by 32 (4 ymm registers * 8 floats each)
    jz MulResiduals ; if that's 0, we only have scalar multiplies left to perform

LoopHead:
    ;multiply 32 floats per iteration
    vmovaps ymm0, ymmword ptr[rcx] ; 8 floats each, 4*8 = 32
    vmovaps ymm1, ymmword ptr[rcx+32]
    vmovaps ymm2, ymmword ptr[rcx+64]
    vmovaps ymm3, ymmword ptr[rcx+96]

    vmulps ymm0, ymm0, ymmword ptr[rdx]
    vmulps ymm1, ymm1, ymmword ptr[rdx+32]
    vmulps ymm2, ymm2, ymmword ptr[rdx+64]
    vmulps ymm3, ymm3, ymmword ptr[rdx+96]

    vmovupd ymmword ptr[r8],    ymm0
    vmovupd ymmword ptr[r8+32], ymm1
    vmovupd ymmword ptr[r8+64], ymm2
    vmovupd ymmword ptr[r8+96], ymm3

    add rcx, 128    ; move on to the next 32 floats (4*32=128)
    add rdx, 128
    add r8,  128

    dec r9
    jnz LoopHead

MulResiduals:
    and ebx, 31 ; do we have residuals?
    jz Finished ; If not, we're done

ResidualsLoopHead:
    vmovss xmm0, real4 ptr[rcx]
    vmulss xmm0, xmm0, real4 ptr[rdx]
    vmovss real4 ptr[r8], xmm0
    add rcx, 4
    add rdx, 4
    add r8,  4
    dec rbx
    jnz ResidualsLoopHead

Finished:
    pop rbx ; restore caller's rbx
    ret
Mul_ASM_AVX_YMM endp

CPU-Z report:

  • Manufacturer: AuthenticAMD
  • Name: AMD FX-6300 Codename: Vishera
  • Specification: AMD FX(tm)-6300 Six-Core Processor
  • CPUID: F.2.0
  • Extended CPUID: 15.2
  • Technology: 32 nm
  • Instruction sets: MMX (+), SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
    SSE4A, x86-64, AMD-V, AES, AVX, XOP, FMA3, FMA4
Asked by Disposable on 11/2/2020 at 16:02. Comments (4):
If you're using C you can remove the masm and performance tags and add x86, c and assembly. (Effluence)
How much slower? Your question doesn't seem to have any numbers. (Pisa)
~4 times slower. (Disposable)
If your array size is not a multiple of 32, the ResidualsLoopHead section will be executed many more times in the YMM version. This becomes significant if the average size of your arrays is small. (Barmecide)

The cores in your old FX-6300 use the AMD Piledriver microarchitecture.

It decodes 256-bit instructions into two 128-bit uops (like all AMD CPUs before Zen 2), so you generally don't expect a speedup from AVX on that CPU, and 2-uop instructions can sometimes bottleneck the front-end. Unlike Bulldozer, though, it can decode a 2-2 pattern of uops in one cycle, so a sequence of 2-uop instructions can decode at a rate of 4 uops per clock, the same as a sequence of single-uop instructions.

Being able to run AVX instructions is still useful for avoiding movaps register-copy instructions, and for being able to run the same code as Intel CPUs (which do have 256-bit-wide execution units).
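
For illustration, here is a minimal sketch (not from the question's code) of the copy that the destructive 2-operand SSE encoding forces when you want a*b while keeping a, versus the non-destructive 3-operand AVX form:

    ; legacy SSE: mulps overwrites its destination, so preserving xmm0 needs a copy
    movaps xmm1, xmm0           ; copy a
    mulps  xmm1, xmm2           ; xmm1 = a * b

    ; AVX: 3-operand form writes a separate destination, no copy needed
    vmulps xmm1, xmm0, xmm2     ; xmm1 = a * b, xmm0 (a) unchanged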

Your problem is probably that Piledriver has a showstopper performance bug with 256-bit stores (not present in Bulldozer, fixed in Steamroller / Excavator). From Agner Fog's microarch PDF, in the Bulldozer-family section on the disadvantages of AVX on that microarchitecture:

The throughput of 256-bit store instructions is less than half the throughput of 128-bit store instructions on Bulldozer and Piledriver. It is particularly bad on the Piledriver, which has a throughput of one 256-bit store per 17 - 20 clock cycles

(vs. one 128-bit store per clock). I think this applies even to stores that hit in L1d cache. (Or in the write-combining buffer; Bulldozer-family uses a write-through L1d cache, and yes this is generally considered a design mistake.)

If that's the problem, using vmovups [mem], xmm and vextractf128 [mem], ymm, 1 should help a lot. You can experiment with keeping the rest of your loop 256-bit. (Then it should perform about equal to the 128-bit loop. You can reduce the unrolling to get the same amount of work in both loops and still effectively 4 dep chains, but with smaller code-size. Or keep it at 4 registers to get 8x 128-bit FP multiply dep chains, with each 256-bit register having two halves.)
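
For example, here is a sketch of what the store part of your YMM loop could look like with that change (same registers and offsets as your code; the loads and multiplies stay 256-bit):

    ; split each 256-bit store into two 128-bit stores to avoid
    ; Piledriver's slow 256-bit store path
    vmovups      xmmword ptr[r8],     xmm0    ; low half of ymm0
    vextractf128 xmmword ptr[r8+16],  ymm0, 1 ; high half of ymm0
    vmovups      xmmword ptr[r8+32],  xmm1
    vextractf128 xmmword ptr[r8+48],  ymm1, 1
    vmovups      xmmword ptr[r8+64],  xmm2
    vextractf128 xmmword ptr[r8+80],  ymm2, 1
    vmovups      xmmword ptr[r8+96],  xmm3
    vextractf128 xmmword ptr[r8+112], ymm3, 1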

Note that if you can only make either the loads or the stores aligned, prefer aligned stores. According to Agner's instruction tables, vmovapd [mem], ymm (17-cycle throughput, 4 uops) is not quite as bad as vmovupd [mem], ymm (20-cycle throughput, 8 uops). But both are horrible compared to a 2-uop, 1-cycle vextractf128 plus a 1-uop 128-bit vmovupd on Piledriver.


Another disadvantage (which doesn't apply to your code because it has no reg-reg vmovaps instructions):

128-bit register-to-register moves have zero latency, while 256-bit register-to-register moves have a latency of 2 clocks plus a penalty of 2-3 clocks for using a different domain (see below) on Bulldozer and Piledriver. Register-to-register moves can be avoided in most cases thanks to the non-destructive 3-operand instructions.

(The low 128 bits benefit from mov-elimination; the high 128 bits are moved separately with a back-end uop.)

Answered by Pisa on 11/2/2020 at 16:20. Comments (3):
Is there anything known about what specifically caused that perf bug in Piledriver? (Mensurable)
@harold: not that I've read. Good question, now I'm curious! (But not curious enough to google it right now. Maybe later.) (Pisa)
Agner does not typically document page-crossing cases, but he notes that on Bulldozer/Piledriver/Steamroller unaligned loads that cross (4 KiB) page boundaries execute at a throughput of 1 per 21 cycles. If I recall correctly, unaligned stores that cross page boundaries are much worse and significantly degrade throughput (even if the penalty for stores that cross other cache-line boundaries is not large). (Tuition)
