Fastest way to multiply an array of int64_t?

I want to vectorize the multiplication of two memory-aligned arrays. I didn't find any way to do a 64×64-bit multiply in AVX/AVX2, so I just did loop unrolling and AVX2 loads/stores. Is there a faster way to do this?

Note: I don't want to save the high-half result of each multiplication.

void multiply_vex(long *Gi_vec, long q, long *Gj_vec){

    int i;
    __m256i data_j, data_i;
    __uint64_t *ptr_J = (__uint64_t*)&data_j;
    __uint64_t *ptr_I = (__uint64_t*)&data_i;


    for (i=0; i<BASE_VEX_STOP; i+=4) {
        data_i = _mm256_load_si256((__m256i*)&Gi_vec[i]);
        data_j = _mm256_load_si256((__m256i*)&Gj_vec[i]);

        ptr_I[0] -= ptr_J[0] * q;
        ptr_I[1] -= ptr_J[1] * q;
        ptr_I[2] -= ptr_J[2] * q;
        ptr_I[3] -= ptr_J[3] * q;

        _mm256_store_si256((__m256i*)&Gi_vec[i], data_i);
    }


    for (; i<BASE_DIMENSION; i++)
        Gi_vec[i] -= Gj_vec[i] * q;
}

UPDATE: I'm using the Haswell microarchitecture with both the ICC and GCC compilers, so both AVX and AVX2 are fine. I substituted the -= with the C intrinsic _mm256_sub_epi64 after the multiplication loop-unroll, which gave some speedup. Currently, it is ptr_J[0] *= q; ...

I used __uint64_t, but that was an error. The right data type is __int64_t.

Dandiprat answered 18/5, 2016 at 10:1 Comment(9)
If you do it this way you are taking a huge penalty for moving data between SIMD registers and ALU registers. It's just not worth it.Cephalalgia
gcc generates Karatsuba-ish code using 3 32x32→64 multiplications, 3 32-bit-shifts and two adds. Looks to be fairly good for ILP, too.Cardenas
@Cardenas It isn't quite Karatsuba-ish. The 64x64 to low 64-bit multiply doesn't need the top half. So you don't need the high x high multiply at all. That leaves the other 3.Weider
@EOF: I only see two 64bit shifts (with a count of 32) in gcc5.3's auto-vectorized output. There are two adds, though. (The sub in the output is from the -=). Anyway, on Haswell, it's potentially slower than an unrolled hand-tuned scalar asm loop (load/imul/sub-with-memory-dest), bottlenecking on load/store throughput after unrolling. With C intrinsics and compiler output, you're probably going to get faster code with vectors, though.Cerumen
@davmac: for gcc, __m256i is defined with a "may alias" attribute, which is why it does what the user expects with gcc. You're absolutely right that it's not safe or a good idea in general. A union with a struct is a better idea.Cerumen
This may be of interest.Galacto
It's not clear to me if you want an AVX or AVX2 solution (or both). It makes a big difference.Galacto
@Cephalalgia Thank you for the note. @PeterCordes I substituted the -= with the C intrinsic _mm256_sub_epi64 after the multiplication loop-unroll. Right now I have ptr_J[0] *= q;. I get a speedup from this. @Zboson I'm using a Haswell microarchitecture, so I can use both.Demodulation
@HélderGonçalves: you'll get a bigger speedup from letting gcc auto-vectorize the whole thing, or using the intrinsics code from my answer. Or even from plain scalar code without vectorization. Your code still has to extract and insert into the ymm vector. (It's probably a slowdown. Did you test vs. a baseline?) Also, don't use __int64_t directly. Just #include <stdint.h> and use int64_t so your code is portable. Fortunately, the low 64bits result of a 64bx64b multiply is the same whether the input is signed or unsigned, so the code gives identical results either way.Cerumen

You seem to be assuming long is 64bits in your code, but then using __uint64_t as well. In 32bit, the x32 ABI, and on Windows, long is a 32bit type. Your title mentions long long, but then your code ignores it. I was wondering for a while if your code was assuming that long was 32bit.

You're completely shooting yourself in the foot by using AVX256 loads but then aliasing a pointer onto the __m256i to do scalar operations. gcc just gives up and gives you the terrible code you asked for: vector load and then a bunch of extract and insert instructions. Your way of writing it means that both vectors have to be unpacked to do the sub in scalar as well, instead of using vpsubq.

Modern x86 CPUs have very fast L1 cache that can handle multiple operations per clock (Haswell and later: two loads plus one store per clock). Doing multiple scalar loads from the same cache line is better than a vector load and unpacking. (Imperfect uop scheduling reduces the throughput to about 84% of that, though: see below.)


gcc 5.3 -O3 -march=haswell (Godbolt compiler explorer) auto-vectorizes a simple scalar implementation pretty well. When AVX2 isn't available, gcc foolishly still auto-vectorizes with 128b vectors: On Haswell, this will actually be about 1/2 the speed of ideal scalar 64bit code. (See the perf analysis below, but substitute 2 elements per vector instead of 4).

#include <stdint.h>    // why not use this like a normal person?
#define BASE_VEX_STOP 1024
#define BASE_DIMENSION 1028

// restrict lets the compiler know the arrays don't overlap,
// so it doesn't have to generate a scalar fallback case
void multiply_simple(uint64_t *restrict Gi_vec, uint64_t q, const uint64_t *restrict Gj_vec){
    for (intptr_t i=0; i<BASE_DIMENSION; i++)   // gcc doesn't manage to optimize away the sign-extension from 32bit to pointer-size in the scalar epilogue to handle the last less-than-a-vector elements
        Gi_vec[i] -= Gj_vec[i] * q;
}

inner loop:

.L4:
    vmovdqu ymm1, YMMWORD PTR [r9+rax]        # MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
    add     rcx, 1    # ivtmp.30,
    vpsrlq  ymm0, ymm1, 32      # tmp174, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B],
    vpmuludq        ymm2, ymm1, ymm3        # tmp173, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B], vect_cst_.25
    vpmuludq        ymm0, ymm0, ymm3        # tmp176, tmp174, vect_cst_.25
    vpmuludq        ymm1, ymm4, ymm1        # tmp177, tmp185, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
    vpaddq  ymm0, ymm0, ymm1    # tmp176, tmp176, tmp177
    vmovdqa ymm1, YMMWORD PTR [r8+rax]        # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B]
    vpsllq  ymm0, ymm0, 32      # tmp176, tmp176,
    vpaddq  ymm0, ymm2, ymm0    # vect__13.24, tmp173, tmp176
    vpsubq  ymm0, ymm1, ymm0    # vect__14.26, MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__13.24
    vmovdqa YMMWORD PTR [r8+rax], ymm0        # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__14.26
    add     rax, 32   # ivtmp.32,
    cmp     rcx, r10  # ivtmp.30, bnd.14
    jb      .L4 #,

Translate that back to intrinsics if you want, but it's going to be a lot easier to just let the compiler autovectorize. I didn't try to analyse it to see if it's optimal.

If you don't usually compile with -O3, you could use #pragma omp simd before the loop (and -fopenmp).
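For example, a minimal sketch of where the pragma goes (the function name is just for illustration); build with -fopenmp, or with -fopenmp-simd if you only want the SIMD pragmas handled without OpenMP threading:

#include <stdint.h>
#define BASE_DIMENSION 1028   // as above

void multiply_simple_omp(uint64_t *restrict Gi_vec, uint64_t q, const uint64_t *restrict Gj_vec){
    #pragma omp simd                          // ask the compiler to vectorize this loop
    for (intptr_t i=0; i<BASE_DIMENSION; i++)
        Gi_vec[i] -= Gj_vec[i] * q;
}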

Of course, instead of a scalar epilogue, it would prob. be faster to do an unaligned load of the last 32B of Gj_vec, and store into the last 32B of Gi_vec, potentially overlapping with the last store from the loop. (A scalar fallback is still needed if the arrays are smaller than 32B.)

GCC13 and Clang17 still both vectorize this way, using another two vpmuludq instead of 1x vpmulld to get both 32-bit cross products done with slightly less shuffling. But they are able to optimize away the insert/extract in the OP's pointer-aliasing ptr_I[0] -= ptr_J[0] * q;, so that compiles to the same asm loop as the simple scalar version.


Improved vector intrinsic version for AVX2

From my comments on Z Boson's answer. Based on Agner Fog's vector class library code, with various improvements in 2023 along with a bugfix. (For 8 years, nobody noticed that I was adding the cross products to the bottom of the full result, not to the top half to match their place-values; thanks @Petr for actually testing this.)

Agner Fog's version saves an instruction but bottlenecks on the shuffle port by using phadd + pshufd where I use psllq / pand / paddd. vpmulld to get both cross products packed together is 2 uops on Haswell and later Intel (same as two separate vpmuludq instructions but with less shuffling of inputs), but only 1 uop for either of two ports on Zen 3 and later AMD.

Since one of your operands is constant, make sure to pass set1(q) as b, not a, so the "b_swap" shuffle can be hoisted.

// replace hadd -> shuffle (4 uops) with shift/and/add (3 uops with less shuffle-port pressure)
// The constant takes 2 insns to generate outside a loop.
__m256i mul64_avx2 (__m256i a, __m256i b)
{
    // There is no vpmullq until AVX-512. Split into 32-bit multiplies
    // Given a and b composed of high<<32 | low  32-bit halves
    // a*b = a_low*(u64)b_low  + (u64)(a_high*b_low + a_low*b_high)<<32;  // same for signed or unsigned a,b since we aren't widening to 128
    // the a_high * b_high product isn't needed for non-widening; its place value is entirely outside the low 64 bits.

    __m256i b_swap  = _mm256_shuffle_epi32(b, _MM_SHUFFLE(2,3, 0,1));   // swap H<->L
    __m256i crossprod  = _mm256_mullo_epi32(a, b_swap);                 // 32-bit L*H and H*L cross-products

    __m256i prodlh = _mm256_slli_epi64(crossprod, 32);          // bring the low half up to the top of each 64-bit chunk 
    __m256i prodhl = _mm256_and_si256(crossprod, _mm256_set1_epi64x(0xFFFFFFFF00000000)); // isolate the other, also into the high half where it needs to eventually be
    __m256i sumcross = _mm256_add_epi32(prodlh, prodhl);       // the sum of the cross products, with the low half of each u64 being 0.

    __m256i prodll  = _mm256_mul_epu32(a,b);                  // widening 32x32 => 64-bit  low x low products
    __m256i prod    = _mm256_add_epi32(prodll, sumcross);     // add the cross products into the high half of the result
    return  prod;
}

This has been tested and now works, including for values with non-zero high and low halves. See it on Godbolt.

Note that this doesn't include the final subtract from the question, only the multiply.
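As a usage sketch (untested illustration; only mul64_avx2 is from the code above, the rest is hypothetical), here is the question's loop written around this helper: q is broadcast once and passed as b so the b_swap shuffle can be hoisted, the subtraction uses _mm256_sub_epi64, and the scalar epilogue is replaced by the overlapping final vector suggested earlier. Because the operation is a read-modify-write, the tail inputs are loaded before the main loop, so the overlapped lanes are computed from the same original values and both stores agree. It assumes the arrays don't overlap, are 32-byte aligned (as in the question), and hold at least 4 elements.

#include <stdint.h>
#include <immintrin.h>
#define BASE_DIMENSION 1028                   // as in the earlier snippets

void multiply_vec_avx2(int64_t *Gi_vec, int64_t q, const int64_t *Gj_vec)
{
    const __m256i vq = _mm256_set1_epi64x(q);           // broadcast once; used as the "b" operand

    // Load the tail inputs up front, before the main loop can overwrite the overlap region.
    intptr_t last = BASE_DIMENSION - 4;
    __m256i gi_tail = _mm256_loadu_si256((const __m256i*)&Gi_vec[last]);
    __m256i gj_tail = _mm256_loadu_si256((const __m256i*)&Gj_vec[last]);

    intptr_t i;
    for (i = 0; i + 4 <= BASE_DIMENSION; i += 4) {
        __m256i gi = _mm256_load_si256((const __m256i*)&Gi_vec[i]);
        __m256i gj = _mm256_load_si256((const __m256i*)&Gj_vec[i]);
        gi = _mm256_sub_epi64(gi, mul64_avx2(gj, vq));   // mul64_avx2() defined above
        _mm256_store_si256((__m256i*)&Gi_vec[i], gi);
    }

    if (i < BASE_DIMENSION) {   // leftover elements: one final, possibly overlapping, vector
        gi_tail = _mm256_sub_epi64(gi_tail, mul64_avx2(gj_tail, vq));
        _mm256_storeu_si256((__m256i*)&Gi_vec[last], gi_tail);
    }
}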

This version should perform a bit better on Haswell than gcc's autovectorized version: maybe one vector per 4 cycles instead of one vector per 5 cycles, bottlenecked on port-0 throughput. (Replacing vpsllq (slli_epi64) with vpshufb using a vector constant, or with vpslldq by 4 and doing the masking after adding, can get that port-0 bottleneck down to 3 cycles.) Either way is fine on later CPUs like Skylake and Rocket Lake, where https://uica.uops.info/ estimates a throughput bottleneck of 2.67 cycles based on execution-port pressure (for just the multiply, without a loop or subtract), with perfect scheduling. I went with the shift version so it didn't need an extra vector constant, and so the shift and AND can run in parallel for better critical-path latency in case this is part of a longer dependency chain. That doesn't matter in the question itself, where we don't do much else with each element between load and store, so out-of-order exec doesn't have to work hard to overlap across iterations.

We can avoid vector constants entirely by doing sumcross = ((crossprod >> 32) + crossprod) << 32. That still takes 3 instructions, but the AND becomes a shift, so it's another uop that competes with multiplies (on Intel P-cores). And it has worse critical-path latency. But it avoids the constant, and uses fewer temporary registers (for the same reason it has worse ILP and higher latency), so could be useful if multiplying as part of a larger loop body that would otherwise run out of registers.
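Spelled out with intrinsics, that constant-free variant would look something like this (a direct transcription of the formula above, untested beyond inspection; the function name is just for illustration):

__m256i mul64_avx2_noconst (__m256i a, __m256i b)
{
    __m256i b_swap    = _mm256_shuffle_epi32(b, _MM_SHUFFLE(2,3, 0,1));  // swap H<->L in each 64-bit element
    __m256i crossprod = _mm256_mullo_epi32(a, b_swap);                   // a_lo*b_hi (low dword), a_hi*b_lo (high dword)

    // sumcross = ((crossprod >> 32) + crossprod) << 32
    __m256i sumcross  = _mm256_srli_epi64(crossprod, 32);                // bring a_hi*b_lo down to the low dword
    sumcross          = _mm256_add_epi32(sumcross, crossprod);           // low dword = sum of the cross products
    sumcross          = _mm256_slli_epi64(sumcross, 32);                 // move the sum to the high dword, zeroing the low dword

    __m256i prodll    = _mm256_mul_epu32(a, b);                          // widening a_lo * b_lo
    return  _mm256_add_epi64(prodll, sumcross);
}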

An AVX1 or SSE4.1 version (two elements per vector) would not be faster than 64-bit scalar imul r64, [mem]. Don't do it unless you already have your data in vectors, and want the result in a vector. (It might be better than extracting to scalar and back.) Or on a CPU that has slow imul r64, r64 like perhaps Silvermont-family. In that case _mm_add_epi32 is faster than _mm_add_epi64.

add_epi64 isn't slower or larger code-size on any AVX2 CPUs, but the SSE version was on some earlier CPUs. I used add_epi32 when possible because it won't be slower, and could possibly save some power or perhaps have lower latency on some hypothetical future CPU. According to https://uops.info/, Intel since Haswell and AMD since Zen 1 run vpaddd identically to vpaddq, including Alder Lake E-cores.


Perf analysis of GCC's autovectorized code (not the intrinsic version)

Background: see Agner Fog's insn tables and microarch guide, and other links in the tag wiki.

Until AVX512 (see below), this is probably only barely faster than scalar 64bit code: imul r64, m64 has a throughput of one per clock on Intel CPUs (but one per 4 clocks on AMD Bulldozer-family). load/imul/sub-with-memory-dest is 4 fused-domain uops on Intel CPUs (with an addressing mode that can micro-fuse, which gcc fails to use). The pipeline width is 4 fused-domain uops per clock, so even a large unroll can't get this to issue at one per clock. With enough unrolling, we'll bottleneck on load/store throughput. 2 loads and one store per clock are possible on Haswell, but store-address uops stealing load ports will lower the throughput to about 81/96 = 84% of that, according to Intel's manual.

So perhaps the best way for Haswell would load and multiply with scalar, (2 uops), then vmovq / pinsrq / vinserti128 so you can do the subtract with a vpsubq. That's 8 uops to load&multiply all 4 scalars, 7 shuffle uops to get the data into a __m256i (2 (movq) + 4 (pinsrq is 2 uops) + 1 vinserti128), and 3 more uops to do a vector load / vpsubq / vector store. So that's 18 fused-domain uops per 4 multiplies (4.5 cycles to issue), but 7 shuffle uops (7 cycles to execute). So nvm, this is no good compared to pure scalar.


The autovectorized code is using 8 vector ALU instructions for each vector of four values. On Haswell, 5 of those uops (multiplies and shifts) can only run on port 0, so no matter how you unroll this algorithm it will achieve at best one vector per 5 cycles (i.e. one multiply per 5/4 cycles.)

The shifts could be replaced with pshufb (port 5) to move the data and shift in zeros. (Other shuffles don't support zeroing instead of copying a byte from the input, and there aren't any known-zeros in the input that we could copy.)

paddq / psubq can run on ports 1/5 on Haswell, or p015 on Skylake.

Skylake runs pmuludq and immediate-count vector shifts on p01, so it could in theory manage a throughput of one vector per max(5/2, 8/3, 11/4) = 11/4 = 2.75 cycles. So it bottlenecks on total fused-domain uop throughput (including the 2 vector loads and 1 vector store). So a bit of loop unrolling will help. Probably resource conflicts from imperfect scheduling will bottleneck it to slightly less than 4 fused-domain uops per clock. The loop overhead can hopefully run on port 6, which can only handle some scalar ops, including add and compare-and-branch, leaving ports 0/1/5 for vector ALU ops, since they're close to saturating (8/3 = 2.666 clocks). The load/store ports are nowhere near saturating, though.

So, Skylake can theoretically manage one vector per 2.75 cycles (plus loop overhead), or one multiply per ~0.7 cycles with GCC auto-vectorization, vs. Haswell's best option (one per ~1.2 cycles in theory with scalar, or one per 1.25 cycles in theory with vectors). The scalar one per ~1.2 cycles would probably require a hand-tuned asm loop, though, because compilers don't know how to use a one-register addressing mode for stores, and a two-register addressing mode for loads (dst + (src-dst) and increment dst).

Also, if your data isn't hot in L1 cache, getting the job done with fewer instructions lets the frontend get ahead of the execution units, and get started on the loads before the data is needed. Hardware prefetch doesn't cross page boundaries, so a vector loop will probably beat scalar in practice for large arrays, and maybe even for smaller arrays.


AVX-512DQ introduces a 64bx64b->64b vector multiply

GCC can auto-vectorize using it if you add -mavx512dq. (In practice use -march=x86-64-v4, or -march=native or -march=skylake-avx512 or whatever to enable other stuff and set tuning options.)

.L4:
    vmovdqu64       zmm0, ZMMWORD PTR [r8+rax]    # vect__11.23, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
    add     rcx, 1    # ivtmp.30,
    vpmullq zmm1, zmm0, zmm2  # vect__13.24, vect__11.23, vect_cst_.25
    vmovdqa64       zmm0, ZMMWORD PTR [r9+rax]    # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B]
    vpsubq  zmm0, zmm0, zmm1    # vect__14.26, MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__13.24
    vmovdqa64       ZMMWORD PTR [r9+rax], zmm0    # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__14.26
    add     rax, 64   # ivtmp.32,
    cmp     rcx, r10  # ivtmp.30, bnd.14
    jb      .L4 #,

So AVX512DQ (expected to be part of Skylake multi-socket Xeon (Purley) in ~2017) will give a much larger than 2x speedup (from wider vectors) if these instructions are pipelined at one per clock.

Update: Skylake-AVX512 (aka SKL-X or SKL-SP) runs VPMULLQ at one per 1.5 cycles for xmm, ymm, or zmm vectors. It's 3 uops with 15c latency. (With maybe an extra 1c of latency for the zmm version, if that's not a measurement glitch in the AIDA results.)

vpmullq is much faster than anything you can build out of 32-bit chunks, so it's very much worth having an instruction for this even if current CPUs don't have 64-bit-element vector-multiply hardware. (Presumably they use the mantissa multipliers in the FMA units.)

Zen 4 runs vpmullq as a single uop for either of 2 ports, so 2/clock throughput for 256-bit vectors, 1/clock for 512-bit vectors. Later Intel CPUs (like Alder Lake / Sapphire Rapids) still run it as 3 uops for ports 0/1, so 1.5c throughput. https://uops.info/
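For comparison with the auto-vectorized asm above, the same loop written directly with AVX-512 intrinsics might look like the following sketch (the function name is just for illustration; compile with -mavx512f -mavx512dq, or just -march=x86-64-v4):

#include <stdint.h>
#include <immintrin.h>
#define BASE_DIMENSION 1028                   // as in the earlier snippets

void multiply_avx512(int64_t *Gi_vec, int64_t q, const int64_t *Gj_vec)
{
    __m512i vq = _mm512_set1_epi64(q);
    intptr_t i;
    for (i = 0; i + 8 <= BASE_DIMENSION; i += 8) {
        __m512i gj = _mm512_loadu_si512(&Gj_vec[i]);
        __m512i gi = _mm512_loadu_si512(&Gi_vec[i]);
        gi = _mm512_sub_epi64(gi, _mm512_mullo_epi64(gj, vq));   // vpmullq: 64x64 -> low 64-bit multiply
        _mm512_storeu_si512(&Gi_vec[i], gi);
    }
    for (; i < BASE_DIMENSION; i++)           // scalar tail (or use a masked load/store)
        Gi_vec[i] -= Gj_vec[i] * q;
}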

Cerumen answered 19/5, 2016 at 10:24 Comment(40)
This is a very helpful post. While this site does not encourage questions asking about recommended resources, perhaps a comment on where one can learn about this stuff step-by-step in a tutorial way would be much appreciated.Akers
@Matsmath: No worries. updated to link to Agner Fog's guides and the x86 tag wiki. I learned this stuff by reading Agner Fog's guides, and reading the realworldtech.com forums and microarchitecture writeups. (Just reading, though; I've only posted once or twice there). If you don't know how a simple pipelined CPU works, maybe read about RISC pipelines.Cerumen
I think you could be more precise when you write "Modern Intel CPUs". Maybe "Since Haswell (2013)..." Because before Haswell Intel CPUs could not do two SIMD loads and one SIMD store per cycle. Very few people at my work have anything as modern as Haswell.Galacto
@Zboson: That was a sloppy edit after I noticed that the algorithm was a read-modify-write. (-= not =). Updated to apply to Sandybridge-family and AMD Bulldozer-family.Cerumen
This may be interesting to you. For 128bx128b to 128b (lower) for unsigned you only need 32bx32b->64b. I mean 64bx64b ->64b (lower) does not help. But for signed 128bx128b to 128b (lower) you need a few more instructions if you don't have 64bx64b ->64b (lower). That's one reason why vpmullq (if it is not too slow) could be interesting.Galacto
Oh, never mind, I forgot that mulx will still beat SIMD 128-bit mul anyway so vpmullq is probably only interesting for 64bx64b ->64b (lower).Galacto
@Zboson: fun fact: mulx has 4 cycle latency vs. 3 cycle latency for mul. IDK why. They're both 2 uops (p1+p5 vs. p1+p6). I keep forgetting that mulx produces a 128b result, though. But of course it does, that's what makes it useful to interleave with adc. (And otherwise its only purpose would be a 3-operand non-destructive imul r,r/m: useful in this case to save one fused-domain uop for load+mul.)Cerumen
Agner Fog's tables place [V]PS[L/R]LDQ on port 5 on Haswell, staying out of the way of the multiplies.Cardenas
@EOF: That's the byte-shift instruction (for one 128b element). We need to shift zeros into the high 32 bits of each 64-bit element. pshufb will do the trick, but psrldq won't.Cerumen
@PeterCordes Fair point. On the other hand [V]PAND can run on ports 0,1,5 and has single-cycle latency.Cardenas
@EOF: We're bottlenecked on fused-domain uop throughput on Skylake, so more instructions hurts there. Using a shuffle-immediate + AND isn't worth it to avoid needing a 16B pshufb shuffle control mask. Even on Haswell, an extra vpand will probably steal some port0 cycles, reducing total throughput.Cerumen
So the "moral of the story" is that for AVX you can't beat 64-bit scalar mult but with AVX2 you can maybe marginally beat 64-bit scalar mult?Galacto
@Zboson: I think the intrinsic AVX2 version is a bit faster than scalar on Haswell. Both versions will be much faster on Skylake. Also note that getting a compiler to make the "ideal" scalar code is probably really hard, so the penalty for compilers being dumb will be worse than for vectors. It requires one-register addressing modes for the dst, but not src. With a big unroll, wasting an insn on src inc is ok. An AVX version on IvyBridge is not quite as bad as Haswell, but prob. still worse than scalar. pmulld is only 1 uop there. Emulating the shift with pshufb would still help.Cerumen
I have a very strong feeling that vpmullq is not going to be 1 per clock. (2 if you include both ports.) The existence of the IFMA52 instructions basically says that the chip is not going to have a native SIMD 64-bit multiplier. And to be fair, going to full 64-bit multipliers in the SIMD unit is a lot of extra area.Weider
@Zboson Some latency #'s are out for Skylake Purley. As I expected, vpmullq is not 1 uop. It appears to be 3 with a latency of about 15 cycles (3 x 5 cycle multiply). I actually predicted it could be 3 a while back.Weider
@Mysticial: Thanks! Those numbers don't say what port instructions run on, so you can't tell which throughput=1 instructions conflict with each other. I did some testing a while ago on a Google Cloud SKL-AVX512 VM. I think my conclusion was that mask instructions (kand / kunpck etc.) run on port5, conflicting with shuffles, but that was a while ago and GCE doesn't expose perf counters so it was guesswork. I might be mis-remembering, and I don't remember if I made notes.Cerumen
@Mysticial: BTW, Anandtech says the 6 and 8-core part will have half-throughput for AVX512 FMA, with some weird execution-unit sharing, and having an FMA unit on port5. (Maybe not execution port sharing though; hopefully ZMM FMA is still only 1 uop and can issue at the same time as a non-FP instruction for port0 or port1) That report confirms that the 10-core part has full-throughput for 512b FMA, but doesn't shed any light on the execution-port/unit possible weirdness.Cerumen
@PeterCordes Here's a benchmark that shows an ES 6-core having full-throughput AVX512. But I have yet to find out if retail 6 and 8-cores have half or full-throughput. I literally just finished assembling a 7900X system which is sitting on my kitchen counter right now running updates. So I have yet to run any proper tests on it. But that won't answer the question of whether the 6 and 8 cores will have full-throughput AVX512.Weider
Right away I can tell that most of my AVX512 code has some severe bottlenecks from yet-to-be-determined sources. But I don't want to run too many tests while the system is updating since there seems to be a hint of instability that I'll need to sort out later. The motherboard seems to be overclocking it to 4 GHz without me asking it to do so. 4 GHz on all cores is a bit too high to be the normal turbo. I'm gonna be busy this weekend.Weider
@Mysticial: Looking forward to finding out the full details once the dust settles. :) Hopefully someone will give Agner Fog remote access to a system so he can update his insn tables, too.Cerumen
Let us continue this discussion in chat.Weider
@Weider from what I read on anandtech the i7-7800X and i7-7820X don't increase the throughput of AVX512 compared to AVX. They split a 512-bit operation between two 256-bit ports (similar to how AMD splits AVX between two 128-bit ports). However, the 7900X has a dedicated AVX512 port so it does double the throughput. Though it seems it may be complicated to optimize with one 512-bit port and two 256-bit ports to get max throughput (two 512-bit ports would be simpler).Galacto
@Zboson: My guess is that they have the weird multi-port stuff so flaws in the silicon somewhere in an FMA unit let them still use that core with full throughput for 256b and 128b FMA. So they can fuse off a couple whole cores, and parts of the FMA units in some other cores, and sell it as an 8-core i7-7820X. On the high-core and extreme-core silicon, where all the SKUs must have full-throughput FMA units, they might (I hope) just widen the FMA units on p0/p1 to 512b, since any flaw anywhere in an FMA unit means the core is ruined.Cerumen
@Zboson If you read the chat transcript between me and Peter, you'll notice that I've been observing a sort of "AVX512 throttle" that kicks in when the TDP is exceeded. I strongly suspect that this involves turning off that port5 dedicated FMA. I don't have enough evidence to confirm this theory, but so far everything I've seen agrees with it.Weider
@PeterCordes They bin them differently. If there's anything wrong with the core (bad FMA, low clock speeds), they simply disable the entire core and sell it a lower core-count part. Every single activated core on these chips are required to hit a minimum of 4.3 GHz. That's high enough that there's almost no room for any errors.Weider
@Mysticial: My thinking is that the routing from ports to actual FMA execution units could be chosen after testing for defects. With 4 x 256b "lanes" of FMA units, only 2 of them need to be defect-free for the core to be usable. p0 and p1 get wired up to lanes that work. If all 4 work, then that core can be used in a full-FMA-throughput SKU with p5 wired too. This design would let them fuse off defects with smaller granularity than at the core level. If they weren't doing something like this, why the heck would they complicate things instead of just widening the p0 / p1 FMAs to 512b?Cerumen
@PeterCordes I'm tempted to think that it's not that simple to "simply rewire the working 256-bit lanes". The logic there is probably not trivial. Since each 256-bit lane would need to be designed to be either the bottom or top half of the 512-bit unit. And that probably complicates a lot of the routing as well. The way it's designed now, that port5 FMA is the one and only "optional" feature that can be disabled without turning off the entire core. And based on the throttling that I'm seeing, they made it run-time toggleable.Weider
@Mysticial: oh right. That makes much more sense. Anyway, that still lets them fuse off an FMA but keep the core (for a 512b-half-throughput SKU), if there's a defect in the p5 FMA but the core is otherwise fine. They couldn't do that nearly as easily if they just widened the p0 and p1 FMA units. But they might just widen p01 in the higher core-count silicon since there are no 512b-half-throughput SKUs that use that silicon, so a defect anywhere in any FMA unit kills the core.Cerumen
@Mysticial, why did Intel bother implementing vpmullq? Is this for forward planning? A future processor could implement it as a single uop. That's the only good reason I can see for them to implement vpmullq now.Galacto
@Zboson: One per 1.5 cycles is much better throughput than our AVX2 32-bit-chunks implementations achieve, and probably also better latency. 3 uops is not bad at all. vpmulld is still 2 uops on Haswell and later, including Skylake. (up from 1 uop on SnB/IvB).Cerumen
@PeterCordes I have tried the proposed solution, but for some reason it doesn't produce the same results as the original. I have tests for these so I can experiment with various implementations and when I use mul64_haswell I get a failure: ``` Input #1: {0, 0, 2067431209228230896, 1490001851655522828} Input #2: {9199068884856851455, 1, 0, 8833705253607473239} Observed: {0, 0, 0, 141202506190423098} Expected: {0, 0, 0, 11542234448223758868} ``` I know this is an old thread, but the solution seems to have a bug and it's the most voted one.Fransen
@Petr: Thanks for the bug report; clearly I never actually tested this since I was adding the cross products to the bottom half, not the top! Fixed now, and updated with better critical-path latency.Cerumen
@Zboson: FYI, if you were using this code anywhere, it was broken for numbers with non-zero upper halves. xD. The fix didn't cost any performance, just a wrong shift direction and mask constant. I assume if anyone had been using this code, they noticed the bug and fixed it locally but didn't comment. :/Cerumen
@PeterCordes I have used a different solution, but tried this code so I could compare. I think that using _mm256_mul_epu32 3 times is not really worse on Intel hardware, because 32-bit multiplication has double the latency anyway, so I would just avoid it if I can.Fransen
@Petr: As my answer discusses, _mm256_mul_epu32 is slightly better for throughput (even on Intel) by reducing the amount of other shuffle instructions you need. It is indeed worse for latency, but out-of-order exec can hide that in cases like this question where it's not part of a much longer chain.Cerumen
@PeterCordes I would have to benchmark the difference - but that's sometimes tricky, especially if you want to target multiple microarchitectures. However, one of the reasons I prefer another solution is that it doesn't need additional constants, which I always prefer, as non-trivial SIMD code usually needs a lot of constants, so I always want to save registers whenever possible.Fransen
@Petr: Loading a constant outside a loop is usually pretty trivial, unless the load misses in cache. (0xffffffff00000000 is cheap to construct on the fly; GCC13 does that, although inefficiently.) I guess you're thinking about a case where you use this as part of a larger algorithm that also needs other constants? You could avoid the constant here by doing sumcross = ((hl >> 32) + lh) << 32, at the cost of more uops that compete for the same ports as multiply, and even worse critical path latency. I think my answer mentioned that possibility. That would also use fewer tmp regs.Cerumen
@PeterCordes That's true, my point was that in some cases there is a lot of registers already holding constants, and you always need registers that are used by the SIMD code. Having more registers holding constants is just something that sometimes makes the compiler either issue memory loads within the loop or to spill registers etc... So any approach that involves less constants or constants already used in other parts is always more beneficial, according to my experience with SIMD. This is BTW related to a more advanced SIMD use, not a trivial loop.Fransen
@Petr: Yes, I understood that. Composing multiple operations into one loop body, instead of bad loops with low computational intensity, is part of the point of writing in C rather than NumPy or something. But I didn't think it would be very common for integer SIMD to use up all 16 registers with constants + temporaries. (FP more often needs more unrolling to hide longer latencies.) Still, added a paragraph about avoiding the constant.Cerumen
@PeterCordes Unfortunately in AVX2 mode it's pretty easy to have all 16 registers used even in pretty trivial integer code. If you use for example 2-4 SIMD registers in parallel to hide latencies of some instructions (like multiplications) and need some temporaries as well, the available registers get exhausted quickly. AVX-512 solved that (pretty hard to exhaust 32 regs), but unfortunately AVX-512 is not present on Intel consumer hardware (and thus neither is the 64-bit multiply).Fransen

If you're interested in SIMD 64bx64b to 64b (lower) operations, here are the AVX and AVX2 solutions from Agner Fog's Vector Class Library. I would test these with arrays and see how they compare to what GCC does with a generic loop such as the one in Peter Cordes' answer.

AVX (use SSE - you can still compile with -mavx to get vex encoding).

// vector operator * : multiply element by element
static inline Vec2q operator * (Vec2q const & a, Vec2q const & b) {
#if INSTRSET >= 5   // SSE4.1 supported
    // instruction does not exist. Split into 32-bit multiplies
    __m128i bswap   = _mm_shuffle_epi32(b,0xB1);           // b0H,b0L,b1H,b1L (swap H<->L)
    __m128i prodlh  = _mm_mullo_epi32(a,bswap);            // a0Lb0H,a0Hb0L,a1Lb1H,a1Hb1L, 32 bit L*H products
    __m128i zero    = _mm_setzero_si128();                 // 0
    __m128i prodlh2 = _mm_hadd_epi32(prodlh,zero);         // a0Lb0H+a0Hb0L,a1Lb1H+a1Hb1L,0,0
    __m128i prodlh3 = _mm_shuffle_epi32(prodlh2,0x73);     // 0, a0Lb0H+a0Hb0L, 0, a1Lb1H+a1Hb1L
    __m128i prodll  = _mm_mul_epu32(a,b);                  // a0Lb0L,a1Lb1L, 64 bit unsigned products
    __m128i prod    = _mm_add_epi64(prodll,prodlh3);       // a0Lb0L+(a0Lb0H+a0Hb0L)<<32, a1Lb1L+(a1Lb1H+a1Hb1L)<<32
    return  prod;
#else               // SSE2
    int64_t aa[2], bb[2];
    a.store(aa);                                           // split into elements
    b.store(bb);
    return Vec2q(aa[0]*bb[0], aa[1]*bb[1]);                // multiply elements separately
#endif
}

AVX2

// vector operator * : multiply element by element
static inline Vec4q operator * (Vec4q const & a, Vec4q const & b) {
    // instruction does not exist. Split into 32-bit multiplies
    __m256i bswap   = _mm256_shuffle_epi32(b,0xB1);           // swap H<->L
    __m256i prodlh  = _mm256_mullo_epi32(a,bswap);            // 32 bit L*H products
    __m256i zero    = _mm256_setzero_si256();                 // 0
    __m256i prodlh2 = _mm256_hadd_epi32(prodlh,zero);         // a0Lb0H+a0Hb0L,a1Lb1H+a1Hb1L,0,0
    __m256i prodlh3 = _mm256_shuffle_epi32(prodlh2,0x73);     // 0, a0Lb0H+a0Hb0L, 0, a1Lb1H+a1Hb1L
    __m256i prodll  = _mm256_mul_epu32(a,b);                  // a0Lb0L,a1Lb1L, 64 bit unsigned products
    __m256i prod    = _mm256_add_epi64(prodll,prodlh3);       // a0Lb0L+(a0Lb0H+a0Hb0L)<<32, a1Lb1L+(a1Lb1H+a1Hb1L)<<32
    return  prod;
}

These functions work for signed and unsigned 64-bit integers. In your case since q is constant within the loop you don't need to recalculate some things every iteration but your compiler will probably figure that out anyway.
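A minimal usage sketch (illustration only, assuming Agner Fog's vectorclass.h header is available and AVX2 is enabled, e.g. -O2 -mavx2): broadcast q once outside the loop and pass it as the b operand so its bswap shuffle can be hoisted.

#include <stdint.h>
#include "vectorclass.h"                      // Agner Fog's VCL: provides Vec4q and the operator* above
#define BASE_DIMENSION 1028                   // as in the question

void multiply_vcl(int64_t *Gi_vec, int64_t q, const int64_t *Gj_vec)
{
    Vec4q vq(q);                              // broadcast the constant once, outside the loop
    int i;
    for (i = 0; i + 4 <= BASE_DIMENSION; i += 4) {
        Vec4q gi, gj;
        gi.load(&Gi_vec[i]);
        gj.load(&Gj_vec[i]);
        gi -= gj * vq;                        // vq is the b operand, so its bswap can be hoisted
        gi.store(&Gi_vec[i]);
    }
    for (; i < BASE_DIMENSION; i++)           // scalar tail
        Gi_vec[i] -= Gj_vec[i] * q;
}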

Galacto answered 19/5, 2016 at 11:59 Comment(6)
Those are 9 fused-domain uops on Haswell/SKL (where [v]pmulld is 2 uops, vs. 1 on SnB). In the unfused domain: 4 shuffle uops (p5), 3 mul uops (p0), and 2 add uops (one p1-only, one p15). So with good scheduling, it runs at one per 4c on Haswell and Skylake. (vs. one per 5c on Haswell or one per 2.5c on Skylake.)Cerumen
Couldn't hadd/pshufd be replaced with something? Yeah, we can use shift/add/and. 1 shuffle uop (p5), 3 mul+1 shift (p0), 2 ADD (p15), 1 AND (p015). We could replace the shift with a pshufb to reduce p0 pressure. Generating the constant now takes 2 insns (so compilers will choose to load it from memory), but I'm not counting that or the xor, since it can be hoisted after inlining.Cerumen
With b as the same-every-time operand, yeah it can hoist the bswap shuffle. Interesting point. Everything else depends on a*b, though, so that's all. Make sure you use b=q, not a=q.Cerumen
edited this into my answer. Thanks again for suggesting a different algorithm that can be optimized better. I didn't take the time to consider anything but gcc's autovectorized output.Cerumen
@PeterCordes, cool that you improved Agner's solution. I would send Agner an email with your suggestion. If he likes it he probably will put it into the next version of his vector class library.Galacto
Thank you. I will put this in the next update of my Vector class librarySpheroidal
