You seem to be assuming long is 64 bits in your code, but then using __uint64_t as well. In 32-bit code, the x32 ABI, and on Windows, long is a 32-bit type. Your title mentions long long, but then your code ignores it. I was wondering for a while if your code was assuming that long was 32 bits.
You're completely shooting yourself in the foot by using 256-bit AVX loads but then aliasing a pointer onto the __m256i to do scalar operations. gcc just gives up and gives you the terrible code you asked for: a vector load and then a bunch of extract and insert instructions. Your way of writing it means that both vectors have to be unpacked to do the sub in scalar as well, instead of using vpsubq.
Modern x86 CPUs have very fast L1 cache that can handle multiple accesses per clock (Haswell and later: two loads and one store per clock). Doing multiple scalar loads from the same cache line is better than a vector load and unpacking. (Imperfect uop scheduling reduces the throughput to about 84% of that, though: see below.)
gcc 5.3 -O3 -march=haswell (Godbolt compiler explorer) auto-vectorizes a simple scalar implementation pretty well. When AVX2 isn't available, gcc foolishly still auto-vectorizes with 128b vectors: On Haswell, this will actually be about 1/2 the speed of ideal scalar 64bit code. (See the perf analysis below, but substitute 2 elements per vector instead of 4).
#include <stdint.h> // why not use this like a normal person?
#define BASE_VEX_STOP 1024
#define BASE_DIMENSION 1028
// restrict lets the compiler know the arrays don't overlap,
// so it doesn't have to generate a scalar fallback case
void multiply_simple(uint64_t *restrict Gi_vec, uint64_t q, const uint64_t *restrict Gj_vec){
for (intptr_t i=0; i<BASE_DIMENSION; i++) // gcc doesn't manage to optimize away the sign-extension from 32bit to pointer-size in the scalar epilogue to handle the last less-than-a-vector elements
Gi_vec[i] -= Gj_vec[i] * q;
}
inner loop:
.L4:
vmovdqu ymm1, YMMWORD PTR [r9+rax] # MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
add rcx, 1 # ivtmp.30,
vpsrlq ymm0, ymm1, 32 # tmp174, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B],
vpmuludq ymm2, ymm1, ymm3 # tmp173, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B], vect_cst_.25
vpmuludq ymm0, ymm0, ymm3 # tmp176, tmp174, vect_cst_.25
vpmuludq ymm1, ymm4, ymm1 # tmp177, tmp185, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
vpaddq ymm0, ymm0, ymm1 # tmp176, tmp176, tmp177
vmovdqa ymm1, YMMWORD PTR [r8+rax] # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B]
vpsllq ymm0, ymm0, 32 # tmp176, tmp176,
vpaddq ymm0, ymm2, ymm0 # vect__13.24, tmp173, tmp176
vpsubq ymm0, ymm1, ymm0 # vect__14.26, MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__13.24
vmovdqa YMMWORD PTR [r8+rax], ymm0 # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__14.26
add rax, 32 # ivtmp.32,
cmp rcx, r10 # ivtmp.30, bnd.14
jb .L4 #,
Translate that back to intrinsics if you want, but it's going to be a lot easier to just let the compiler autovectorize. I didn't try to analyse it to see if it's optimal.
If you don't usually compile with -O3, you could use #pragma omp simd before the loop (and -fopenmp).
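For example, a minimal sketch (the function name is mine; the pragma needs -fopenmp or -fopenmp-simd on the command line so it isn't ignored):

#include <stdint.h>

void multiply_omp_simd(uint64_t *restrict Gi_vec, uint64_t q, const uint64_t *restrict Gj_vec)
{
    #pragma omp simd   // with -fopenmp or -fopenmp-simd, asks the compiler to vectorize this loop even when full -O3 auto-vectorization isn't enabled
    for (intptr_t i = 0; i < BASE_DIMENSION; i++)
        Gi_vec[i] -= Gj_vec[i] * q;
}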
Of course, instead of a scalar epilogue, it would probably be faster to do an unaligned load of the last 32B of Gj_vec, and store into the last 32B of Gi_vec, potentially overlapping with the last store from the loop. (A scalar fallback is still needed if the arrays are smaller than 32B.)
GCC13 and Clang17 still both vectorize this way, using another two vpmuludq instead of one vpmulld to get both 32-bit cross products done with slightly less shuffling. But they are able to optimize away the insert/extract in the OP's pointer-aliasing ptr_I[0] -= ptr_J[0] * q;, so that compiles to the same asm loop as the simple scalar version.
Improved vector intrinsic version for AVX2
From my comments on Z Boson's answer. Based on Agner Fog's vector class library code, with various improvements in 2023 along with a bugfix. (For 8 years, nobody noticed that I was adding the cross products to the bottom of the full result, not to the top half to match their place-values; thanks @Petr for actually testing this.)
Agner Fog's version saves an instruction but bottlenecks on the shuffle port by using phadd + pshufd where I use psllq / pand / paddd. vpmulld to get both cross products packed together is 2 uops on Haswell and later Intel (same as two separate vpmuludq instructions but with less shuffling of inputs), but only 1 uop for either of two ports on Zen 3 and later AMD.
Since one of your operands is constant, make sure to pass set1(q) as b, not a, so the "b_swap" shuffle can be hoisted.
// replace hadd -> shuffle (4 uops) with shift/and/add (3 uops with less shuffle-port pressure)
// The constant takes 2 insns to generate outside a loop.
__m256i mul64_avx2 (__m256i a, __m256i b)
{
// There is no vpmullq until AVX-512. Split into 32-bit multiplies
// Given a and b composed of high<<32 | low 32-bit halves
// a*b = a_low*(u64)b_low + (u64)(a_high*b_low + a_low*b_high)<<32; // same for signed or unsigned a,b since we aren't widening to 128
// the a_high * b_high product isn't needed for non-widening; its place value is entirely outside the low 64 bits.
__m256i b_swap = _mm256_shuffle_epi32(b, _MM_SHUFFLE(2,3, 0,1)); // swap H<->L
__m256i crossprod = _mm256_mullo_epi32(a, b_swap); // 32-bit L*H and H*L cross-products
__m256i prodlh = _mm256_slli_epi64(crossprod, 32); // bring the low half up to the top of each 64-bit chunk
__m256i prodhl = _mm256_and_si256(crossprod, _mm256_set1_epi64x(0xFFFFFFFF00000000)); // isolate the other cross product, already in the high half where it eventually needs to be
__m256i sumcross = _mm256_add_epi32(prodlh, prodhl); // the sum of the cross products, with the low half of each u64 being 0.
__m256i prodll = _mm256_mul_epu32(a,b); // widening 32x32 => 64-bit low x low products
__m256i prod = _mm256_add_epi32(prodll, sumcross); // add the cross products into the high half of the result
return prod;
}
This has been tested and now works, including for values with non-zero high and low halves. See it on Godbolt.
Note that this doesn't include the final subtract from the question, only the multiply.
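For completeness, here's one way I'd wire it into the loop from the question (a sketch, not code from the original answer: the function name and the runtime length parameter n are my additions, and it assumes n >= 4). It passes set1(q) as b so the b_swap shuffle gets hoisted, and uses the overlapping-final-vector idea from earlier instead of a scalar epilogue. Because Gi_vec is updated in place, the tail result is computed up front from the untouched data, so the overlapped elements don't get the subtraction applied twice.

#include <stdint.h>
#include <immintrin.h>

// mul64_avx2() is the function above
void multiply_mul64_avx2(uint64_t *restrict Gi_vec, uint64_t q,
                         const uint64_t *restrict Gj_vec, size_t n)   // assumes n >= 4
{
    __m256i qv = _mm256_set1_epi64x(q);   // constant operand as b, so b_swap hoists out of the loop

    // result for the last 32B, computed from the original values before the main loop can touch them
    size_t last = n - 4;
    __m256i tail = _mm256_sub_epi64(
        _mm256_loadu_si256((const __m256i*)(Gi_vec + last)),
        mul64_avx2(_mm256_loadu_si256((const __m256i*)(Gj_vec + last)), qv));

    for (size_t i = 0; i + 4 <= n; i += 4) {              // main loop over whole vectors
        __m256i gi = _mm256_loadu_si256((const __m256i*)(Gi_vec + i));
        __m256i gj = _mm256_loadu_si256((const __m256i*)(Gj_vec + i));
        _mm256_storeu_si256((__m256i*)(Gi_vec + i),
                            _mm256_sub_epi64(gi, mul64_avx2(gj, qv)));
    }

    _mm256_storeu_si256((__m256i*)(Gi_vec + last), tail); // overlapping final store covers any leftover 1..3 elements
}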
This version should perform a bit better on Haswell than gcc's autovectorized version. (Like maybe one vector per 4 cycles instead of one vector per 5 cycles, bottlenecked on port 0 throughput. Replacing vpsllq (slli_epi64) with vpshufb with a vector constant, or with vpslldq by 4 and doing the masking after adding, can get that port 0 bottleneck down to 3 cycles.) Either way is fine on later CPUs like Skylake and Rocket Lake, where https://uica.uops.info/ estimates a throughput bottleneck of 2.67 cycles based on execution-port pressure (for just the multiply without a loop or subtract), with perfect scheduling. I went with the shift version so it didn't need an extra vector constant, and let the shift/and run in parallel for better critical-path latency in case this is part of a dependency chain, unlike in the question where we don't do much else with each element between load and store, so out-of-order exec doesn't have to work hard to overlap between iterations.
We can avoid vector constants entirely by doing sumcross = ((crossprod >> 32) + crossprod) << 32. That still takes 3 instructions, but the AND becomes a shift, so it's another uop that competes with multiplies (on Intel P-cores). And it has worse critical-path latency. But it avoids the constant, and uses fewer temporary registers (for the same reason it has worse ILP and higher latency), so it could be useful if multiplying as part of a larger loop body that would otherwise run out of registers.
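In intrinsics, that variant is a drop-in replacement for the three lines that compute sumcross in mul64_avx2 above (a sketch I haven't benchmarked; the temporary names are mine):

    __m256i cross_hi  = _mm256_srli_epi64(crossprod, 32);       // bring the high-half cross product down to the low half
    __m256i cross_sum = _mm256_add_epi32(cross_hi, crossprod);  // sum of both cross products in the low 32 bits of each u64
    __m256i sumcross  = _mm256_slli_epi64(cross_sum, 32);       // move the sum to the high half, zeroing the low half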
An AVX1 or SSE4.1 version (two elements per vector) would not be faster than 64-bit scalar imul r64, [mem]. Don't do it unless you already have your data in vectors and want the result in a vector. (It might be better than extracting to scalar and back.) Or on a CPU that has slow imul r64, r64, like perhaps Silvermont-family. In that case _mm_add_epi32 is faster than _mm_add_epi64.
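If the data really is already sitting in __m128i vectors, the same technique translates directly (my untested 128-bit sketch of the AVX2 function above; SSE4.1 is needed for pmulld):

#include <immintrin.h>

__m128i mul64_sse41(__m128i a, __m128i b)   // two 64-bit products per vector
{
    __m128i b_swap    = _mm_shuffle_epi32(b, _MM_SHUFFLE(2,3, 0,1));  // swap H<->L in each u64
    __m128i crossprod = _mm_mullo_epi32(a, b_swap);                   // 32-bit L*H and H*L cross products (SSE4.1)
    __m128i prodlh    = _mm_slli_epi64(crossprod, 32);                // low cross product up to the high half
    __m128i prodhl    = _mm_and_si128(crossprod, _mm_set1_epi64x(0xFFFFFFFF00000000)); // isolate the other
    __m128i sumcross  = _mm_add_epi32(prodlh, prodhl);                // sum of cross products, low half zero
    __m128i prodll    = _mm_mul_epu32(a, b);                          // widening 32x32 => 64-bit low x low
    return _mm_add_epi32(prodll, sumcross);
}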
add_epi64 isn't slower or larger code-size on any AVX2 CPUs, but the SSE version was on some earlier CPUs. I used add_epi32 when possible because it won't be slower, and could possibly save some power or perhaps have lower latency on some hypothetical future CPU. According to https://uops.info/, Intel since Haswell and AMD since Zen 1 run vpaddd identically to vpaddq, including Alder Lake E-cores.
Perf analysis of GCC's autovectorized code (not the intrinsic version)
Background: see Agner Fog's insn tables and microarch guide, and other links in the x86 tag wiki.
Until AVX512 (see below), this is probably only barely faster than scalar 64bit code: imul r64, m64 has a throughput of one per clock on Intel CPUs (but one per 4 clocks on AMD Bulldozer-family). load/imul/sub-with-memory-dest is 4 fused-domain uops on Intel CPUs (with an addressing mode that can micro-fuse, which gcc fails to use). The pipeline width is 4 fused-domain uops per clock, so even a large unroll can't get this to issue at one per clock. With enough unrolling, we'll bottleneck on load/store throughput. 2 loads and one store per clock is possible on Haswell, but store-address uops stealing load ports will lower the throughput to about 81/96 = 84% of that, according to Intel's manual.
So perhaps the best way for Haswell would be to load and multiply with scalar (2 uops each), then vmovq / pinsrq / vinserti128 so you can do the subtract with a vpsubq. That's 8 uops to load & multiply all 4 scalars, 7 shuffle uops to get the data into a __m256i (2 (movq) + 4 (pinsrq is 2 uops) + 1 vinserti128), and 3 more uops to do a vector load / vpsubq / vector store. So that's 18 fused-domain uops per 4 multiplies (4.5 cycles to issue), but 7 shuffle uops (7 cycles to execute). So never mind, this is no good compared to pure scalar.
The autovectorized code is using 8 vector ALU instructions for each vector of four values. On Haswell, 5 of those uops (multiplies and shifts) can only run on port 0, so no matter how you unroll this algorithm it will achieve at best one vector per 5 cycles (i.e. one multiply per 5/4 cycles.)
The shifts could be replaced with pshufb (port 5) to move the data and shift in zeros. (Other shuffles don't support zeroing instead of copying a byte from the input, and there aren't any known-zeros in the input that we could copy.)
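For example, a pshufb control byte with its high bit set writes a zero to that destination byte, so the left shift by 32 within each 64-bit element could be done like this (a sketch in intrinsics rather than a patch of the compiler output; the function name is mine):

#include <immintrin.h>

// equivalent of _mm256_slli_epi64(v, 32), but running on the shuffle port (p5 on Haswell)
// instead of the shift port. vpshufb shuffles within each 128-bit lane, which is enough here;
// a control byte of 0x80 (-128) zeroes that destination byte.
static inline __m256i slli64_by32_pshufb(__m256i v)
{
    const __m256i ctrl = _mm256_setr_epi8(
        -128,-128,-128,-128,  0, 1, 2, 3,     // low  u64: bytes 0-3 -> zero, bytes 4-7 <- src bytes 0-3
        -128,-128,-128,-128,  8, 9,10,11,     // high u64: bytes 8-11 -> zero, bytes 12-15 <- src bytes 8-11
        -128,-128,-128,-128,  0, 1, 2, 3,     // same pattern in the upper 128-bit lane
        -128,-128,-128,-128,  8, 9,10,11);
    return _mm256_shuffle_epi8(v, ctrl);
}

The right shift by 32 would use an analogous control that copies bytes 4-7 / 12-15 down and zeroes the upper four bytes of each u64.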
paddq / psubq can run on ports 1/5 on Haswell, or p015 on Skylake.
Skylake runs pmuludq and immediate-count vector shifts on p01, so it could in theory manage a throughput of one vector per max(5/2, 8/3, 11/4) = 11/4 = 2.75 cycles. So it bottlenecks on total fused-domain uop throughput (including the 2 vector loads and 1 vector store), and a bit of loop unrolling will help. Probably resource conflicts from imperfect scheduling will bottleneck it to slightly less than 4 fused-domain uops per clock. The loop overhead can hopefully run on port 6, which can only handle some scalar ops, including add and compare-and-branch, leaving ports 0/1/5 for vector ALU ops, since they're close to saturating (8/3 = 2.666 clocks). The load/store ports are nowhere near saturating, though.
So, Skylake can theoretically manage one vector per 2.75 cycles (plus loop overhead), or one multiply per ~0.7 cycles with GCC auto-vectorization, vs. Haswell's best option (one per ~1.2 cycles in theory with scalar, or one per 1.25 cycles in theory with vectors). The scalar one per ~1.2 cycles would probably require a hand-tuned asm loop, though, because compilers don't know how to use a one-register addressing mode for stores and a two-register addressing mode for loads (dst + (src-dst) and increment dst).
Also, if your data isn't hot in L1 cache, getting the job done with fewer instructions lets the frontend get ahead of the execution units, and get started on the loads before the data is needed. Hardware prefetch doesn't cross page boundaries, so a vector loop will probably beat scalar in practice for large arrays, and maybe even for smaller arrays.
AVX-512DQ introduces a 64bx64b->64b vector multiply
GCC can auto-vectorize using it if you add -mavx512dq. (In practice use -march=x86-64-v4, or -march=native or -march=skylake-avx512 or whatever, to enable other stuff and set tuning options.)
.L4:
vmovdqu64 zmm0, ZMMWORD PTR [r8+rax] # vect__11.23, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
add rcx, 1 # ivtmp.30,
vpmullq zmm1, zmm0, zmm2 # vect__13.24, vect__11.23, vect_cst_.25
vmovdqa64 zmm0, ZMMWORD PTR [r9+rax] # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B]
vpsubq zmm0, zmm0, zmm1 # vect__14.26, MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__13.24
vmovdqa64 ZMMWORD PTR [r9+rax], zmm0 # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__14.26
add rax, 64 # ivtmp.32,
cmp rcx, r10 # ivtmp.30, bnd.14
jb .L4 #,
So AVX512DQ (expected to be part of Skylake multi-socket Xeon (Purley) in ~2017) will give a much larger speedup than just the 2x from wider vectors, if these instructions are pipelined at one per clock.
Update: Skylake-AVX512 (aka SKL-X or SKL-SP) runs VPMULLQ at one per 1.5 cycles for xmm, ymm, or zmm vectors. It's 3 uops with 15c latency. (With maybe an extra 1c of latency for the zmm version, if that's not a measurement glitch in the AIDA results.)
vpmullq is much faster than anything you can build out of 32-bit chunks, so it's very much worth having an instruction for this even if current CPUs don't have 64-bit-element vector-multiply hardware. (Presumably they use the mantissa multipliers in the FMA units.)
Zen 4 runs vpmullq as a single uop for either of 2 ports, so 2/clock throughput for 256-bit vectors, 1/clock for 512-bit vectors. Later Intel CPUs (like Alder Lake / Sapphire Rapids) still run it as 3 uops for ports 0/1, so 1.5c throughput. https://uops.info/
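In intrinsics, that operation is exposed directly (a minimal sketch; these compile to vpmullq and need AVX-512DQ, plus AVX-512VL for the 256-bit form, e.g. -march=x86-64-v4):

#include <immintrin.h>

__m256i mul64_avx512vl(__m256i a, __m256i b) {
    return _mm256_mullo_epi64(a, b);   // 64x64 -> low 64 bits per element, one vpmullq
}

__m512i mul64_avx512(__m512i a, __m512i b) {
    return _mm512_mullo_epi64(a, b);
}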
Comments:
… Anyway, on Haswell, it's potentially slower than an unrolled hand-tuned scalar asm loop (load/imul/sub-with-memory-dest), bottlenecking on load/store throughput after unrolling. With C intrinsics and compiler output, you're probably going to get faster code with vectors, though. – Cerumen
__m256i is defined with a "may alias" attribute, which is why it does what the user expects with gcc. You're absolutely right that it's not safe or a good idea in general. A union with a struct is a better idea. – Cerumen
… the -= by the C intrinsic _mm256_sub_epi64 after the multiplication loop-unroll. Right now I have ptr_J[0] *= q;. I get a speedup from this. @Zboson I'm using a Haswell microarchitecture, so I can use both. – Demodulation
… __int64_t directly. Just #include <stdint.h> and use int64_t so your code is portable. Fortunately, the low 64-bit result of a 64b x 64b multiply is the same whether the inputs are signed or unsigned, so the code gives identical results either way. – Cerumen