Some architectures, x86 being a prime example, have instructions where one of the sources is also the destination. If you still need the original value of the destination after the operation, you need an extra instruction to copy it to another register.
Commutative operations give you (or the compiler) a choice of which operand gets overwritten by the result. So for example, compiling with gcc 5.3 -O3 for the x86-64 Linux calling convention:
// FP args: a, b, c arrive in xmm0, xmm1, xmm2; the return value goes in xmm0.
// Intel-syntax asm is op dest, src.
// "sd" means Scalar Double (as opposed to a packed vector, or to single-precision).
double comm(double a, double b, double c) { return (c+a) * (c+b); }
    addsd   xmm0, xmm2     ; xmm0 = a+c
    addsd   xmm1, xmm2     ; xmm1 = b+c
    mulsd   xmm0, xmm1     ; xmm0 = (a+c)*(b+c), already in the return register
    ret
double hard(double a, double b, double c) { return (c-a) * (c-b); }
    movapd  xmm3, xmm2     ; reg-reg copy: move Aligned Packed Double (xmm3 = c)
    subsd   xmm2, xmm1     ; xmm2 = c-b
    subsd   xmm3, xmm0     ; xmm3 = c-a
    movapd  xmm0, xmm3     ; another copy, to get the result into the return register
    mulsd   xmm0, xmm2     ; xmm0 = (c-a)*(c-b)
    ret
double easy(double a, double b, double c) { return (a-c) * (b-c); }
    subsd   xmm0, xmm2     ; xmm0 = a-c
    subsd   xmm1, xmm2     ; xmm1 = b-c
    mulsd   xmm0, xmm1     ; xmm0 = (a-c)*(b-c)
    ret
x86 also allows memory operands as a source, so you can fold loads into ALU operations, like addsd xmm0, [my_constant]. (Using an ALU op with a memory destination sucks: it has to do a read-modify-write.) Commutative operations give more scope for this, since either operand can be the one that comes from memory.
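For example (a sketch of typical gcc -O3 output; the .LC0 constant-pool label and exact addressing syntax are whatever the compiler emits), multiplication is commutative, so the compiler can keep x in xmm0 and make the constant the memory-source operand:

double scale(double x) { return x * 3.0; }
    mulsd   xmm0, [.LC0]   ; load of the 3.0 constant folded into the multiply
    ret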
x86's AVX extension (introduced with Sandybridge, Jan 2011) added non-destructive 3-operand versions of every existing instruction that used vector registers (same opcodes, but with a multi-byte VEX prefix replacing all the previous prefixes and escape bytes). Other instruction-set extensions (like BMI/BMI2) also use the VEX coding scheme to introduce 3-operand non-destructive integer instructions, like PEXT r32a, r32b, r/m32: parallel extract of bits from r32b using the mask in r/m32, with the result written to r32a.
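In C you'd normally reach PEXT through its BMI2 intrinsic rather than hand-written asm. A minimal sketch (assumes gcc or clang with -mbmi2 and a BMI2-capable CPU):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    // src = 0xB2 = 1011 0010, mask = 0xF0 = 1111 0000.
    // PEXT gathers the four masked bits (1011) down to the bottom: 0xB.
    unsigned r = _pext_u32(0xB2, 0xF0);
    printf("%#x\n", r);    // prints 0xb
    return 0;
}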
AVX also widened the vector registers to 256 bits and added some new instructions. It's unfortunately nowhere near ubiquitous, and even Skylake Pentium/Celeron CPUs don't support it. It will be a long time before it's safe to ship binaries that assume AVX support. :(
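Until then, the usual workaround is runtime dispatch. A minimal, self-contained sketch using GCC/clang's __builtin_cpu_supports (the kernel_* functions are hypothetical stand-ins for an AVX build and a baseline build of the same routine):

#include <stdio.h>

static void kernel_avx(void)  { puts("AVX path"); }   // stand-in for a version compiled with -mavx
static void kernel_sse2(void) { puts("SSE2 path"); }  // stand-in for the baseline version

int main(void) {
    if (__builtin_cpu_supports("avx"))    // CPUID check at run time
        kernel_avx();
    else
        kernel_sse2();
    return 0;
}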
Add -march=native to the compile options in the godbolt link above to see that AVX lets the compiler use just 3 instructions even for hard(). (godbolt runs on a Haswell server, so that includes AVX2 and BMI2):
double hard(double a, double b, double c) { return (c-a) * (c-b); }
    vsubsd  xmm0, xmm2, xmm0   ; xmm0 = c-a: separate destination, no copy needed
    vsubsd  xmm1, xmm2, xmm1   ; xmm1 = c-b
    vmulsd  xmm0, xmm0, xmm1   ; xmm0 = (c-a)*(c-b)
    ret
You need -ffast-math to get auto-vectorization for C/C++ (or OpenMP, or some other way to indicate that you're ok with something other than the exact order of operations in the source's loops). – Pincushion

You only need -ffast-math for auto-vectorized reductions; otherwise -O3 is sufficient. But OpenMP assumes associative math for reductions even without -ffast-math. This is something many people don't realize (and then ask a question on SO about why their result with OpenMP is different). Another way to get auto-vectorization with GCC is to use #pragma omp simd; then you only need -O2, even for reductions. – Hectorhecuba

(And -ffast-math enables stuff like a[0..n] /= val; transforming to inv_val = 1/val; a[0..n] *= inv_val;.) I'm sure you knew that, just writing it down here for the record. – Pincushion
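For the record, here's a sketch of the #pragma omp simd route mentioned above (compile with gcc -O2 -fopenmp-simd; the reduction clause is what grants the compiler permission to re-associate the additions):

// Vectorizable sum without -ffast-math: the pragma explicitly allows
// the compiler to re-associate the += chain into vector partial sums.
double sum(const double *a, int n) {
    double s = 0.0;
    #pragma omp simd reduction(+:s)
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}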
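And a sketch of the division-to-multiplication rewrite from the last comment; gcc only does this with -ffast-math, because multiplying by 1/val isn't bit-identical to dividing by val in general:

// As written: one (slow) divsd per element.
void scale_all(double *a, int n, double val) {
    for (int i = 0; i < n; i++)
        a[i] /= val;
}

// What -ffast-math effectively turns it into: one division up front,
// then cheap multiplies in the loop.
void scale_all_fast(double *a, int n, double val) {
    double inv = 1.0 / val;
    for (int i = 0; i < n; i++)
        a[i] *= inv;
}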