Associativity gives us parallelizability. But what does commutativity give?

Alexander Stepanov notes in one of his brilliant lectures at A9 (highly recommended, by the way) that the associative property gives us parallelizability – an extremely useful and important trait these days that compilers, CPUs and programmers themselves can leverage:

// expressions in parentheses can be done in parallel
// because matrix multiplication is associative
Matrix X = (A * B) * (C * D);

But what, if anything, does the commutative property give us? Reordering? Out-of-order execution?

Supportive answered 16/2, 2016 at 21:32 Comment(7)
Great question! The fact that the associative property gives parallelizability is often overlooked. Many people don't realize that chain matrix multiplication can be done in parallel even though it does not commute. Also, it's rather ironic that floating-point arithmetic is not associative, which many people overlook when they do floating-point operations in parallel.Hectorhecuba
@Zboson Great note about floating point non-associativity - thanks!Supportive
Incidentally, OpenMP generally assumes that operations are both associative and commutative, for example when doing reductions. So if you have an operation that does not commute but is still associative, you have to do a bit of hacking to use OpenMP. I think most people just give up when an operation does not commute and say it can't be done in parallel. See this question.Hectorhecuba
Floating point non-associativity is why you need -ffast-math to get auto-vectorization for C/C++. (or OpenMP or some other way to indicate that you're ok with something other than the exact order of operations in the source's looping).Pincushion
@PeterCordes, yes, good point. Although you only need -ffast-math for auto-vectorized reductions; otherwise -O3 is sufficient. But OpenMP assumes associative math for reductions even without -ffast-math. This is something many people don't realize (and then ask a question on SO about why their result with OpenMP is different). Another way to get auto-vectorization with GCC is to use #pragma omp simd; then you only need -O2, even for reductions.Hectorhecuba
@Zboson: right, I should have said "many FP optimizations", because it helps for more than just auto-vectorizing reductions. It allows transformations that enable common-subexpression elimination, potentially hoisting part of a calculation out of a loop. (Further compiler assumptions that are part of -ffast-math enable stuff like a[0..n] /= val; transforming to inv_val = 1/val; a[0..n] *= inv_val;). I'm sure you knew that, just writing it down here for the record.Pincushion
Turns out that I was able to find an operation that is not associative but can still be done in parallel. So associativity is not necessarily required for work sharing.Hectorhecuba

Some architectures, x86 being a prime example, have instructions where one of the sources is also the destination. If you still need the original value of the destination after the operation, you need an extra instruction to copy it to another register.

Commutative operations give you (or the compiler) a choice of which operand gets replaced with the result. So for example, compiling (with gcc 5.3 -O3 for the x86-64 Linux calling convention):

// FP: a,b,c in xmm0,1,2.  return value goes in xmm0
// Intel syntax ASM is  op  dest, src
// sd means Scalar Double (as opposed to packed vector, or to single-precision)
double comm(double a, double b, double c) { return (c+a) * (c+b); }
    addsd   xmm0, xmm2
    addsd   xmm1, xmm2
    mulsd   xmm0, xmm1
    ret
double hard(double a, double b, double c) { return (c-a) * (c-b); }
    movapd  xmm3, xmm2    ; reg-reg copy: move Aligned Packed Double
    subsd   xmm2, xmm1
    subsd   xmm3, xmm0
    movapd  xmm0, xmm3
    mulsd   xmm0, xmm2
    ret
double easy(double a, double b, double c) { return (a-c) * (b-c); }
    subsd   xmm0, xmm2
    subsd   xmm1, xmm2
    mulsd   xmm0, xmm1
    ret

x86 also allows using memory operands as a source, so you can fold loads into ALU operations, like addsd xmm0, [my_constant]. (Using an ALU op with a memory destination sucks: it has to do a read-modify-write.) Commutative operations give more scope for doing this.
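
For instance (a sketch of mine, not from the original answer; the exact output depends on compiler version), gcc -O3 will fold the load of a constant straight into a commutative add instead of emitting a separate load:

// hypothetical example: gcc keeps the constant 3.0 at .LC0 in .rodata
double plus_k(double a) { return a + 3.0; }
    addsd   xmm0, QWORD PTR .LC0[rip]   ; load of 3.0 folded into the add, no separate movsd
    ret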

x86's AVX extension (introduced with Sandybridge, Jan 2011) added non-destructive versions of every existing instruction that used vector registers (same opcodes but with a multi-byte VEX prefix replacing all the previous prefixes and escape bytes). Other instruction-set extensions (like BMI/BMI2) also use the VEX coding scheme to introduce 3-operand non-destructive integer instructions, like PEXT r32a, r32b, r/m32: Parallel extract of bits from r32b using mask in r/m32. Result is written to r32a.
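
To sketch what that buys you from C (a hypothetical example of mine, not from the original answer), the BMI2 intrinsic _pext_u32 maps to that single non-destructive PEXT instruction, so neither source operand needs to be copied first:

// compile with: gcc -O2 -mbmi2 pext_demo.c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned val  = 0xDEADBEEFu;
    unsigned mask = 0x0F0F0F0Fu;              // select the low nibble of each byte
    unsigned packed = _pext_u32(val, mask);   // one pext instruction, both sources preserved
    printf("%08X\n", packed);                 // prints 0000EDEF
    return 0;
}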

AVX also widened the vectors to 256b and added some new instructions. It's unfortunately nowhere near ubiquitous, and even Skylake Pentium/Celeron CPUs don't support it. It will be a long time before it's safe to ship binaries that assume AVX support. :(

Add -march=native to the compile options in the godbolt link above to see that AVX lets the compiler use just 3 instructions even for hard(). (godbolt runs on a Haswell server, so that includes AVX2 and BMI2):

double hard(double a, double b, double c) { return (c-a) * (c-b); }
        vsubsd  xmm0, xmm2, xmm0
        vsubsd  xmm1, xmm2, xmm1
        vmulsd  xmm0, xmm0, xmm1
        ret
Pincushion answered 17/2, 2016 at 0:58 Comment(2)
Do you mean that Skylake Pentium/Celeron CPUs don't support VEX encoding? I thought maybe they had avx-128 but not avx-256. Do you mean they only have SSE?Hectorhecuba
@Zboson: There's unfortunately no such thing as avx-128, AFAIK. There's no way for a CPU to indicate that it has VEX support without ymm registers. So yes, I mean they only have SSE4.2, like it says on ark.intel.com/products/88179/…. They don't have BMI/BMI2 either (perhaps related to lack of VEX support). Many people (myself included) only noticed this when Intel published an erratum about it erroneously showing as present in Skylake: software.intel.com/en-us/forums/intel-isa-extensions/topic/…Pincushion

Here is a more abstract answer with less emphasis on instruction-level parallelism and more on thread-level parallelism.

A common objective in parallelism is to do a reduction of information. A simple example is the dot product of two arrays

for(int i=0; i<N; i++) sum += x[i]*y[i];

If the operation is associative then we can have each thread calculate a partial sum. Then the final sum is the sum of all the partial sums.

If the operation is commutative the final sum can be done in any order. Otherwise the partial sums have to be summed in order.

One problem is that we can't have multiple threads writing to the final sum at the same time, otherwise it creates a race condition. So when one thread writes to the final sum, the others have to wait. Therefore, summing in any order can be more efficient, because it's often difficult to have each thread finish in order.


Let's choose an example. Let's say there are two threads and therefore two partial sums.

If the operation is commutative we could have this case

thread2 finishes its partial sum
sum += thread2's partial sum
thread2 finishes writing to sum   
thread1 finishes its partial sum
sum += thread1's partial sum

However if the operation does not commute we would have to do

thread2 finishes its partial sum
thread2 waits for thread1 to write to sum
thread1 finishes its partial sum
sum += thread1's partial sum
thread2 waits for thread1 to finish writing to sum    
thread1 finishes writing to sum   
sum += thread2's partial sum

Here is an example of the dot product with OpenMP

#pragma omp parallel for reduction(+: sum)
for(int i=0; i<N; i++) sum += x[i]*y[i];

The reduction clause assumes the operation (+ in this case) is commutative. Most people take this for granted.

If the operation is not commutative we would have to do something like this

float sum = 0;
#pragma omp parallel
{
    float sum_partial = 0;
    #pragma omp for schedule(static) nowait
    for(int i=0; i<N; i++) sum_partial += x[i]*y[i];
    #pragma omp for schedule(static) ordered
    for(int i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        sum += sum_partial;
    }
}

The nowait clause tells OpenMP not to wait for each partial sum to finish. The ordered clause tells OpenMP to only write to sum in order of increasing thread number.

This method does the final sum linearly. However, it could be done in log2(omp_get_num_threads()) steps.

For example if we had four threads we could do the reduction in three sequential steps

  1. calculate four partial sums in parallel: s1, s2, s3, s4
  2. calculate in parallel: s5 = s1 + s2 with thread1 and s6 = s3 + s4 with thread2
  3. calculate sum = s5 + s6 with thread1

That's one advantage of using the reduction clause: since it's a black box, it may do the reduction in log2(omp_get_num_threads()) steps. OpenMP 4.0 allows defining custom reductions, but it nevertheless still assumes the operations are commutative, so it's not good for e.g. chain matrix multiplication. I'm not aware of an easy way with OpenMP to do the reduction in log2(omp_get_num_threads()) steps when the operations don't commute.
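
As a hedged sketch of what such a log2-step reduction could look like when coded by hand (my addition, not from the original answer; Mat2, combine and tree_reduce are hypothetical names), each pass combines adjacent pairs of partial results, so operand order is preserved even though the op is only associative, not commutative:

// compile with -fopenmp; a manual tree over the partial results,
// not the OpenMP reduction clause
typedef struct { double m[2][2]; } Mat2;

// associative but non-commutative combine: the matrix product a*b
Mat2 combine(Mat2 a, Mat2 b) {
    Mat2 r;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            r.m[i][j] = a.m[i][0]*b.m[0][j] + a.m[i][1]*b.m[1][j];
    return r;
}

// reduce partial[0..n-1] left-to-right in ceil(log2(n)) passes;
// every pass combines adjacent pairs, so operand order is preserved
void tree_reduce(Mat2 *partial, int n) {
    for (int stride = 1; stride < n; stride *= 2) {
        #pragma omp parallel for    // pairs within one pass are independent
        for (int i = 0; i < n - stride; i += 2*stride)
            partial[i] = combine(partial[i], partial[i + stride]);
    }
    // partial[0] now holds partial[0]*partial[1]*...*partial[n-1]
}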

Hectorhecuba answered 18/2, 2016 at 11:17 Comment(8)
Nice answer, I hadn't thought of any other benefits besides saving mov instructions when an instruction or function-call cares which operand is which.Pincushion
Good note on the impact on higher-level code indeed. I suspect something similar can be constructed for commutable std::futures - giving more degrees of freedom for compositions and continuations, etc.Supportive
@LeoHeinsaar, can you give an example of what you mean? My C++ is rusty and I think mostly like a C coder now.Hectorhecuba
I meant something like this: decomposing a computation into commutable tasks (more precisely: tasks with commutable execution results; execution results are modeled with std::futures in C++11) x = std::async(f) and y = std::async(g) will allow you to express concurrency using the parallel constructs when_any(x, y) and when_all(x, y), because if the execution results x and y are not commutable, you might have to use the (possibly equivalent) x.then(y) computation, which is not parallel with respect to x and y.Supportive
If you had operations that were associative but not commutative (like you're considering here with reductions), the smart thing would be to give each thread a separate output variable so they can all write their outputs in parallel. Then have the thread that wants the result collect those results and sum them in the right order. Not much different from having each thread atomically add to a shared variable in uncontrolled order, as far as performance goes, unless the number of threads is ridiculous.Pincushion
@PeterCordes, that's exactly what I did at the end of my answer with OpenMP. See sum_partial in the code.Hectorhecuba
On 2nd look at this, if you had more than 2 threads, I think arbitrary reordering of accumulating into the total requires associativity. Like (c + a) + b vs. (a + b) + c. It's only with exactly 2 results that commutativity is relevant. And then only if the sum=0 initializer is truly an identity element, so 0 + a or b is truly identical to a, including signed-zero semantics or whatever, so (0+a)+b and (0+b)+a reduce to a+b and b+a. (Easy to imagine non-IEEE FP where x+y takes the sign of the left operand for 0 + -0 instead of actual IEEE where either order is +0.)Pincushion
Oh, I see, if it's already associative to enable parallelization / vectorization. We're considering an operation that's associative but not commutative. Like matrix multiplication. The FP example had me mixed up because it's already an approximation for FP math to be associative, and I had just been looking at Is floating point addition commutative in C++? where the answers consider hypothetical non-IEEE FP math with non-commutative + (which would only be a minor difference, e.g. for signed zero).Pincushion