Parallel Framework and avoiding false sharing

Recently, I had answered a question about optimizing a likely parallelizable method for generation every permutation of arbitrary base numbers. I posted an answer similar to the Parallelized, poor implementation code block list, and someone nearly immediately pointed this out:

This is pretty much guaranteed to give you false sharing and will probably be many times slower. (credit to gjvdkamp)

and they were right, it was death slow. That said, I researched the topic, and found some interesting material and suggestions (archived MSDN magazine only, .NET Matters: False Sharing) for combating it. If I understand it correctly, when threads access contiguous memory (in say, the array that's likely backing that ConcurrentStack), false sharing likely occurs.

For code below the horizontal rule, a Bytes is:

struct Bytes {
  public byte A; public byte B; public byte C; public byte D;
  public byte E; public byte F; public byte G; public byte H;
}

For my own testing, I wanted to get a parallel version of this running and be genuinely faster, so I created a simple example based on the original code. 6 as limits[0] was a lazy choice on my part - my computer has 6 cores.

Single thread block Average run time: 10s0059ms

  var data = new List<Bytes>();
  var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

  for (byte a = 0; a < limits[0]; a++)
  for (byte b = 0; b < limits[1]; b++)
  for (byte c = 0; c < limits[2]; c++)
  for (byte d = 0; d < limits[3]; d++)
  for (byte e = 0; e < limits[4]; e++)
  for (byte f = 0; f < limits[5]; f++)
  for (byte g = 0; g < limits[6]; g++)
  for (byte h = 0; h < limits[7]; h++)
    data.Add(new Bytes {
      A = a, B = b, C = c, D = d, 
      E = e, F = f, G = g, H = h
    });

Parallelized, poor implementation Run time avg: 81s729ms, ~ 8700 contentions

  var data = new ConcurrentStack<Bytes>();
  var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

  Parallel.For(0, limits[0], (a) => {
    for (byte b = 0; b < limits[1]; b++)
    for (byte c = 0; c < limits[2]; c++)
    for (byte d = 0; d < limits[3]; d++)
    for (byte e = 0; e < limits[4]; e++)
    for (byte f = 0; f < limits[5]; f++)
    for (byte g = 0; g < limits[6]; g++)
    for (byte h = 0; h < limits[7]; h++)
      data.Push(new Bytes {
        A = (byte)a,B = b,C = c,D = d,
        E = e,F = f,G = g,H = h
      });
  });

Parallelized, ?? implementation Run time avg: 5s833ms, 92 contentions

  var data = new ConcurrentStack<List<Bytes>>();
  var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

  Parallel.For (0, limits[0], () => new List<Bytes>(), 
    (a, loop, localList) => { 
      for (byte b = 0; b < limits[1]; b++)
      for (byte c = 0; c < limits[2]; c++)
      for (byte d = 0; d < limits[3]; d++)
      for (byte e = 0; e < limits[4]; e++)
      for (byte f = 0; f < limits[5]; f++)
      for (byte g = 0; g < limits[6]; g++)
      for (byte h = 0; h < limits[7]; h++)
        localList.Add(new Bytes {
          A = (byte)a, B = b, C = c, D = d,
          E = e, F = f, G = g, H = h
        });
      return localList;
  }, x => {
    data.Push(x);
  });

I'm glad that I had got an implementation that is faster than the single threaded version. I expected a result closer to around 10s / 6, or around 1.6 seconds, but that's probably a naive expectation.

My question is for the parallelized implementation that is actually faster than the single-threaded version, are there further optimizations that could be a applied to the operation? I'm wondering about optimizations related to parallelization, not improvements to the algorithm used to compute the values. Specifically:

I know about the optimization to store and populate as a struct instead of byte[], but it's not related to parallelization (or is it?)
I know that a desired value could be lazy evaluated with a ripple-carry adder, but same as the struct optimization.

First off, my initial assumption regarding Parallel.For() and Parallel.ForEach() was wrong.

The poor parallel implementation very likely has 6 threads all attempting to write to a single CouncurrentStack() at once. The good implementation usuing thread locals (explained more below) only accesses the shared variable once per task, nearly eliminating any contention.

When using Parallel.For() and Parallel.ForEach(), you cannot simply in-line replace a for or foreach loop with them. That's not to say it couldn't be a blind improvement, but without examining the problem and instrumenting it, using them is throwing multithreading at a problem because it might make it faster.

**Parallel.For() and Parallel.ForEach() has overloads that allow you to create a local state for the Task they ultimately create, and run an expression before and after each iteration's execution.

If you have an operation you parallelize with Parallel.For() or Parallel.ForEach(), it's likely a good idea to use this overload:

public static ParallelLoopResult For<TLocal>(
    int fromInclusive,
    int toExclusive,
    Func<TLocal> localInit,
    Func<int, ParallelLoopState, TLocal, TLocal> body,
    Action<TLocal> localFinally
)

For example, calling For() to sum all integers from 1 to 100,

var total = 0;

Parallel.For(0, 101, () => 0,  // <-- localInit
(i, state, localTotal) => { // <-- body
  localTotal += i;
  return localTotal;
}, localTotal => { <-- localFinally
  Interlocked.Add(ref total, localTotal);
});

Console.WriteLine(total);

localInit should be an lambda that initializes the local state type, which is passed to the body and localFinally lambdas. Please note I am not recommending implementing summing 1 to 100 using parallelization, but just have a simple example to make the example short.

Recommended topics

Hot tags