Parallel Framework and avoiding false sharing
Asked Answered
A

1

12

Recently, I had answered a question about optimizing a likely parallelizable method for generation every permutation of arbitrary base numbers. I posted an answer similar to the Parallelized, poor implementation code block list, and someone nearly immediately pointed this out:

This is pretty much guaranteed to give you false sharing and will probably be many times slower. (credit to gjvdkamp)

and they were right, it was death slow. That said, I researched the topic, and found some interesting material and suggestions (archived MSDN magazine only, .NET Matters: False Sharing) for combating it. If I understand it correctly, when threads access contiguous memory (in say, the array that's likely backing that ConcurrentStack), false sharing likely occurs.


For code below the horizontal rule, a Bytes is:

struct Bytes {
  public byte A; public byte B; public byte C; public byte D;
  public byte E; public byte F; public byte G; public byte H;
}

For my own testing, I wanted to get a parallel version of this running and be genuinely faster, so I created a simple example based on the original code. 6 as limits[0] was a lazy choice on my part - my computer has 6 cores.

Single thread block Average run time: 10s0059ms

  var data = new List<Bytes>();
  var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

  for (byte a = 0; a < limits[0]; a++)
  for (byte b = 0; b < limits[1]; b++)
  for (byte c = 0; c < limits[2]; c++)
  for (byte d = 0; d < limits[3]; d++)
  for (byte e = 0; e < limits[4]; e++)
  for (byte f = 0; f < limits[5]; f++)
  for (byte g = 0; g < limits[6]; g++)
  for (byte h = 0; h < limits[7]; h++)
    data.Add(new Bytes {
      A = a, B = b, C = c, D = d, 
      E = e, F = f, G = g, H = h
    });

Parallelized, poor implementation Run time avg: 81s729ms, ~ 8700 contentions

  var data = new ConcurrentStack<Bytes>();
  var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

  Parallel.For(0, limits[0], (a) => {
    for (byte b = 0; b < limits[1]; b++)
    for (byte c = 0; c < limits[2]; c++)
    for (byte d = 0; d < limits[3]; d++)
    for (byte e = 0; e < limits[4]; e++)
    for (byte f = 0; f < limits[5]; f++)
    for (byte g = 0; g < limits[6]; g++)
    for (byte h = 0; h < limits[7]; h++)
      data.Push(new Bytes {
        A = (byte)a,B = b,C = c,D = d,
        E = e,F = f,G = g,H = h
      });
  }); 

Parallelized, ?? implementation Run time avg: 5s833ms, 92 contentions

  var data = new ConcurrentStack<List<Bytes>>();
  var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

  Parallel.For (0, limits[0], () => new List<Bytes>(), 
    (a, loop, localList) => { 
      for (byte b = 0; b < limits[1]; b++)
      for (byte c = 0; c < limits[2]; c++)
      for (byte d = 0; d < limits[3]; d++)
      for (byte e = 0; e < limits[4]; e++)
      for (byte f = 0; f < limits[5]; f++)
      for (byte g = 0; g < limits[6]; g++)
      for (byte h = 0; h < limits[7]; h++)
        localList.Add(new Bytes {
          A = (byte)a, B = b, C = c, D = d,
          E = e, F = f, G = g, H = h
        });
      return localList;
  }, x => {
    data.Push(x);
  });

I'm glad that I had got an implementation that is faster than the single threaded version. I expected a result closer to around 10s / 6, or around 1.6 seconds, but that's probably a naive expectation.

My question is for the parallelized implementation that is actually faster than the single-threaded version, are there further optimizations that could be a applied to the operation? I'm wondering about optimizations related to parallelization, not improvements to the algorithm used to compute the values. Specifically:

  • I know about the optimization to store and populate as a struct instead of byte[], but it's not related to parallelization (or is it?)
  • I know that a desired value could be lazy evaluated with a ripple-carry adder, but same as the struct optimization.
Afteryears answered 23/4, 2015 at 5:3 Comment(6)
Are you better off posting this in programmers ? Better yet make 1 a Golf challengeElysium
@lloydm what's the problem of having this question in stackoverflow? It's great that there are at least some interesting, challenging questions here rather than just one million of error messages or syntax problemsPersecute
@Persecute There is no doubt it's interesting and challenging. I've learnt about false sharing. Having read the valid questions again I agree it ticks the boxes as a valid question.Elysium
Downvoter, how can I improve my question?Afteryears
Your implementation is also not optimal for List. You know exactly how many elements you need in the list, so you could set the capacity in the constructor and prevent unnecessary allocations.Pickar
@mikez That's a great optimization, it cut the run time average to 1s689ms average - that's likely the fastest it will get.Afteryears
A
2

First off, my initial assumption regarding Parallel.For() and Parallel.ForEach() was wrong.

The poor parallel implementation very likely has 6 threads all attempting to write to a single CouncurrentStack() at once. The good implementation usuing thread locals (explained more below) only accesses the shared variable once per task, nearly eliminating any contention.

When using Parallel.For() and Parallel.ForEach(), you cannot simply in-line replace a for or foreach loop with them. That's not to say it couldn't be a blind improvement, but without examining the problem and instrumenting it, using them is throwing multithreading at a problem because it might make it faster.

**Parallel.For() and Parallel.ForEach() has overloads that allow you to create a local state for the Task they ultimately create, and run an expression before and after each iteration's execution.

If you have an operation you parallelize with Parallel.For() or Parallel.ForEach(), it's likely a good idea to use this overload:

public static ParallelLoopResult For<TLocal>(
    int fromInclusive,
    int toExclusive,
    Func<TLocal> localInit,
    Func<int, ParallelLoopState, TLocal, TLocal> body,
    Action<TLocal> localFinally
)

For example, calling For() to sum all integers from 1 to 100,

var total = 0;

Parallel.For(0, 101, () => 0,  // <-- localInit
(i, state, localTotal) => { // <-- body
  localTotal += i;
  return localTotal;
}, localTotal => { <-- localFinally
  Interlocked.Add(ref total, localTotal);
});

Console.WriteLine(total);

localInit should be an lambda that initializes the local state type, which is passed to the body and localFinally lambdas. Please note I am not recommending implementing summing 1 to 100 using parallelization, but just have a simple example to make the example short.

Afteryears answered 25/4, 2015 at 6:20 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.