Recently, I answered a question about optimizing a likely-parallelizable method for generating every permutation of arbitrary-base numbers. I posted an answer similar to the "Parallelized, poor implementation" listing below, and someone almost immediately pointed this out:
"This is pretty much guaranteed to give you false sharing and will probably be many times slower." (credit to gjvdkamp)
and they were right: it was dead slow. So I researched the topic and found some interesting material and suggestions for combating it (archived MSDN Magazine only: .NET Matters: False Sharing). If I understand it correctly, when threads access contiguous memory (in, say, the array that's likely backing that ConcurrentStack), false sharing likely occurs.
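To make the effect concrete, here is a minimal sketch of the classic pattern (my own illustration, not from the question I answered): several threads incrementing counters that sit in the same cache line, versus counters padded a full line apart. The names, and the assumption of 64-byte cache lines (8 longs), are mine:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingSketch {
    const int Iterations = 100_000_000;

    // Shared: all counters live in one small array, so they almost
    // certainly occupy a single cache line; every increment from one
    // core invalidates that line in the other cores' caches.
    static void SharedCounters(int threads) {
        var counters = new long[threads];
        Parallel.For(0, threads, t => {
            for (int i = 0; i < Iterations; i++)
                counters[t]++;
        });
    }

    // Padded: each counter is spaced 8 longs (64 bytes) apart,
    // so the threads stop fighting over the same line.
    static void PaddedCounters(int threads) {
        var counters = new long[threads * 8];
        Parallel.For(0, threads, t => {
            for (int i = 0; i < Iterations; i++)
                counters[t * 8]++;
        });
    }

    static void Main() {
        var sw = Stopwatch.StartNew();
        SharedCounters(Environment.ProcessorCount);
        Console.WriteLine($"shared: {sw.ElapsedMilliseconds}ms");

        sw.Restart();
        PaddedCounters(Environment.ProcessorCount);
        Console.WriteLine($"padded: {sw.ElapsedMilliseconds}ms");
    }
}

On typical hardware the padded version runs several times faster, even though both do the same number of increments - the same mechanism the comment above warns about.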
For code below the horizontal rule, a Bytes is:
struct Bytes {
    public byte A; public byte B; public byte C; public byte D;
    public byte E; public byte F; public byte G; public byte H;
}
For my own testing, I wanted a parallel version of this that was genuinely faster, so I created a simple example based on the original code. Using 6 as limits[0] was a lazy choice on my part - my computer has 6 cores.
Single-threaded block - average run time: 10s0059ms
var data = new List<Bytes>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

for (byte a = 0; a < limits[0]; a++)
    for (byte b = 0; b < limits[1]; b++)
        for (byte c = 0; c < limits[2]; c++)
            for (byte d = 0; d < limits[3]; d++)
                for (byte e = 0; e < limits[4]; e++)
                    for (byte f = 0; f < limits[5]; f++)
                        for (byte g = 0; g < limits[6]; g++)
                            for (byte h = 0; h < limits[7]; h++)
                                data.Add(new Bytes {
                                    A = a, B = b, C = c, D = d,
                                    E = e, F = f, G = g, H = h
                                });
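For scale, that loop nest produces 6 · 16 · 16 · 16 · 32 · 8 · 8 · 8 = 402,653,184 Bytes values; at 8 bytes per struct that's roughly 3 GiB of raw data before the List's growth overhead, so I'd guess a large share of those 10 seconds is allocation and memory traffic rather than the loop arithmetic itself.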
Parallelized, poor implementation - run time avg: 81s729ms, ~8700 contentions
var data = new ConcurrentStack<Bytes>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

Parallel.For(0, limits[0], (a) => {
    for (byte b = 0; b < limits[1]; b++)
        for (byte c = 0; c < limits[2]; c++)
            for (byte d = 0; d < limits[3]; d++)
                for (byte e = 0; e < limits[4]; e++)
                    for (byte f = 0; f < limits[5]; f++)
                        for (byte g = 0; g < limits[6]; g++)
                            for (byte h = 0; h < limits[7]; h++)
                                data.Push(new Bytes {
                                    A = (byte)a, B = b, C = c, D = d,
                                    E = e, F = f, G = g, H = h
                                });
});
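My reading of those numbers: every one of the ~400 million Push calls does an interlocked compare-and-swap on the stack's single head pointer, so all six workers are continuously writing the same memory from different cores. As far as I can tell, the ~8700 contentions and the false-sharing warning quoted above are two views of the same underlying problem: concurrent writes to hot, shared cache lines.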
Parallelized, ?? implementation - run time avg: 5s833ms, 92 contentions
var data = new ConcurrentStack<List<Bytes>>();
var limits = new byte[] { 6, 16, 16, 16, 32, 8, 8, 8 };

Parallel.For(0, limits[0], () => new List<Bytes>(),
    (a, loop, localList) => {
        for (byte b = 0; b < limits[1]; b++)
            for (byte c = 0; c < limits[2]; c++)
                for (byte d = 0; d < limits[3]; d++)
                    for (byte e = 0; e < limits[4]; e++)
                        for (byte f = 0; f < limits[5]; f++)
                            for (byte g = 0; g < limits[6]; g++)
                                for (byte h = 0; h < limits[7]; h++)
                                    localList.Add(new Bytes {
                                        A = (byte)a, B = b, C = c, D = d,
                                        E = e, F = f, G = g, H = h
                                    });
        return localList;
    }, x => {
        data.Push(x);
    });
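For context, this is the thread-local-state overload of Parallel.For: each worker accumulates into its own List<Bytes> (the localInit delegate creates it, the body fills it), and the shared stack is touched only once per worker in the localFinally delegate - which is presumably why contentions drop from ~8700 to 92. One consequence is that the result is now a stack of lists rather than a flat collection; if a flat sequence is needed, something like this works (a sketch - SelectMany enumerates lazily, and the order differs from the single-threaded version):

using System.Collections.Generic;
using System.Linq;

// Flatten the per-worker lists back into one sequence.
IEnumerable<Bytes> flat = data.SelectMany(list => list);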
I'm glad that I got an implementation that is faster than the single-threaded version. I expected a result closer to 10s / 6, around 1.6 seconds, but that's probably a naive expectation.
My question is: for the parallelized implementation that is actually faster than the single-threaded version, are there further optimizations that could be applied to the operation? I'm wondering about optimizations related to parallelization, not improvements to the algorithm used to compute the values. Specifically:
- I know about the optimization to store and populate as a struct instead of a byte[], but it's not related to parallelization (or is it?)
- I know that a desired value could be lazily evaluated with a ripple-carry adder, but that's in the same boat as the struct optimization (see the sketch after this list).
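Re the second bullet, the lazy-evaluation idea can be sketched as a mixed-radix decode - the loop equivalent of a ripple-carry adder: given a flat index, peel one digit off per position, least significant (h) first. ValueAt is a hypothetical helper name, not something from the original code:

// Decode the n-th value directly instead of materializing all of them.
// digits[7] (h) is the fastest-varying position, matching the loop nest.
static Bytes ValueAt(long index, byte[] limits) {
    var digits = new byte[8];
    for (int i = 7; i >= 0; i--) {
        digits[i] = (byte)(index % limits[i]);
        index /= limits[i];
    }
    return new Bytes {
        A = digits[0], B = digits[1], C = digits[2], D = digits[3],
        E = digits[4], F = digits[5], G = digits[6], H = digits[7]
    };
}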