Save time with parallel FOR loop

I have a question concerning parallel for loops. I have the following code:

    public static void MultiplicateArray(double[] array, double factor)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = array[i] * factor;
        }
    }

    public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
    {
        for (int i = 0; i < arrayToChange.Length; i++)
        {
            arrayToChange[i] = arrayToChange[i] * multiplication[i];
        }
    }

    public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
    {
        for (int i = 0; i < arrayToChange.Length; i++)
        {
            arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
        }
    }

Now I have tried to add a parallel version:

    public static void MultiplicateArray(double[] array, double factor)
    {
        Parallel.For(0, array.Length, i =>
            {
                array[i] = array[i] * factor;
            });
    }

    public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
    {
        Parallel.For(0, arrayToChange.Length, i =>
        {
            arrayToChange[i] = arrayToChange[i] * multiplication[i];
        });
    }

    public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
    {
        Parallel.For(0, arrayToChange.Length, i =>
        {
            arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
        });
    }

The issue is that I want to save time, not waste it. With the standard for loop the computation takes about 2 minutes, but with the parallel for loop it takes 3 minutes. Why?

Klingel answered 13/9, 2012 at 12:9
How big are your arrays? They must be huge if code like this takes several minutes. Isn't there some other way to parallelize the code? – Pugging
What is your hardware? In addition, parallel tasks are very efficient for I/O operations (writing to disk, calling a web service, ...), not for CPU-intensive "worker" tasks. – Averil
@Averil I think you're wrong. Parallelizing CPU-intensive tasks can be very effective (if you have a multicore computer). On the other hand, parallelizing writes to a disk will most likely degrade performance, not improve it, because you are forcing the disk to constantly seek between several positions. – Pugging
@Pugging: yes, of course! That's why I asked about the hardware. Disk I/O is maybe a poor example, but if your files were in different locations, on different network shares, or on different physical hard drives, then parallel tasks would probably help. I just wanted to draw an analogy with worker/background threads. – Averil

Parallel.For() can improve performance a lot by parallelizing your code, but it also has overhead (synchronization between threads, invoking the delegate on each iteration). And since in your code, each iteration is very short (basically, just a few CPU instructions), this overhead can become prominent.

Because of this, I don't think Parallel.For() is the right solution for you. Instead, if you parallelize your code manually (which is very simple in this case), you may see the performance improve.

To verify this, I performed some measurements: I ran different implementations of MultiplicateArray() on an array of 200 000 000 items (the code I used is below). On my machine, the serial version consistently took 0.21 s and Parallel.For() usually took something around 0.45 s, but from time to time, it spiked to 8–9 s!

First, I'll try to improve the common case, and I'll come back to those spikes later. We want to process the array with N CPUs, so we split it into N equally sized parts and process each part separately. The result? 0.35 s. That's still worse than the serial version. But a for loop over each item of an array is one of the most optimized constructs there is. Can't we do something to help the compiler? Extracting the computation of the loop's upper bound into a local variable could help. It turns out it does: 0.18 s. That's better than the serial version, but not by much. And, interestingly, changing the degree of parallelism from 4 to 2 on my 4-core machine (no HyperThreading) doesn't change the result: still 0.18 s. This makes me conclude that the CPU is not the bottleneck here; memory bandwidth is.

Now, back to the spikes: my custom parallelization doesn't have them, but Parallel.For() does. Why? Parallel.For() does use range partitioning, which means each thread processes its own part of the array. But if one thread finishes early, it will try to help process the range of another thread that hasn't finished yet. If that happens, you get a lot of false sharing, which can slow down the code a lot. My own test with forced false sharing seems to indicate this could indeed be the problem. Forcing the degree of parallelism of Parallel.For() seems to help with the spikes a little.

Of course, all those measurements are specific to the hardware on my computer and will be different for you, so you should make your own measurements.

The code I used:

static void Main()
{
    double[] array = new double[200 * 1000 * 1000];

    for (int i = 0; i < array.Length; i++)
        array[i] = 1;

    for (int i = 0; i < 5; i++)
    {
        Stopwatch sw = Stopwatch.StartNew();
        Serial(array, 2);
        Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelFor(array, 2);
        Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        ParallelForDegreeOfParallelism(array, 2);
        Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallel(array, 2);
        Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMax(array, 2);
        Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelExtractedMaxHalfParallelism(array, 2);
        Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

        sw = Stopwatch.StartNew();
        CustomParallelFalseSharing(array, 2);
        Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
    }
}

static void Serial(double[] array, double factor)
{
    for (int i = 0; i < array.Length; i++)
    {
        array[i] = array[i] * factor;
    }
}

static void ParallelFor(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, i => { array[i] = array[i] * factor; });
}

static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
    Parallel.For(
        0, array.Length, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        i => { array[i] = array[i] * factor; });
}

static void CustomParallel(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelExtractedMax(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount / 2;

    var tasks = new Task[degreeOfParallelism];

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        // capturing taskNumber in lambda wouldn't work correctly
        int taskNumberCopy = taskNumber;

        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
                for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
                    i < max;
                    i++)
                {
                    array[i] = array[i] * factor;
                }
            });
    }

    Task.WaitAll(tasks);
}

static void CustomParallelFalseSharing(double[] array, double factor)
{
    var degreeOfParallelism = Environment.ProcessorCount;

    var tasks = new Task[degreeOfParallelism];

    int i = -1;

    for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
    {
        tasks[taskNumber] = Task.Factory.StartNew(
            () =>
            {
                int j = Interlocked.Increment(ref i);
                while (j < array.Length)
                {
                    array[j] = array[j] * factor;
                    j = Interlocked.Increment(ref i);
                }
            });
    }

    Task.WaitAll(tasks);
}

Example output:

Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Pugging answered 13/9, 2012 at 13:42
BTW, a similar speedup to manual parallelization could probably be achieved using Parallel.ForEach() and Partitioner.Create(). With that, each iteration of Parallel.ForEach() gets a range of indices, which is likely going to be much faster than a plain Parallel.For(). – Pugging
Indeed, your extracted-max version was the fastest, unlike ForEach. I have done tests using primitive calculations on value types with unsafe parallel code (x64, Release); try it and see how far it is from the rest. – Submiss
Did you still get the long delays if you declared your array as fixed, so you won't have the garbage collector running through it? I do graphics in C#; without parallelism I wouldn't have a job. – Peraea
@user3800527 It's a single array of doubles. The GC has pretty much nothing to do there. – Pugging
There must be something wrong with your PC; you can't have code that sometimes takes 8 to 9 seconds and other times 0.45 seconds and blame it on parallelism. – Peraea

Svick already provided a great answer, but I'd like to emphasize that the key point is not to "parallelize your code manually" instead of using Parallel.For(), but that you have to process larger chunks of data.

This can still be done using Parallel.For() like this:

static void My(double[] array, double factor)
{
    int degreeOfParallelism = Environment.ProcessorCount;

    Parallel.For(0, degreeOfParallelism, workerId =>
    {
        var max = array.Length * (workerId + 1) / degreeOfParallelism;
        for (int i = array.Length * workerId / degreeOfParallelism; i < max; i++)
            array[i] = array[i] * factor;
    });
}

which does the same thing as svick's CustomParallelExtractedMax() but is shorter, simpler and (on my machine) performs even slightly faster:

Serial: 3,94 s
Parallel.For: 9,28 s
Parallel.For (degree of parallelism): 9,58 s
Custom parallel: 2,05 s
Custom parallel (extracted max): 1,19 s
Custom parallel (extracted max, half parallelism): 1,49 s
Custom parallel (false sharing): 27,88 s
My: 0,95 s

Btw, the keyword for this, which is missing from all the other answers, is granularity.

Crafton answered 28/4, 2014 at 9:22
You checked in Debug mode, didn't you? – Pung
I don't know; it was over a year ago. Why? Do you get different results? – Crafton
I get pretty much the same in Debug mode, but when I select Release mode your function gets the same values as the Serial version. When you benchmark, I advise you to always use Release mode, because it optimizes the code as it will run in real life ;) – Pung
I know :) But it's quite possible that I didn't know (or forgot) back then. I'll see if I can repeat the benchmark and update the numbers. – Crafton
Thanks to the unknown upvoter for reminding me I had to redo the benchmarks ;) – Crafton
@Thomas: I redid the benchmark today using Release mode and the results are pretty much the same. – Crafton

See Custom Partitioners for PLINQ and TPL:

In a For loop, the body of the loop is provided to the method as a delegate. The cost of invoking that delegate is about the same as a virtual method call. In some scenarios, the body of a parallel loop might be small enough that the cost of the delegate invocation on each loop iteration becomes significant. In such situations, you can use one of the Create overloads to create an IEnumerable<T> of range partitions over the source elements. Then, you can pass this collection of ranges to a ForEach method whose body consists of a regular for loop. The benefit of this approach is that the delegate invocation cost is incurred only once per range, rather than once per element.

In your loop body you are performing only a single multiplication, so the overhead of the delegate call will be very noticeable.

Try this:

public static void MultiplicateArray(double[] array, double factor)
{
    var rangePartitioner = Partitioner.Create(0, array.Length);

    Parallel.ForEach(rangePartitioner, range =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            array[i] = array[i] * factor;
        }
    });
}

See also: Parallel.ForEach documentation and Partitioner.Create documentation.

Calque answered 6/2, 2014 at 21:55

Parallel.For involves more complex memory management. The result can vary depending on CPU specs such as the number of cores and the L1 and L2 cache sizes.

Please take a look at this interesting article:

http://msdn.microsoft.com/en-us/magazine/cc872851.aspx
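To make the cache effects this answer alludes to concrete, here is a minimal, hypothetical sketch (not code from the thread) that times two threads incrementing counters that sit on the same 64-byte cache line versus counters padded onto separate cache lines; on most multicore machines the adjacent version is noticeably slower because the cores keep invalidating each other's cache line:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingDemo
{
    // Two tasks each increment one slot of the array 100 million times.
    static long TimeIncrements(long[] counters, int indexA, int indexB)
    {
        var sw = Stopwatch.StartNew();
        Task a = Task.Run(() => { for (int i = 0; i < 100_000_000; i++) counters[indexA]++; });
        Task b = Task.Run(() => { for (int i = 0; i < 100_000_000; i++) counters[indexB]++; });
        Task.WaitAll(a, b);
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        var counters = new long[64];
        // Adjacent elements: same cache line, so the threads falsely share it.
        Console.WriteLine($"Adjacent: {TimeIncrements(counters, 0, 1)} ms");
        // 8 longs = 64 bytes apart: separate cache lines, no false sharing.
        Console.WriteLine($"Padded:   {TimeIncrements(counters, 0, 8)} ms");
    }
}
```

The exact ratio depends on the hardware, but the timing gap is the same mechanism the article (and the spikes discussed above) describe.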

Teddman answered 13/9, 2012 at 13:25
I don't think false sharing is the main problem here. Parallel.For() uses range partitioning, which avoids this problem. – Pugging
Okay, I partially take back what I said. False sharing can be a problem, especially if you don't limit the degree of parallelism here, because then range partitioning doesn't really work. – Pugging
On the same subject: blogs.msdn.com/b/pfxteam/archive/2009/05/26/9641563.aspx – Dalmatic

From http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.aspx and http://msdn.microsoft.com/en-us/library/dd537608.aspx:

You are not creating three threads that each execute one of your for loops; rather, the iterations of a single for loop are executed in parallel. So even with only one loop you are using multiple threads.

This means that the iterations with index = 0 and index = 1 may be executed at the same time.

Probably you are forcing the use of too many threads, and the overhead of creating and scheduling them is bigger than the speed gain.

Try running the three normal for loops in three different threads instead; if your system is multicore (at least 3 cores), it should take less than a minute.
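The suggestion above can be sketched as follows. This is a hypothetical illustration (not code from the answer): it assumes three independent arrays, each scaled by its own serial loop, with the three loops running concurrently on separate tasks:

```csharp
using System;
using System.Threading.Tasks;

class RunLoopsConcurrently
{
    // Plain serial loop, as in the question's original MultiplicateArray.
    static void Scale(double[] array, double factor)
    {
        for (int i = 0; i < array.Length; i++)
            array[i] *= factor;
    }

    static void Main()
    {
        var a = new double[10_000_000];
        var b = new double[10_000_000];
        var c = new double[10_000_000];

        // One task per independent loop; each loop runs serially on its own core,
        // so there is no per-iteration delegate overhead at all.
        Task.WaitAll(
            Task.Run(() => Scale(a, 2.0)),
            Task.Run(() => Scale(b, 3.0)),
            Task.Run(() => Scale(c, 4.0)));

        Console.WriteLine("done");
    }
}
```

This only helps when the loops are truly independent; a single loop still has to be partitioned as shown in the other answers.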

Draff answered 13/9, 2012 at 13:33
I'm not sure what you are trying to say, but Parallel.For() does use multiple threads (processes are something different). – Pugging
Parallel.For divides your for loop into one or more loops running on different threads. He is trying to run three for loops in three different threads. – Draff
