Why is my AsOrdered PLINQ query faster than my unordered one
Asked Answered
R

2

6

I wrote some basic sample code to familiarise myself with PLINQ.

I came across something weird. I don't know if it's an error in my code or an error in my understanding of PLINQ.

The MSDN documentation states that adding AsOrdered() will preserve the order of the call at the possible cost of performance.

I wrote some unit tests and noticed the effect on the order on the result set as stated in the documentation. But I have seen the inverse effect on performance.

Here are both my method:

public IEnumerable<int> ParallelCalculatePrimesUpTo(int maxValue)
{
    return from number in Enumerable.Range(1, maxValue).AsParallel()
            where IsPrime(number)
            select number;
}

public IEnumerable<int> OrderedParallelCalculatePrimesUpTo(int maxValue)
{
    return from number in Enumerable.Range(1, maxValue).AsParallel().AsOrdered()
            where IsPrime(number)
            select number;
}

And my very simple benchmarks

    [TestMethod]
    public void SimplisticBenchmark6()
    {
        var primeNumberCalculator = new PrimeNumberCalculator();

        var startTime = DateTime.Now;

        primeNumberCalculator.ParallelCalculatePrimesUpTo(10000000).ToList();

        var totalTime = DateTime.Now - startTime;

        Console.WriteLine(totalTime);
    }

    [TestMethod]
    public void SimplisticBenchmark7()
    {
        var primeNumberCalculator = new PrimeNumberCalculator();

        var startTime = DateTime.Now;

        primeNumberCalculator.OrderedParallelCalculatePrimesUpTo(10000000).ToList();

        var totalTime = DateTime.Now - startTime;

        Console.WriteLine(totalTime);
    }

No matter how often I run this test, the ordered version beats out the unordered one. I get about 4 seconds quicker for the ordered one on my quad core computer. I am getting about 18 seconds for the ordered one and 22 seconds for the unordered one. I have run the tests dozens of time over the course of two days (with reboots between those days).

If I lower the number 10 000 000 to 6 000 000, the differences is still there but less noticeable and if I lower it to 3 000 000, it is about the same speed.

I tried running the tests in both order of execution and the results are the same.

Here is the IsPrime method that gets called in the PLINQ query:

// uses inneficient trial division algorithm
private bool IsPrime(int number)
{
    if (number == 1)
        return false;

    for (int divisor = 2; divisor <= Math.Sqrt(number); divisor++)
    {
        if (number % divisor == 0)
            return false;
    }

    return true;
}

What explains this?

Romola answered 6/6, 2012 at 0:10 Comment(4)
As a side note, DateTime is not very good for performance measurements, using StopWatch is better.Mogul
@svick: Very good point. I knew DateTime wasn't good for serious measurements but didn't know about StopWatch. Thanks. For production code it's better to use a profiler, but this was just quick and dirty code I wrote as I read the PLINQ and TPL doc. Next time I will use StopWatch in such a situation!Romola
If I run your code, ordered is slightly slower (5.67 s vs. 5.66 s for unordered, averaged over several attempts), but the variance between the attempts is higher than the difference between ordered and unordered, so I think that there is no statistically significant difference in this case (at least as measured on my quad-core).Mogul
@svick: I will update my post with my average times: 22 seconds for the unordered and 18 seconds for the ordered one. You have a fast machine.Romola
M
4

Do You Always Run The Tests In The Same Order?

I've recreated your results on my machine and I found that, no matter, the 'Ordered' results were faster. I used slightly modified code for benchmarking:

static void Main(string[] args)
{
    const int size = 9000000;
    BenchIt("Parallel", ParallelCalculatePrimesUpTo, size);
    BenchIt("Ordered ", OrderedParallelCalculatePrimesUpTo, size);
    Console.ReadKey();
}

public static void BenchIt(string desc, Func<int, IEnumerable<int>> myFunc, int size)
{
    var sw = new Stopwatch();            
    sw.Restart();
    myFunc.Invoke(size).ToList();
    sw.Stop();
    Console.WriteLine("{0} {1}",desc, sw.Elapsed);
}

My results showed, initially, that you were correct. The ordered method was faster. However, if I SWAPPED the order of the calls, I found that the non-ordered method was faster. In other words, whichever one went second was faster. Presumably, because of the thread-pool management that the Task Parallel Library is doing.

But - the differences between the two on my machine were very small. Nowhere near the amount of difference you saw.

What's Your Hardware Look Like?

PLINQ does some guessing about how to execute the fastest. I don't know if this will directly help you in this case; but you might want to set a break point in the middle of IsPrime and stop on it after a few hundred iterations and examine the thread window.

How many threads do you have when executing ParallelCalculatedPrimesUpTo verse OrderedParallelCalculatePrimesUpTo? I'm reaching here; but it's possible that it's deciding on different values on your machine that creates the unexpected times you are seeing. On my machine - I get eight threads, each time - but my times are NEARLY identical - whichever one is called first is slower because of the creation of those threads. But you aren't guaranteed a particular number of threads (you can set the maximum, but you can't force them to be used).

Mingo answered 6/6, 2012 at 10:23 Comment(3)
Regarding the first question, I get the same pattern in the results if I run them in any order or if I run them separately in different tests runs.Romola
Regarding the second question (hardware): AMD Phenom 9550 Quad-Core Process 2.20 Ghz, 6.00 GB RAM, 64 bit OS/cpu, NVIDIA gfx card (1GB VRAM), 7200 rpm HDD.Romola
The unordered test show a total of 14 threads in the window, but 2 seem to be from the test agent. The ordered one shows 16 threads, but again only 2 from the test agent. In both cases more of the threads seem unrelated to my program or an non-descriptive (to my eyes) names.Romola
L
1

Can you tell us what the CPU utilization is across the 4 different cores? It's possible that AsOrdered() is forcing more sequential calls to happen on the same core. With improved locality, silicon-level caching and branch prediction may be working in your favor.

Another possibility is that there's some optimization in the .NET framework for the case of monotonically increasing integers (int.Range) when using the AsOrdered() projection. I'm not sure how that would work, but it's possible.

An interesting test for comparison would be to generate a third set of numbers, in random order (obviously, you'd have to randomize them ahead of time and then work off three arrays). Then see if that has anything to do with it?

Leitao answered 6/6, 2012 at 2:58 Comment(4)
I'm getting pretty much the same thing for both, here: gillesleblanc.wordpress.com/?attachment_id=361 is a picture from Windows Task Manager. This a the four cores. The four rise at around 100% for both test. The first test being th first rise, and the second test the second rise.Romola
Okay - I know there are some fancy tools out there from Intel, AMD, and others that let you monitor the caching etc., but I'm not sure about details. Simpler option: I just noticed that the two tests are running in separate [TestMethod] methods. Can you combine them into a single [TestMethod] - maybe one method for each ordering - and compare the effect of ordering that way? Then you've ruled out some of the MSTest environmental factors. (In fact, go all the way and write this as a standalone console app. I think MSTest has a facility in it that checks for stalled/hung tests.)Leitao
I wrote a console app calling both methods 4 times the following order : unordered, ordered, ordered, unordered, unordered, ordered, ordered, unordered. Except for the first call which is slower all the next call fall close to my results in my unit tests.Romola
Well, at this point it's either a low-level optimization in the BCL (something along the lines of @RobP's suggestion) or something at the silicon level. I don't know how we'd be able to tell the difference - aside from reimplementing AsOrdered() yourself. :)Leitao

© 2022 - 2024 — McMap. All rights reserved.