Code sample that shows casting to uint is more efficient than range check
Asked Answered
D

3

7

So I am looking at this question and the general consensus is that uint cast version is more efficient than range check with 0. Since the code is also in MS's implementation of List I assume it is a real optimization. However I have failed to produce a code sample that results in better performance for the uint version. I have tried different tests and there is something missing or some other part of my code is dwarfing the time for the checks. My last attempt looks like this:

class TestType
{
    public TestType(int size)
    {
        MaxSize = size;
        Random rand = new Random(100);
        for (int i = 0; i < MaxIterations; i++)
        {
            indexes[i] = rand.Next(0, MaxSize);
        }
    }

    public const int MaxIterations = 10000000;
    private int MaxSize;
    private int[] indexes = new int[MaxIterations];

    public void Test()
    {
        var timer = new Stopwatch();

        int inRange = 0;
        int outOfRange = 0;

        timer.Start();
        for (int i = 0; i < MaxIterations; i++)
        {
            int x = indexes[i];
            if (x < 0 || x > MaxSize)
            {
                throw new Exception();

            }

            inRange += indexes[x];
        }
        timer.Stop();

        Console.WriteLine("Comparision 1: " + inRange + "/" + outOfRange + ", elapsed: " + timer.ElapsedMilliseconds + "ms");

        inRange = 0;
        outOfRange = 0;

        timer.Reset();
        timer.Start();

        for (int i = 0; i < MaxIterations; i++)
        {
            int x = indexes[i];
            if ((uint)x > (uint)MaxSize)
            {
                throw new Exception();
            }

            inRange += indexes[x];
        }

        timer.Stop();

        Console.WriteLine("Comparision 2: " + inRange + "/" + outOfRange + ", elapsed: " + timer.ElapsedMilliseconds + "ms");

    }
}

class Program
{
    static void Main()
    {
        TestType t = new TestType(TestType.MaxIterations);
        t.Test();
        TestType t2 = new TestType(TestType.MaxIterations);
        t2.Test();
        TestType t3 = new TestType(TestType.MaxIterations);
        t3.Test();
    }
}

The code is a bit of a mess because I tried many things to make uint check perform faster like moving the compared variable into a field of a class, generating random index access and so on but in every case the result seems to be the same for both versions. So is this change applicable on modern x86 processors and can someone demonstrate it somehow?

Note that I am not asking for someone to fix my sample or explain what is wrong with it. I just want to see the case where the optimization does work.

Delafuente answered 31/3, 2015 at 21:23 Comment(0)
R
3

I would suggest attempting code which does not throw an exception when the index is out of range. Exceptions are incredibly expensive and can completely throw off your bench results.

The code below does a timed-average bench for 1,000 iterations of 1,000,000 results.

using System;
using System.Diagnostics;

namespace BenchTest
{
    class Program
    {
        const int LoopCount = 1000000;
        const int AverageCount = 1000;

        static void Main(string[] args)
        {
            Console.WriteLine("Starting Benchmark");
            RunTest();
            Console.WriteLine("Finished Benchmark");

            Console.Write("Press any key to exit...");
            Console.ReadKey();
        }

        static void RunTest()
        {
            int cursorRow = Console.CursorTop; int cursorCol = Console.CursorLeft;

            long totalTime1 = 0; long totalTime2 = 0;
            long invalidOperationCount1 = 0; long invalidOperationCount2 = 0;

            for (int i = 0; i < AverageCount; i++)
            {
                Console.SetCursorPosition(cursorCol, cursorRow);
                Console.WriteLine("Running iteration: {0}/{1}", i + 1, AverageCount);

                int[] indexArgs = RandomFill(LoopCount, int.MinValue, int.MaxValue);
                int[] sizeArgs = RandomFill(LoopCount, 0, int.MaxValue);

                totalTime1 += RunLoop(TestMethod1, indexArgs, sizeArgs, ref invalidOperationCount1);
                totalTime2 += RunLoop(TestMethod2, indexArgs, sizeArgs, ref invalidOperationCount2);
            }

            PrintResult("Test 1", TimeSpan.FromTicks(totalTime1 / AverageCount), invalidOperationCount1);
            PrintResult("Test 2", TimeSpan.FromTicks(totalTime2 / AverageCount), invalidOperationCount2);
        }

        static void PrintResult(string testName, TimeSpan averageTime, long invalidOperationCount)
        {
            Console.WriteLine(testName);
            Console.WriteLine("    Average Time: {0}", averageTime);
            Console.WriteLine("    Invalid Operations: {0} ({1})", invalidOperationCount, (invalidOperationCount / (double)(AverageCount * LoopCount)).ToString("P3"));
        }

        static long RunLoop(Func<int, int, int> testMethod, int[] indexArgs, int[] sizeArgs, ref long invalidOperationCount)
        {
            Stopwatch sw = new Stopwatch();

            Console.Write("Running {0} sub-iterations", LoopCount);
            sw.Start();
            long startTickCount = sw.ElapsedTicks;

            for (int i = 0; i < LoopCount; i++)
            {
                invalidOperationCount += testMethod(indexArgs[i], sizeArgs[i]);
            }

            sw.Stop();
            long stopTickCount = sw.ElapsedTicks;

            long elapsedTickCount = stopTickCount - startTickCount;
            Console.WriteLine(" - Time Taken: {0}", new TimeSpan(elapsedTickCount));
            return elapsedTickCount;
        }

        static int[] RandomFill(int size, int minValue, int maxValue)
        {
            int[] randomArray = new int[size];

            Random rng = new Random();

            for (int i = 0; i < size; i++)
            {
                randomArray[i] = rng.Next(minValue, maxValue);
            }

            return randomArray;
        }

        static int TestMethod1(int index, int size)
        {
            return (index < 0 || index >= size) ? 1 : 0;
        }

        static int TestMethod2(int index, int size)
        {
            return ((uint)(index) >= (uint)(size)) ? 1 : 0;
        }
    }
}
Ramtil answered 31/3, 2015 at 22:44 Comment(7)
Wouldn't it be better to duplicate runloop and inline the test case: gist.github.com/bbarry/8d043d04a4c170a64b55Waggoner
In my code the exception branch is never executed so it should not affect performance. It seems however that there is some effect from inlining as @Jon Hanna suggestsDelafuente
Oh I just noticed that your method is used in a delegate so it can't be inlining here. Or can it?Delafuente
The methods should be small enough that if not delegates they would in fact both be inlined, so this does a good job of showing just the effect of dropping one branch. Pushing the case to force inlining to be a difference between the two would certainly be interesting. My one criticism is that I would suggest seeding the random numbers so that both get the same pseudo-random sequence, just to be sure that one wasn't luckier in that regard than the other.Jolyn
@JonHanna, Are you saying that each iteration should use the same pseudo-random sequence? Each test method is already using the same index and size sequences, which are currently generated per iteration.Ramtil
@Stilgar, I used delegates specifically to prevent inlining to keep the results as 'clean' as possible. I think excluding exceptions in the code here, even if they will never be hit is a more fair comparison of the two comparison statements. If your code never executes the branch statement, then their will always be two comparisons per method call where-as the other method will always be one. In my code sample, I tried to give a fair comparison between the two methods whether or not the branch will be hit and without the overhead that exceptions bring.Ramtil
Yes, same random sequence. The odds that it makes any difference is pretty slight, but if you've the same difference then you know for sure that it doesn't.Jolyn
R
14
       if (x < 0 || x > MaxSize)

The comparison is performed by the CMP processor instruction (Compare). You'll want to take a look at Agner Fog's instruction tables document (PDF), it list the cost of instructions. Find your processor back in the list, then locate the CMP instruction.

For mine, Haswell, CMP takes 1 cycle of latency and 0.25 cycles of throughput.

A fractional cost like that could use an explanation, Haswell has 4 integer execution units that can execute instructions at the same time. When a program contains enough integer operations, like CMP, without an interdependency then they can all execute at the same time. In effect making the program 4 times faster. You don't always manage to keep all 4 of them busy at the same time with your code, it is actually pretty rare. But you do keep 2 of them busy in this case. Or in other words, two comparisons take just as long as single one, 1 cycle.

There are other factors at play that make the execution time identical. One thing helps is that the processor can predict the branch very well, it can speculatively execute x > MaxSize in spite of the short-circuit evaluation. And it will in fact end up using the result since the branch is never taken.

And the true bottleneck in this code is the array indexing, accessing memory is one of the slowest thing the processor can do. So the "fast" version of the code isn't faster even though it provides more opportunity to allow the processor to concurrently execute instructions. It isn't much of an opportunity today anyway, a processor has too many execution units to keep busy. Otherwise the feature that makes HyperThreading work. In both cases the processor bogs down at the same rate.

On my machine, I have to write code that occupies more than 4 engines to make it slower. Silly code like this:

     if (x < 0 || x > MaxSize || x > 10000000 || x > 20000000 || x > 3000000) {
         outOfRange++; 
     }
     else {
         inRange++;
     }

Using 5 compares, now I can a difference, 61 vs 47 msec. Or in other words, this is a way to count the number of integer engines in the processor. Hehe :)

So this is a micro-optimization that probably used to pay off a decade ago. It doesn't anymore. Scratch it off your list of things to worry about :)

Rafa answered 31/3, 2015 at 23:22 Comment(3)
The memory access is probably the the reason anything else disappears in the noise. Each access should be a cache miss here.; The overlapping integer unit execution point is a good one. It might not hold in sequentially dependent code, though. Here, all loop iterations are independent and therefore overlap perfectly.Cuprite
I tested for that as well, changing the indexing expression to [i & 1023] so it always hits the L1 cache. Didn't affect the outcome. Still, an L1 access has 3 cycles of latency so probably enough.Rafa
I'm gonna test this on an arduino where there's no branch prediction and different execution unitsVoltameter
R
3

I would suggest attempting code which does not throw an exception when the index is out of range. Exceptions are incredibly expensive and can completely throw off your bench results.

The code below does a timed-average bench for 1,000 iterations of 1,000,000 results.

using System;
using System.Diagnostics;

namespace BenchTest
{
    class Program
    {
        const int LoopCount = 1000000;
        const int AverageCount = 1000;

        static void Main(string[] args)
        {
            Console.WriteLine("Starting Benchmark");
            RunTest();
            Console.WriteLine("Finished Benchmark");

            Console.Write("Press any key to exit...");
            Console.ReadKey();
        }

        static void RunTest()
        {
            int cursorRow = Console.CursorTop; int cursorCol = Console.CursorLeft;

            long totalTime1 = 0; long totalTime2 = 0;
            long invalidOperationCount1 = 0; long invalidOperationCount2 = 0;

            for (int i = 0; i < AverageCount; i++)
            {
                Console.SetCursorPosition(cursorCol, cursorRow);
                Console.WriteLine("Running iteration: {0}/{1}", i + 1, AverageCount);

                int[] indexArgs = RandomFill(LoopCount, int.MinValue, int.MaxValue);
                int[] sizeArgs = RandomFill(LoopCount, 0, int.MaxValue);

                totalTime1 += RunLoop(TestMethod1, indexArgs, sizeArgs, ref invalidOperationCount1);
                totalTime2 += RunLoop(TestMethod2, indexArgs, sizeArgs, ref invalidOperationCount2);
            }

            PrintResult("Test 1", TimeSpan.FromTicks(totalTime1 / AverageCount), invalidOperationCount1);
            PrintResult("Test 2", TimeSpan.FromTicks(totalTime2 / AverageCount), invalidOperationCount2);
        }

        static void PrintResult(string testName, TimeSpan averageTime, long invalidOperationCount)
        {
            Console.WriteLine(testName);
            Console.WriteLine("    Average Time: {0}", averageTime);
            Console.WriteLine("    Invalid Operations: {0} ({1})", invalidOperationCount, (invalidOperationCount / (double)(AverageCount * LoopCount)).ToString("P3"));
        }

        static long RunLoop(Func<int, int, int> testMethod, int[] indexArgs, int[] sizeArgs, ref long invalidOperationCount)
        {
            Stopwatch sw = new Stopwatch();

            Console.Write("Running {0} sub-iterations", LoopCount);
            sw.Start();
            long startTickCount = sw.ElapsedTicks;

            for (int i = 0; i < LoopCount; i++)
            {
                invalidOperationCount += testMethod(indexArgs[i], sizeArgs[i]);
            }

            sw.Stop();
            long stopTickCount = sw.ElapsedTicks;

            long elapsedTickCount = stopTickCount - startTickCount;
            Console.WriteLine(" - Time Taken: {0}", new TimeSpan(elapsedTickCount));
            return elapsedTickCount;
        }

        static int[] RandomFill(int size, int minValue, int maxValue)
        {
            int[] randomArray = new int[size];

            Random rng = new Random();

            for (int i = 0; i < size; i++)
            {
                randomArray[i] = rng.Next(minValue, maxValue);
            }

            return randomArray;
        }

        static int TestMethod1(int index, int size)
        {
            return (index < 0 || index >= size) ? 1 : 0;
        }

        static int TestMethod2(int index, int size)
        {
            return ((uint)(index) >= (uint)(size)) ? 1 : 0;
        }
    }
}
Ramtil answered 31/3, 2015 at 22:44 Comment(7)
Wouldn't it be better to duplicate runloop and inline the test case: gist.github.com/bbarry/8d043d04a4c170a64b55Waggoner
In my code the exception branch is never executed so it should not affect performance. It seems however that there is some effect from inlining as @Jon Hanna suggestsDelafuente
Oh I just noticed that your method is used in a delegate so it can't be inlining here. Or can it?Delafuente
The methods should be small enough that if not delegates they would in fact both be inlined, so this does a good job of showing just the effect of dropping one branch. Pushing the case to force inlining to be a difference between the two would certainly be interesting. My one criticism is that I would suggest seeding the random numbers so that both get the same pseudo-random sequence, just to be sure that one wasn't luckier in that regard than the other.Jolyn
@JonHanna, Are you saying that each iteration should use the same pseudo-random sequence? Each test method is already using the same index and size sequences, which are currently generated per iteration.Ramtil
@Stilgar, I used delegates specifically to prevent inlining to keep the results as 'clean' as possible. I think excluding exceptions in the code here, even if they will never be hit is a more fair comparison of the two comparison statements. If your code never executes the branch statement, then their will always be two comparisons per method call where-as the other method will always be one. In my code sample, I tried to give a fair comparison between the two methods whether or not the branch will be hit and without the overhead that exceptions bring.Ramtil
Yes, same random sequence. The odds that it makes any difference is pretty slight, but if you've the same difference then you know for sure that it doesn't.Jolyn
J
3

You aren't comparing like with like.

The code you were talking about not only saved one branch by using the optimisation, but also 4 bytes of CIL in a small method.

In a small method 4 bytes can be the difference in being inlined and not being inlined.

And if the method calling that method is also written to be small, then that can mean two (or more) method calls are jitted as one piece of inline code.

And maybe some of it is then, because it is inline and available for analysis by the jitter, optimised further again.

The real difference is not between index < 0 || index >= _size and (uint)index >= (uint)_size, but between code that has repeated efforts to minimise the method body size and code that does not. Look for example at how another method is used to throw the exception if necessary, further shaving off a couple of bytes of CIL.

(And no, that's not to say that I think all methods should be written like that, but there certainly can be performance differences when one does).

Jolyn answered 1/4, 2015 at 0:31 Comment(1)
You have a point! We initially started discussing this with Stilgar as optimisation vs readability and then I wrote a simple test and noticed, that surprisingly first comparison works faster in debug mode (and on dotnetfiddle) if there are more out-of-range values, while using the cast performs better if most values are in-range. Also as someone pointed out 2nd always works faster in release mode.Maddis

© 2022 - 2024 — McMap. All rights reserved.