Slow execution under 64 bits. Possible RyuJIT bug?

Asked 18/11, 2015 at 7:49 Answered 18/11, 2015 at 9:12

Solved c# performance visual-studio-2015 clr ryujit

I am trying to benchmark the following C# code in Release mode:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication54
{
    class Program
    {
        static void Main(string[] args)
        {
            int counter = 0;
            var sw = new Stopwatch();
            unchecked
            {
                int sum = 0;
                while (true)
                {
                    try
                    {
                        if (counter > 20)
                            throw new Exception("exception");
                    }
                    catch
                    {
                    }

                    sw.Restart();
                    for (int i = 0; i < int.MaxValue; i++)
                    {
                        sum += i;
                    }
                    counter++;
                    Console.WriteLine(sw.Elapsed);
                }
            }
        }
    }
}

I am on a 64-bit machine with VS 2015 installed. When I run the code as 32-bit, each iteration takes around 0.6 seconds, as printed to the console. When I run it as 64-bit, the duration of each iteration jumps to 4 seconds! I tried the sample code on my colleague's computer, which only has VS 2013 installed. There, both the 32-bit and 64-bit versions take around 0.6 seconds per iteration.

In addition, if we just remove the try/catch block, it also runs in 0.6 seconds with VS 2015 in 64-bit.

This looks like a serious RyuJIT regression when there is a try/catch block. Am I correct?

Cobalt answered 18/11, 2015 at 7:49 Comment(10)

Your computer is a genius! For me, each iteration takes around 10 seconds :( and there is no difference here: 32-bit and 64-bit give pretty much the same results. – Limitative 18/11, 2015 at 7:59

@M.kazem. I don't think this is possible. My computer is a Surface Pro 3 i7 with a U-series CPU. It is definitely not a powerhouse. Are you sure you ran it in Release mode with Start Without Debugging? BTW, I have tried 4 different computers so far. – Cobalt 18/11, 2015 at 8:0

Oh, no, I got it: it's because you enabled the Optimize code option in the solution properties. Now I get 1.3s for 32-bit and 3.9s for 64-bit. – Limitative 18/11, 2015 at 8:1

@M.kazemAkhgary and now try removing the try catch block :) – Cobalt 18/11, 2015 at 8:5

Yeah, that's right. Going to check the compiled code. – Limitative 18/11, 2015 at 8:8

Just out of curiosity, what happens if you change your ints to longs? Does the 64 bit version run quicker? – Pyonephritis 18/11, 2015 at 8:15

@DarrenGourley No improvements with long either. Similar timings. (though 32 bit code got a bit slower) – Cobalt 18/11, 2015 at 8:19

It may be a regression, and it may be annoying, but I'd usually reserve the word bug, in this context, for cases where it generates incorrect code. As with all things, there are trade-offs, and its inability in some circumstances to identify a particular optimization within the tight time constraints it works under is something you sometimes have to live with. – Nanette 18/11, 2015 at 8:19

Very brittle benchmark because the code does nothing. Ideally it would be deleted entirely by the JIT. Benchmark something more real. – Evacuate 18/11, 2015 at 11:13

@Evacuate If you are curious, the real reason I wrote the above code was to measure the performance impact of a tool I use called DebugDiag. I was attaching DebugDiag to the above code, trying to get dumps, and seeing how that would affect the overall performance of the code. So I wasn't really trying to measure anything about the code itself. – Cobalt 18/11, 2015 at 12:50

Benchmarking is a fine art. Make a small modification to your code:

   Console.WriteLine("{0}", sw.Elapsed, sum);

And you'll now see the difference disappear. Or to put it another way, the x86 version is now just as slow as the x64 code. From this minor change you can probably figure out what RyuJIT doesn't do that the legacy jitter did: it doesn't eliminate the unnecessary

   sum += i;

You can see this when you look at the generated machine code with Debug > Windows > Disassembly. It is indeed a quirk in RyuJIT: its dead-code elimination isn't as thorough as the legacy jitter's. That is not entirely without reason, though. Microsoft rewrote the x64 jitter because of bugs that it could not easily fix. One of them was a fairly nasty issue with the optimizer: it had no upper bound on the amount of time it spent optimizing a method. That caused rather poor behavior on methods with very large bodies; it could be out in the woods for dozens of milliseconds and cause noticeable execution pauses.

Calling it a bug, meh, not really. Write sane code and the jitter won't disappoint you. Optimization does forever start at the usual place, between the programmer's ears.
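To make the point about dead code concrete, here is a minimal sketch of a timing loop whose result is always consumed, so neither jitter is free to drop the work. The class name LoopBenchmark, the helper SumTo and the run count of 5 are illustrative choices, not anything from the question, and the timings it prints will depend on your machine and jitter.

```csharp
using System;
using System.Diagnostics;

static class LoopBenchmark
{
    // Hypothetical helper: sums 0..count-1 with wrapping int arithmetic
    // and returns the result so the caller can consume it.
    public static int SumTo(int count)
    {
        int sum = 0;
        unchecked
        {
            for (int i = 0; i < count; i++)
            {
                sum += i;
            }
        }
        return sum;
    }

    static void Main()
    {
        // Warm-up call so JIT compilation is not part of the first timing.
        SumTo(1000);

        for (int run = 0; run < 5; run++)
        {
            var sw = Stopwatch.StartNew();
            int result = SumTo(int.MaxValue);
            sw.Stop();

            // Printing the result keeps the loop's work observable.
            Console.WriteLine("{0}  sum={1}", sw.Elapsed, result);
        }
    }
}
```

Build it in Release mode and run it without the debugger attached, once as x86 and once as x64. Because the sum is printed, both builds have to do the additions, which should make the comparison fairer than the original loop.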

Piapiacenza answered 18/11, 2015 at 8:22 Comment(0)

After a bit of testing I've got some interesting results. My testing revolved around the try/catch block. As the OP pointed out, if you remove this block, the 64-bit execution time matches the 32-bit one. I've narrowed this down a bit further and have concluded that it's because of the counter variable in the if statement inside the try block.

Let's remove the redundant throw:

                try
                {
                    if (counter == 0) { }
                }
                catch
                {
                }

You will get the same results with this code as you did with the original code.

Let's replace counter with a constant int value:

                try
                {
                    if (1 == 0) { }
                }
                catch
                {
                }

With this code, the 64-bit version's execution time drops from 4 seconds to about 1.7 seconds. That is still double that of the 32-bit version, but I thought it was interesting. Unfortunately, after a quick Google search I haven't come up with a reason, but I'll dig a bit more and update this answer if I find out why this is happening.
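One experiment that might help narrow this down is to time the loop both next to the try/catch and in a method of its own. The sketch below is only that, a sketch: the class TryCatchExperiment and its method names are made up for illustration, the run count of 3 is arbitrary, and whether the two variants actually differ on a given jitter is something to measure rather than assume.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

static class TryCatchExperiment
{
    // Variant 1: the try/catch and the hot loop share one method.
    public static int LoopNextToTryCatch(int counter, int count)
    {
        try
        {
            if (counter > 20)
                throw new Exception("exception");
        }
        catch
        {
        }

        int sum = 0;
        unchecked
        {
            for (int i = 0; i < count; i++)
                sum += i;
        }
        return sum;
    }

    // Variant 2: same work, but the loop lives in a separate method
    // that contains no exception handling.
    public static int LoopInOwnMethod(int counter, int count)
    {
        try
        {
            if (counter > 20)
                throw new Exception("exception");
        }
        catch
        {
        }
        return SumTo(count);
    }

    // NoInlining keeps the loop out of the caller's method body.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int SumTo(int count)
    {
        int sum = 0;
        unchecked
        {
            for (int i = 0; i < count; i++)
                sum += i;
        }
        return sum;
    }

    static void Main()
    {
        for (int counter = 0; counter < 3; counter++)
        {
            var sw = Stopwatch.StartNew();
            int a = LoopNextToTryCatch(counter, int.MaxValue);
            TimeSpan together = sw.Elapsed;

            sw.Restart();
            int b = LoopInOwnMethod(counter, int.MaxValue);
            TimeSpan isolated = sw.Elapsed;

            // Both sums are printed so the loops cannot be treated as dead code.
            Console.WriteLine("same method: {0} ({1})  own method: {2} ({3})",
                together, a, isolated, b);
        }
    }
}
```

Both variants return their sum and the caller prints it, so any difference in the timings should come from how the method containing the try/catch is compiled rather than from the loop being removed.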

As for the remaining second that we would like to shave off the 64-bit version, I can see that this is down to incrementing sum by i in your for loop. Let's change this so that sum does not exceed its bounds:

            for (int i = 0; i < int.MaxValue; i++)
            {
                sum++;
            }

This change (along with the change in the try block) reduces the execution time of the 64-bit app to 0.7 seconds. My guess is that the 1-second difference comes from the artificial way the 64-bit version has to handle an int, which is naturally 32 bits wide.

In the 32-bit version, 32 bits are allocated to the Int32 (sum). When sum goes above its bounds, this is easy to determine.

In the 64-bit version, 64 bits are allocated to the Int32 (sum). When sum goes above its bounds, there needs to be a mechanism to detect this, which could lead to the slowdown. Perhaps even the operation of adding sum and i takes longer because of the extra redundant bits.

I am theorising here; so don't take this as gospel. I just thought I would post my findings. I'm sure someone else will be able to shed some light on the problem that I've found.
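One part of this theory can be checked directly. In C# an int is 32 bits wide regardless of the process bitness, and in an unchecked context overflow simply wraps around. The sketch below (IntWidthCheck and its two helpers are hypothetical names picked for illustration) compares a wrapping 32-bit sum against the same sum computed in 64 bits and truncated to its low 32 bits.

```csharp
using System;

static class IntWidthCheck
{
    // Wrapping 32-bit sum of 0..count-1.
    public static int WrappedSum(int count)
    {
        int sum = 0;
        unchecked
        {
            for (int i = 0; i < count; i++)
                sum += i;
        }
        return sum;
    }

    // The same sum computed in 64 bits, then truncated to the low 32 bits.
    public static int TruncatedLongSum(int count)
    {
        long sum = 0;
        for (int i = 0; i < count; i++)
            sum += i;
        return unchecked((int)sum);
    }

    static void Main()
    {
        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);
        Console.WriteLine("sizeof(int) = {0}", sizeof(int));               // 4 in both modes
        Console.WriteLine(unchecked(int.MaxValue + 1) == int.MinValue);     // True
        Console.WriteLine(WrappedSum(100000) == TruncatedLongSum(100000));  // True
    }
}
```

The last three lines print the same thing as x86 and as x64, which suggests that overflow handling on its own is unlikely to explain the difference in timings.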

Update

@HansPassant's answer pointed out that the sum += i; line may be eliminated because it is deemed unnecessary, which makes perfect sense: sum is not used outside of the for loop. After he used the value of sum outside of the for loop, we noticed that the x86 version was just as slow as the x64 version. So I decided to do a bit of testing. Let's change the for loop and the printing to the following:

                int x = 0;
                for (int i = 0; i < int.MaxValue; i++)
                {
                    sum += i;
                    x = sum;
                }
                counter++;
                Console.WriteLine(sw.Elapsed + "  " +  x);

You can see that I've introduced a new int x, which is assigned the value of sum inside the for loop. The value of x is now written out to the console, while sum itself doesn't leave the for loop. This, believe it or not, actually reduces the execution time for x64 to 0.7 seconds. However, the x86 version jumps up to 1.4 seconds.

Pyonephritis answered 18/11, 2015 at 9:12 Comment(3)

It doesn't matter if sum goes above its bounds - it's running in an unchecked context, so overflow is ignored - if the code is running at all. As Hans points out, since the sum variable isn't read from after this loop, it's possible (x86, older x64 JIT) to completely eliminate the loop as an optimization. – Nanette 18/11, 2015 at 9:19

@Nanette You've actually raised a very interesting point.. Let me update my answer with my findings. – Pyonephritis 18/11, 2015 at 9:31

@Nanette I don't think the fact that this is unchecked actually matters. That just means it's definitely not going to throw an OverflowException when sum reaches its bounds (unlike if you changed it to checked). Anyway, it still needs to perform an artificial check of the bounds in 64-bit because the int has 64 bits allocated to it. In theory, I guess. Again, I'm not an expert in these matters; this just interested me. – Pyonephritis 18/11, 2015 at 9:54
