Why does adding local variables make .NET code slower?
Why does commenting out the first two lines of this for loop and uncommenting the third result in a 42% speedup?

int count = 0;
for (uint i = 0; i < 1000000000; ++i) {
    var isMultipleOf16 = i % 16 == 0;
    count += isMultipleOf16 ? 1 : 0;
    //count += i % 16 == 0 ? 1 : 0;
}

Behind the timing is vastly different assembly code: 13 vs. 7 instructions in the loop. The platform is Windows 7 running .NET 4.0 x64. Code optimization is enabled, and the test app was run outside VS2010. [Update: Repro project, useful for verifying project settings.]

Eliminating the intermediate boolean is a fundamental optimization, one of the simplest in my 1980s-era Dragon Book. How did the optimization not get applied when generating the CIL or when JITting the x64 machine code?

Is there a "Really compiler, I would like you to optimize this code, please" switch? While I sympathize with the sentiment that premature optimization is akin to the love of money, I could see the frustration in trying to profile a complex algorithm that had problems like this scattered throughout its routines. You'd work through the hotspots but have no hint of the broader warm region that could be vastly improved by hand tweaking what we normally take for granted from the compiler. I sure hope I'm missing something here.

Update: Speed differences also occur for x86, but depend on the order that methods are just-in-time compiled. See Why does JIT order affect performance?

Assembly code (as requested):

    var isMultipleOf16 = i % 16 == 0;
00000037  mov         eax,edx 
00000039  and         eax,0Fh 
0000003c  xor         ecx,ecx 
0000003e  test        eax,eax 
00000040  sete        cl 
    count += isMultipleOf16 ? 1 : 0;
00000043  movzx       eax,cl 
00000046  test        eax,eax 
00000048  jne         0000000000000050 
0000004a  xor         eax,eax 
0000004c  jmp         0000000000000055 
0000004e  xchg        ax,ax 
00000050  mov         eax,1 
00000055  lea         r8d,[rbx+rax] 
    count += i % 16 == 0 ? 1 : 0;
00000037  mov         eax,ecx 
00000039  and         eax,0Fh 
0000003c  je          0000000000000042 
0000003e  xor         eax,eax 
00000040  jmp         0000000000000047 
00000042  mov         eax,1 
00000047  lea         edx,[rbx+rax] 
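One thing both listings do share: the jitter strength-reduced the `% 16` into a bitwise AND (`and eax,0Fh`), which is valid because the loop variable is unsigned. A quick sanity check of that equivalence (this snippet is an editorial addition, not part of the original question):

```csharp
using System;

class ModAndCheck
{
    static void Main()
    {
        // For unsigned i and a power-of-two divisor, i % 16 is equivalent
        // to i & 15 -- the strength reduction visible as "and eax,0Fh"
        // in both listings above.
        for (uint i = 0; i < 100000; ++i)
        {
            if ((i % 16 == 0) != ((i & 15u) == 0))
                throw new Exception("mismatch at " + i);
        }
        Console.WriteLine("i % 16 == 0 matches (i & 15) == 0 for all tested values");
    }
}
```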
Seashore answered 29/4, 2012 at 3:17 Comment(19)
I'd be curious to see the different assembly code. Could you post it?Wagtail
have you tested bool isMultipleOf16 = ...?Amphiaster
@Amphiaster - that wouldn't make a difference - var is "compiler, please infer the type of this variable, and pretend I wrote that instead". In this case, it will have inferred bool for itself.Fanfaronade
how about static isMultipleOf16?Amphiaster
@EdwardBrey: Since you did this in Debug mode all bets are offNotification
The only way to get this code is by having the jitter optimizer disabled. Invalid test.Overbid
@Notification You sound rather confident. I was careful to ensure I was in Release mode and that optimizations remained enabled. To verify, I turned optimization off and verified that the timings got slower and the assembly code got larger. I changed the Project Configuration to "Debug" and turned on the "Suppress JIT optimization on module" setting. When I went back to unsuppressed release optimized, the code got smaller and faster again. If I am making an error, is there anything I should take note of to spot it?Seashore
@Hans I tried two separate computers with the same result. I also reimplemented on multiple projects, because I couldn't believe it was really true. Still, same result. I haven't done anything I know of to disable the JIT optimizers on my systems, although I don't know where to look for this. Can you repro?Seashore
Sure, I can get the same code you posted by turning on the "Suppress JIT optimization" debugger option. The optimized code is very different.Overbid
@Hans The 13 lines of assembly code above are with "Suppress JIT optimization" turned off. If I turn suppression on, it gets even worse and balloons to 21 lines of assembly code.Seashore
I cannot repro your result with the exact same OS and VS+.NET versions, no idea why you are getting unoptimized code.Overbid
@Hans Let's try to narrow this down. I put a repro on GitHub. What do you get for results?Seashore
@EdwardBrey: I can't find a source at the moment, but I believe the jitter and/or other optimizer settings are different if you have a debugger attached at all (that is, if you're running from Visual Studio, even if you compiled in "Release" mode). Try running your code from the command line (not from VS) and see what happens.Slogan
@Daniel I ran all my timing tests from the command line outside Visual Studio. I only ran within Visual Studio to get assembly code listings. As a double-check, just now I tried running the timing test with Visual Studio not even running. I still see the different timings.Seashore
@EdwardBrey: Running the code as x86 makes both run in the same time. Edit: But only on .NET 4. .NET 2 has a difference with both.Fiddlehead
@Fiddlehead Could you try the x86 configuration on the test project on github? (Github has a ZIP button to quickly grab a zip file of the project.) I get very different timings between the one- and two-statement tests, even for x86.Seashore
@EdwardBrey: That's what I did.Fiddlehead
@Fiddlehead If you turn off "Suppress JIT optimization" in the VS2010 options and put a breakpoint on a statement in the loop, do you see different assembly code for the different forms, specifically the addition of sete, movzx, and test instructions for the two-statement form?Seashore
When I dump the JIT compiled code for the two versions (via WinDbg), I do see differences, but I also see significantly more code than you list in your question. Remember there is not a straightforward mapping between the source lines and the JIT compiled code, so you should list all of it. Comparing just the body of the loop may not provide enough detail for a comparison.Doubletongue

It's a bug in the .NET Framework.

Well, really I'm just speculating, but I submitted a bug report on Microsoft Connect to see what they say. After Microsoft deleted that report, I resubmitted it on the roslyn project on GitHub.

Update: Microsoft has moved the issue to the coreclr project. From the comments on the issue, calling it a bug seems a bit strong; it's more of a missing optimization.

Seashore answered 30/4, 2012 at 11:39 Comment(3)
If I had a dollar for every time a programmer told me, "My code doesn't work. It must be a bug in the framework (or compiler or runtime library, etc.)," and later found out that it was a bug in his own code, I could retire.Prizefight
@Jim: I've seen it plenty of times myself. The best antidote I know is to isolate the behavior as fundamentally as possible and provide the vendor with a repro. And to keep a wait and see attitude. That's where we're at.Seashore
@TankorSmash Perhaps Microsoft deleted it as part of moving the code to GitHub. It's not on my Connect dashboard anymore. It seems that many issues I reported are gone. Some sort of notification would have been nice. I resubmitted the issue to the GitHub project and updated the answer accordingly.Seashore

The question should be "Why do I see such a difference on my machine?". I cannot reproduce such a huge speed difference and suspect there is something specific to your environment. It is very difficult to tell what that might be, though. It could be some (compiler) options you set some time ago and forgot about.

I created a console application, rebuilt it in Release mode (x86), and ran it outside VS. The results are virtually identical: 1.77 seconds for both methods. Here is the exact code:

static void Main(string[] args)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    int count = 0;

    for (uint i = 0; i < 1000000000; ++i)
    {
        // 1st method
        var isMultipleOf16 = i % 16 == 0;
        count += isMultipleOf16 ? 1 : 0;

        // 2nd method
        //count += i % 16 == 0 ? 1 : 0;
    }

    sw.Stop();
    Console.WriteLine("Elapsed {0}, count {1}", sw.Elapsed, count);
    Console.ReadKey();
}

Please, anyone who has 5 minutes: copy the code, rebuild, run outside VS, and post your results in a comment on this answer. I'd like to avoid saying "it works on my machine".

EDIT

To be sure, I created a 64-bit WinForms application, and the results are similar to those in the question - the first method is slower (1.57 sec) than the second one (1.05 sec). The difference I observe is 33% - still a lot. It seems there is a bug in the .NET 4 64-bit JIT compiler.

Lejeune answered 30/4, 2012 at 9:46 Comment(10)
First method: 1.8736291s, second method: 1.8566318s on my machine, rebuilt with Release (x86), ran outside VS, using the exact same code.Advection
You need to do something with count (such as include it in your WriteLine statement). Otherwise the optimizer does some selective optimization, which changes with the timing.Seashore
@EdwardBrey, I have changed my 64 bit tests and got the same results as you are now. Corrected my answer to reflect.Lejeune
@EdwardBrey, I can only reproduce it in a 64-bit app.Lejeune
I got 1.86s (1st) and 1.38s (2nd) in x64, while in x86 I got around 2.09s for both methods.Myrmeco
I updated the test project on github to include x64 and x86 configurations. After doing a batch rebuild of both configurations and running each .exe outside VS2010 on two computers (both Core i5), I see a very different timings for Test1 and Test2 on both x64 and x86. Maciej and Marcus, to help eliminate variables, could you each grab the project and try it? (Github has a ZIP button to quickly grab a zip file of the project.)Seashore
@EdwardBrey Running your test gives me slower execution on Multiline version on both platforms. But if I change the test so that it runs 4xMultiline and then 4xSingleline there is no speed difference on x86 (x64 is unaffected)Lejeune
@Lejeune +1 for a great observation. It looks like which method gets called first is what matters. This appears to be almost certainly due to JIT order. But why JIT order matters is perplexing. I updated the question body with a link to the new question that your observation prompted.Seashore
@EdwardBrey Can you upvote my answer instead of the comment as it will show "real" appreciation :-)Lejeune
@Lejeune Sorry, but I don't think your answer really answers the question. It helped us find a confounding issue, namely that for x86, alignment penalties can cause well-optimized code to run as slowly as less-than-fully optimized code. But the question still remains, why are there cases where less-than-fully optimized x64 (and x86) code is generated in the first place?Seashore

I can't speak to the .NET compiler, or its optimizations, or even WHEN it performs its optimizations.

But in this specific case, if the compiler folded that boolean variable into the actual statement and you were to try to debug this code, the optimized code would not match the code as written. You would not be able to single-step over the isMultipleOf16 assignment and check its value.

That's just one example of where the optimization may well be turned off. There could be others. The optimization may happen during the load phase of the code, rather than during the CLR's code generation phase.

The modern runtimes are pretty complicated, especially if you throw in JIT and dynamic optimization at run time. I feel grateful the code does what it says at all sometimes.

Andantino answered 29/4, 2012 at 4:20 Comment(3)
I wondered to myself when I saw the assembly code if somehow optimization was disabled. I got the assembly code by stopping at a breakpoint in the VS2010 debugger and using the Disassembly window (whereas the timings I got by running without a debugger). As a test, I turned on the Tools > Options > Debugging > General > "Suppress JIT optimization on module" setting. Sure enough, the assembly code got even larger.Seashore
In the native C++ world, it's perfectly normal for breakpoints and code order to be weird when optimizations are enabled. Likewise, variables like isMultipleOf16 are not always available in the debugger. That's why there is debug mode. At the end of the day, we're still running the same machine code, so I don't see why the CLR would make things any different. Indeed, when exceptions occur in C#, I sometimes get a message about a variable's value being "optimized away", even in debug mode.Seashore
+1 for the remark that debug settings can affect codegeneration.Irreplaceable

I think this is related to your other question. When I change your code as follows, the multi-line version wins.

Oops - only on x86. On x64, multi-line is the slowest and the conditional beats them both handily.

class Program
{
    static void Main()
    {
        ConditionalTest();
        SingleLineTest();
        MultiLineTest();
        ConditionalTest();
        SingleLineTest();
        MultiLineTest();
        ConditionalTest();
        SingleLineTest();
        MultiLineTest();
    }

    public static void ConditionalTest()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int count = 0;
        for (uint i = 0; i < 1000000000; ++i) {
            if (i % 16 == 0) ++count;
        }
        stopwatch.Stop();
        Console.WriteLine("Conditional test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
    }

    public static void SingleLineTest()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int count = 0;
        for (uint i = 0; i < 1000000000; ++i) {
            count += i % 16 == 0 ? 1 : 0;
        }
        stopwatch.Stop();
        Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
    }

    public static void MultiLineTest()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int count = 0;
        for (uint i = 0; i < 1000000000; ++i) {
            var isMultipleOf16 = i % 16 == 0;
            count += isMultipleOf16 ? 1 : 0;
        }
        stopwatch.Stop();
        Console.WriteLine("Multi-line test  --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
    }
}
Cyanate answered 2/5, 2012 at 4:1 Comment(1)
I updated the repro project to include an "if" test. I made single- and multi-line variants. On both x64 and x86, the single-line version is faster (when there is no alignment penalty). I also made variants that don't have any conditional code in the loop at all (just bit math). They're neck and neck on x86 (I haven't checked the assembly). On x64, the version with a local variable runs faster! It's still amazing that the local variable ever matters at all.Seashore
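The comment above mentions loop variants with no conditional code at all, just bit math. The exact form used in the repro project isn't shown in this thread, but one possible branch-free version looks like this (an editorial sketch, not the project's code):

```csharp
using System;

class BranchFreeSketch
{
    static void Main()
    {
        int count = 0;
        for (uint i = 0; i < 1000; ++i)
        {
            // (i & 15u) is in 0..15. Subtracting 1 wraps around to
            // 0xFFFFFFFF only when it was 0, so the logical right shift
            // by 31 yields 1 exactly for multiples of 16 and 0 otherwise
            // -- no branch, no intermediate bool.
            count += (int)(((i & 15u) - 1u) >> 31);
        }
        Console.WriteLine(count); // 63 multiples of 16 below 1000 (0, 16, ..., 992)
    }
}
```

Whether this beats the conditional forms depends on the JIT and the CPU's branch predictor (a perfectly predictable every-16th branch is cheap), so it is a variant to measure, not a guaranteed win.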

I tend to think of it like this: the people who work on the compiler can only do so much stuff per year. If in that time they could implement lambdas or lots of classical optimizations, I'd vote for lambdas. C# is a language that's efficient in terms of code reading and writing effort, rather than in terms of execution time.

So it's reasonable for the team to concentrate on features that maximize the reading/writing efficiency, rather than the execution efficiency in a certain corner case (of which there are probably thousands).

Initially, I believe, the idea was that the JITter would do all the optimization. Unfortunately the JITting takes noticeable amounts of time, and any advanced optimizations will make it worse. So that didn't work out as well as one might have hoped.

One thing I found about programming really fast code in C# is that quite often you hit a severe GC bottleneck before any optimization like you mention would make a difference. Like if you allocate millions of objects. C# leaves you very little in terms of avoiding the cost: you can use arrays of structs instead, but the resulting code is really ugly in comparison. My point being that many other decisions about C# and .NET make such specific optimizations less worthwhile than they would be in something like a C++ compiler. Heck, they even dropped the CPU-specific optimizations in NGEN, trading performance for programmer (debugger) efficiency.
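For the arrays-of-structs point, here is a minimal sketch of the trade-off (the `Point` type and sizes are hypothetical, purely for illustration):

```csharp
using System;

class GcPressureSketch
{
    // A value type: an array of these is one contiguous allocation,
    // so the GC tracks a single object instead of a million small ones.
    struct Point
    {
        public double X, Y;
    }

    static void Main()
    {
        // Compare with declaring Point as a class, which would require
        // a separate heap allocation per element.
        var points = new Point[1000000];
        for (int i = 0; i < points.Length; i++)
        {
            points[i].X = i;
            points[i].Y = 2.0 * i;
        }
        Console.WriteLine(points[10].Y); // 20
    }
}
```

As the answer notes, the code gets uglier once the elements need methods or shared references, but in hot numeric loops the allocation and layout difference usually dwarfs micro-optimizations like the one in the question.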

Having said all this, I'd love C# that actually made use of optimizations that C++ made use of since the 1990s. Just not at the expense of features like, say, async/await.

Rozina answered 6/5, 2012 at 19:23 Comment(1)
I'd be extremely wary of reading too much into articles from 2005 and .net 1.1! Quite a lot has changed in the last 7 years.Atheistic

© 2022 - 2024 — McMap. All rights reserved.