Why does adding an extra field to struct greatly improves its performance?

Asked 3/6, 2017 at 14:6 Answered 30/6, 2018 at 19:32

I noticed that a struct wrapping a single float is significantly slower than using a float directly, with approximately half of the performance.

using System;
using System.Diagnostics;

struct Vector1 {

    public float X;

    public Vector1(float x) {
        X = x;
    }

    public static Vector1 operator +(Vector1 a, Vector1 b) {
        a.X = a.X + b.X;
        return a;
    }
}

However, upon adding an additional 'extra' field, some magic seems to happen and performance once again becomes more reasonable:

struct Vector1Magic {

    public float X;
    private bool magic;

    public Vector1Magic(float x) {
        X = x;
        magic = true;
    }

    public static Vector1Magic operator +(Vector1Magic a, Vector1Magic b) {
        a.X = a.X + b.X;
        return a;
    }
}

The code I used to benchmark these is as follows:

class Program {
    static void Main(string[] args) {
        int iterationCount = 1000000000;
        var sw = new Stopwatch();
        sw.Start();
        var total = 0.0f;
        for (int i = 0; i < iterationCount; i++) {
            var v = (float) i;
            total = total + v;
        }
        sw.Stop();
        Console.WriteLine("Float time was {0} for {1} iterations.", sw.Elapsed, iterationCount);
        Console.WriteLine("total = {0}", total);
        sw.Reset();
        sw.Start();
        var totalV = new Vector1(0.0f);
        for (int i = 0; i < iterationCount; i++) {
            var v = new Vector1(i);
            totalV += v;
        }
        sw.Stop();
        Console.WriteLine("Vector1 time was {0} for {1} iterations.", sw.Elapsed, iterationCount);
        Console.WriteLine("totalV = {0}", totalV);
        sw.Reset();
        sw.Start();
        var totalVm = new Vector1Magic(0.0f);
        for (int i = 0; i < iterationCount; i++) {
            var vm = new Vector1Magic(i);
            totalVm += vm;
        }
        sw.Stop();
        Console.WriteLine("Vector1Magic time was {0} for {1} iterations.", sw.Elapsed, iterationCount);
        Console.WriteLine("totalVm = {0}", totalVm);
        Console.Read();
    }
}

With the benchmark results:

Float time was 00:00:02.2444910 for 1000000000 iterations.
Vector1 time was 00:00:04.4490656 for 1000000000 iterations.
Vector1Magic time was 00:00:02.2262701 for 1000000000 iterations.

Compiler/environment settings: OS: Windows 10 64 bit Toolchain: VS2017 Framework: .Net 4.6.2 Target: Any CPU Prefer 32 bit

If 64 bit is set as the target, our results are more predictable, but significantly worse than what we see with Vector1Magic on the 32 bit target:

Float time was 00:00:00.6800014 for 1000000000 iterations.
Vector1 time was 00:00:04.4572642 for 1000000000 iterations.
Vector1Magic time was 00:00:05.7806399 for 1000000000 iterations.

For the real wizards, I've included a dump of the IL here: https://pastebin.com/sz2QLGEx

Further investigation indicates that this seems to be specific to the windows runtime, as the mono compiler produces the same IL.

On the mono runtime, both struct variants have roughly 2x slower performance compared to the raw float. This is quite a bit different to the performance we see on .Net.

What's going on here?

*Note this question originally included a flawed benchmark process (Thanks Max Payne for pointing this out), and has been updated to more accurately reflect the timings.

Seeseebeck answered 3/6, 2017 at 14:6 Comment(5)

Im guessing this is due to the structs packing now having better memory alignment. – Consignment 3/6, 2017 at 14:12

You should add a warmup iteration to exclude possible interference from JIT or other one-time processing. – Statuesque 3/6, 2017 at 14:36

If I switch to 64 bit, I get worse performance for your "magic" vector. – Tellurate 3/6, 2017 at 14:37

I previously included a warmup period - it made no significant difference to this test. – Seeseebeck 3/6, 2017 at 14:37

I wish someone could figure this out. It's so counterintuitive that Vector1Magic would be faster. – Imperialism 15/6, 2017 at 17:57

The jit has an optimization known as "struct promotion" where it can effectively replace a struct local or argument with multiple locals, one for each of the struct's fields.

Struct promotion of a single struct-wrapped float however is disabled. The reasons are a bit obscure, but roughly:

structs that simply wrap primitive types are treated as integer values of the struct size when being passed to or returned from calls
during promotion analysis the jit can't tell if the struct is ever passed to or returned from a call.
the code sequences needed at calls to reclassify an int as a float (and vice versa) are thought to be expensive at runtime.
hence the struct is not promoted and so access and operations on the float field are a bit slower.

So roughly speaking the jit is prioritizing reducing the costs at call sites over improving the costs at places where the field is used. And sometimes (as in your case above, where operation costs predominate) this is not the right call.

As you have seen, if you make the struct larger then the rules for passing and returning the struct change (it is now passed returned by reference) and this unblocks promotion.

In the CoreCLR sources you can see this logic at play in Compiler::lvaShouldPromoteStructVar.

Achaemenid answered 30/6, 2018 at 19:32 Comment(0)

This should not happen. This is obviously some sort of missalignment that forces the JIT to not work like it should.

struct Vector1 //Works fast in 32 Bit 
{
    public double X;
}

struct Vector1 //Works fast in 64 Bit and 32 Bit
{
    public double X;
    public double X2;
}

You also must call: Console.WriteLine(total); that increases the time exactly to Vector1Magic time which makes sense. The question still holds, why Vector1 is so slow.

Maybe structs are not optimized for sizeof(foo) < 64 Bit in 64 bit mode.

It seems that this was ansered 7 years ago: Why is 16 byte the recommended size for struct in C#?

Crusade answered 3/6, 2017 at 19:58 Comment(5)

"This should not happen. This is obviously some sort of missalignment that forces the JIT to not work like it should. " - This doesn't really answer the question. Why does this happen? What is the reasoning behind this? – Seeseebeck 3/6, 2017 at 23:59

Then lets upvote until somebody comes around who knows how .net works internally. The produced IL code is good so reading that wont point us to the solution. The problem is deeper, inside the JIT optimizer. This is a very interesting find, maybe you could post this on the MSDN .net developer-team forum ? – Crusade 4/6, 2017 at 0:55

Please insert a line that says Console.WriteLine(total); after your first loop. The JIT wont execute nodes where the result is not used afterwards. – Crusade 4/6, 2017 at 1:6

Oh ok i tested it and it decreased the loop speed of float to that of magic2 – Crusade 4/6, 2017 at 10:57

Double-checked - it did change the results. Updating the question with that. – Seeseebeck 4/6, 2017 at 16:4

CIL code is identical (practically). But x86 assembly code is not.

I think, It's some feature of JIT compiler optimization.

Compiler generates following assembly code for Vector1.

C# (Partially Assembly x86 in comments):

var totalV = new Vector1(0.0f);
/*
01300576  fldz  
01300578  fstp        dword ptr [ebp-14h] 
*/
for (int i = 0; i < iterationCount; i++)
{
   var v = new Vector1(i);
   /*
   0130057D  mov         dword ptr [ebp-4Ch],ecx ; ecx - is index "i"
   01300580  fild        dword ptr [ebp-4Ch]
   01300583  fstp        dword ptr [ebp-4Ch]  
   01300586  fld         dword ptr [ebp-4Ch]
   */
   totalV += v;
   /*
   01300589  lea         eax,[ebp-14h]  
   0130058C  mov         eax,dword ptr [eax]  
   0130058E  lea         edx,[ebp-18h]  
   01300591  mov         dword ptr [edx],eax  
   01300593  fadd        dword ptr [ebp-18h]  
   01300596  fstp        dword ptr [ebp-18h]  
   01300599  mov         eax,dword ptr [ebp-18h]  
   0130059C  mov         dword ptr [ebp-14h],eax  
   */
}

Compiler generates following assembly code for Vector1Magic.

C# (Partially Assembly x86 in comments):

var totalVm = new Vector1Magic(0.0f);
/*
01300657  mov         byte ptr [ebp-20h],1  ; here's assignment "magic=true"
0130065B  fldz  
0130065D  fstp        dword ptr [ebp-1Ch]
*/
for (int i = 0; i < iterationCount; i++)
{
    var vm = new Vector1Magic(i);
    /*
    01300662  mov         dword ptr [ebp-4Ch],edx ; edx - is index "i"
    01300665  fild        dword ptr [ebp-4Ch]  
    01300668  fstp        dword ptr [ebp-4Ch]  
    0130066B  fld         dword ptr [ebp-4Ch]  
    */
    totalVm += vm;
    /*
    0130066E  movzx       ecx,byte ptr [ebp-20h] ; here's some work with "unused" magic field
    01300672  fld         dword ptr [ebp-1Ch]  
    01300675  faddp       st(1),st  
    01300677  fstp        dword ptr [ebp-1Ch]  
    0130067A  mov         byte ptr [ebp-20h],cl  ; here's some work with "unused" magic field
    */
}

Apparently this asm blocks affect performance:

;Vector1
01300589  lea         eax,[ebp-14h]  
0130058C  mov         eax,dword ptr [eax]  
0130058E  lea         edx,[ebp-18h]  
01300591  mov         dword ptr [edx],eax  
01300593  fadd        dword ptr [ebp-18h]  
01300596  fstp        dword ptr [ebp-18h]  
01300599  mov         eax,dword ptr [ebp-18h]  
0130059C  mov         dword ptr [ebp-14h],eax  

;Vector1Magic
0130066E  movzx       ecx,byte ptr [ebp-20h] ; here's some work with "unused" magic field
01300672  fld         dword ptr [ebp-1Ch]  
01300675  faddp       st(1),st  
01300677  fstp        dword ptr [ebp-1Ch]  
0130067A  mov         byte ptr [ebp-20h],cl  ; here's some work with "unused" magic field

JIT compiler processes operations differently on structures with one field and with several fields. Probably it expects in Vector1Magic operations with all fields (and "unused" too).

Screwball answered 11/6, 2018 at 21:42 Comment(0)

Recommended topics

Hot tags