Expensive to wrap System.Numerics.VectorX - why?

TL;DR: Why is wrapping the System.Numerics.Vectors type expensive, and is there anything I can do about it?

Consider the following piece of code:

[MethodImpl(MethodImplOptions.NoInlining)]
private static long GetIt(long a, long b)
{
    var x = AddThem(a, b);
    return x;
}

private static long AddThem(long a, long b)
{
    return a + b;
}

This will JIT into (x64):

00007FFDA3F94500  lea         rax,[rcx+rdx]  
00007FFDA3F94504  ret  

and x86:

00EB2E20  push        ebp  
00EB2E21  mov         ebp,esp  
00EB2E23  mov         eax,dword ptr [ebp+10h]  
00EB2E26  mov         edx,dword ptr [ebp+14h]  
00EB2E29  add         eax,dword ptr [ebp+8]  
00EB2E2C  adc         edx,dword ptr [ebp+0Ch]  
00EB2E2F  pop         ebp  
00EB2E30  ret         10h  

Now, if I wrap this in a struct, e.g.

public struct SomeWrapper
{
    public long X;
    public SomeWrapper(long X) { this.X = X; }
    public static SomeWrapper operator +(SomeWrapper a, SomeWrapper b)
    {
        return new SomeWrapper(a.X + b.X);
    }
}

and change GetIt, e.g.

private static long GetIt(long a, long b)
{
    var x = AddThem(new SomeWrapper(a), new SomeWrapper(b)).X;
    return x;
}
private static SomeWrapper AddThem(SomeWrapper a, SomeWrapper b)
{
    return a + b;
}

the JITted result is still exactly the same as when using the native types directly (AddThem, the overloaded SomeWrapper operator, and the constructor are all inlined), as expected.

Now, if I try this with the SIMD-enabled types, e.g. System.Numerics.Vector4:

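[MethodImpl(MethodImplOptions.NoInlining)]
private static Vector4 GetIt(Vector4 a, Vector4 b)
{
    var x = AddThem(a, b);
    return x;
}

// Assumed: Vector4 overload of AddThem, analogous to the long version above.
private static Vector4 AddThem(Vector4 a, Vector4 b)
{
    return a + b;
}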

it is JITted into:

00007FFDA3F94640  vmovupd     xmm0,xmmword ptr [rdx]  
00007FFDA3F94645  vmovupd     xmm1,xmmword ptr [r8]  
00007FFDA3F9464A  vaddps      xmm0,xmm0,xmm1  
00007FFDA3F9464F  vmovupd     xmmword ptr [rcx],xmm0  
00007FFDA3F94654  ret  

However, if I wrap the Vector4 in a struct (similar to the first example):

public struct SomeWrapper
{
    public Vector4 X;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public SomeWrapper(Vector4 X) { this.X = X; }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static SomeWrapper operator+(SomeWrapper a, SomeWrapper b)
    {
        return new SomeWrapper(a.X + b.X);
    }
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static Vector4 GetIt(Vector4 a, Vector4 b)
{
    var x = AddThem(new SomeWrapper(a), new SomeWrapper(b)).X;
    return x;
}

my code is now JITted into a whole lot more:

00007FFDA3F84A02  sub         rsp,0B8h  
00007FFDA3F84A09  mov         rsi,rcx  
00007FFDA3F84A0C  lea         rdi,[rsp+10h]  
00007FFDA3F84A11  mov         ecx,1Ch  
00007FFDA3F84A16  xor         eax,eax  
00007FFDA3F84A18  rep stos    dword ptr [rdi]  
00007FFDA3F84A1A  mov         rcx,rsi  
00007FFDA3F84A1D  vmovupd     xmm0,xmmword ptr [rdx]  
00007FFDA3F84A22  vmovupd     xmmword ptr [rsp+60h],xmm0  
00007FFDA3F84A29  vmovupd     xmm0,xmmword ptr [rsp+60h]  
00007FFDA3F84A30  lea         rax,[rsp+90h]  
00007FFDA3F84A38  vmovupd     xmmword ptr [rax],xmm0  
00007FFDA3F84A3D  vmovupd     xmm0,xmmword ptr [r8]  
00007FFDA3F84A42  vmovupd     xmmword ptr [rsp+50h],xmm0  
00007FFDA3F84A49  vmovupd     xmm0,xmmword ptr [rsp+50h]  
00007FFDA3F84A50  lea         rax,[rsp+80h]  
00007FFDA3F84A58  vmovupd     xmmword ptr [rax],xmm0  
00007FFDA3F84A5D  vmovdqu     xmm0,xmmword ptr [rsp+90h]  
00007FFDA3F84A67  vmovdqu     xmmword ptr [rsp+40h],xmm0  
00007FFDA3F84A6E  vmovdqu     xmm0,xmmword ptr [rsp+80h]  
00007FFDA3F84A78  vmovdqu     xmmword ptr [rsp+30h],xmm0  
00007FFDA3F84A7F  vmovdqu     xmm0,xmmword ptr [rsp+40h]  
00007FFDA3F84A86  vmovdqu     xmmword ptr [rsp+20h],xmm0  
00007FFDA3F84A8D  vmovdqu     xmm0,xmmword ptr [rsp+30h]  
00007FFDA3F84A94  vmovdqu     xmmword ptr [rsp+10h],xmm0  
00007FFDA3F84A9B  vmovups     xmm0,xmmword ptr [rsp+20h]  
00007FFDA3F84AA2  vmovups     xmm1,xmmword ptr [rsp+10h]  
00007FFDA3F84AA9  vaddps      xmm0,xmm0,xmm1  
00007FFDA3F84AAE  lea         rax,[rsp]  
00007FFDA3F84AB2  vmovupd     xmmword ptr [rax],xmm0  
00007FFDA3F84AB7  vmovdqu     xmm0,xmmword ptr [rsp]  
00007FFDA3F84ABD  vmovdqu     xmmword ptr [rsp+70h],xmm0  
00007FFDA3F84AC4  vmovups     xmm0,xmmword ptr [rsp+70h]  
00007FFDA3F84ACB  vmovupd     xmmword ptr [rsp+0A0h],xmm0  
00007FFDA3F84AD5  vmovupd     xmm0,xmmword ptr [rsp+0A0h]  
00007FFDA3F84ADF  vmovupd     xmmword ptr [rcx],xmm0  
00007FFDA3F84AE4  add         rsp,0B8h  
00007FFDA3F84AEB  pop         rsi  
00007FFDA3F84AEC  pop         rdi  
00007FFDA3F84AED  ret  

It looks like the JIT has now decided, for some reason, that it can't just use the registers and instead spills to temporary stack variables, but I can't understand why. At first I thought it might be an alignment issue, but that doesn't explain why it first loads each operand into xmm0 and then round-trips it through memory.

What is going on here? And more importantly, can I fix it?

The reason that I would like to wrap the structure like this is that I have a lot of legacy code that uses an API whose implementation would benefit from some SIMD goodness.

EDIT: So, after some digging around in the coreclr source, I found out that there is actually nothing special about the System.Numerics types. I just have to add the System.Numerics.JitIntrinsic attribute to my methods. The JIT will then replace my implementation with its own. JitIntrinsic is private? No problem, just copy and paste it. The original question still remains, though (even if I now have a workaround).

Demount answered 4/1, 2016 at 21:31 Comment(0)

Poor performance when wrapping Numerics.Vector was a compiler issue and the fix was committed to master on Jan 20 2017:

https://github.com/dotnet/coreclr/issues/7508

I don't know how propagation works exactly on this project, but it seems like the fix will be part of the 2.0.0 release.
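
If you want to check whether the runtime you are on still pays for the wrapper, a rough timing comparison along these lines might help. This is only a sketch: WrappedVector4 and WrapperCheck are made-up names that mirror SomeWrapper from the question, the iteration count is arbitrary, and a Stopwatch loop is a crude measurement (looking at the disassembly, as in the question, is more reliable).

```csharp
using System;
using System.Diagnostics;
using System.Numerics;
using System.Runtime.CompilerServices;

// Hypothetical wrapper, shaped like SomeWrapper in the question.
public struct WrappedVector4
{
    public Vector4 X;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public WrappedVector4(Vector4 x) { X = x; }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static WrappedVector4 operator +(WrappedVector4 a, WrappedVector4 b)
    {
        return new WrappedVector4(a.X + b.X);
    }
}

public static class WrapperCheck
{
    // NoInlining keeps each variant as a separate method, as GetIt is in the question.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Vector4 AddDirect(Vector4 a, Vector4 b)
    {
        return a + b;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Vector4 AddWrapped(Vector4 a, Vector4 b)
    {
        return (new WrappedVector4(a) + new WrappedVector4(b)).X;
    }

    public static void Main()
    {
        const int iterations = 50000000; // arbitrary
        var step = new Vector4(1f, 2f, 3f, 4f);

        // Call both once up front so JIT compilation is not part of the timing.
        AddDirect(step, step);
        AddWrapped(step, step);

        var acc = Vector4.Zero;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) acc = AddDirect(acc, step);
        sw.Stop();
        Console.WriteLine("direct:  " + sw.ElapsedMilliseconds + " ms (" + acc + ")");

        acc = Vector4.Zero;
        sw.Restart();
        for (int i = 0; i < iterations; i++) acc = AddWrapped(acc, step);
        sw.Stop();
        Console.WriteLine("wrapped: " + sw.ElapsedMilliseconds + " ms (" + acc + ")");
    }
}
```

Build it in Release mode and run it without a debugger attached, otherwise the JIT optimizations are disabled and both variants will look slow. If the two timings are close, the wrapper is most likely being optimized away on that runtime.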

Airsick answered 24/5, 2017 at 10:40 Comment(0)

The problem comes just from the fact that a Vector4 contains 4 longs and a DirectX Vector4 contains 4 floats. In each case, passing whole vectors just to add the X components makes the code much more complex, because W, Y and Z have to be copied even if they are unchanged. The vectors are copied during each "new SomeWrapper(v)" and one last time outside the function, to assign the result to the variable.

Optimizing struct code is very tricky. With a struct you save heap allocation time, but because of the multiple copies the code becomes longer.

Two things can help you:

1) Do not use wrappers; use extension methods instead, which avoid the copy into the wrapper.

2) Do not allocate new vectors to return values; reuse one of the operands when possible (this optimizes the code but makes the type mutable, unlike other arithmetic types, so use it with extreme caution).

Sample:

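struct Vector
{
    public long X;
    public long Y;
}

static class VectorExtension
{
    // "this ref" (C# 7.2 or later) passes the struct by reference; with a plain
    // "this Vector v" the method would only modify a copy of the caller's value.
    public static void AddToMe(this ref Vector v, long x, long y)
    {
        v.X += x;
        v.Y += y;
    }

    public static void AddToMe(this ref Vector v, Vector v2)
    {
        v.X += v2.X;
        v.Y += v2.Y;
    }
}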
Godfry answered 6/5, 2016 at 16:7 Comment(1)
All fields are floats. The struct wrapping is inlined except in the SIMD case. The code in your example is generally not needed. My question is why it breaks down in the SIMD case. (As I wrote in my update, I was able to find an acceptable workaround.) - Demount
