Why is casting a struct via Pointer slow, while Unsafe.As is fast?
Background

I wanted to make a few integer-sized structs (i.e. 32- and 64-bit-sized) that are easily convertible to/from primitive unmanaged types of the same size (i.e. Int32 and UInt32 for the 32-bit-sized struct in particular).

The structs would then expose additional functionality for bit manipulation / indexing that is not available on integer types directly. Basically, as a sort of syntactic sugar, improving readability and ease of use.

The important part, however, was performance: this extra abstraction should come at essentially zero cost (at the end of the day the CPU should "see" the same bits as if it were dealing with primitive ints).

Sample Struct

Below is just the very basic struct I came up with. It does not have all the functionality, but enough to illustrate my questions:

[StructLayout(LayoutKind.Explicit, Pack = 1, Size = 4)]
public struct Mask32 {
  [FieldOffset(3)]
  public byte Byte1;
  [FieldOffset(2)]
  public ushort UShort1;
  [FieldOffset(2)]
  public byte Byte2;
  [FieldOffset(1)]
  public byte Byte3;
  [FieldOffset(0)]
  public ushort UShort2;
  [FieldOffset(0)]
  public byte Byte4;

  [DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
  public static unsafe implicit operator Mask32(int i) => *(Mask32*)&i;
  [DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
  public static unsafe implicit operator Mask32(uint i) => *(Mask32*)&i;
}
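As a quick sanity check of the explicit layout, here is a small sketch (it assumes a little-endian machine, where FieldOffset(0) overlays the least significant byte of the int, which is why Byte1 sits at offset 3; requires compiling with /unsafe):

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit, Pack = 1, Size = 4)]
public struct Mask32
{
    [FieldOffset(3)] public byte Byte1;     // most significant byte (little-endian)
    [FieldOffset(2)] public ushort UShort1;
    [FieldOffset(2)] public byte Byte2;
    [FieldOffset(1)] public byte Byte3;
    [FieldOffset(0)] public ushort UShort2;
    [FieldOffset(0)] public byte Byte4;     // least significant byte

    public static unsafe implicit operator Mask32(int i) => *(Mask32*)&i;
}

public static class LayoutCheck
{
    public static void Main()
    {
        Mask32 m = 0x11223344;
        // The overlays pick out the expected bytes/halves of the int:
        Console.WriteLine($"{m.Byte1:X2} {m.Byte2:X2} {m.Byte3:X2} {m.Byte4:X2}"); // 11 22 33 44
        Console.WriteLine($"{m.UShort1:X4} {m.UShort2:X4}");                       // 1122 3344
    }
}
```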

The Test

I wanted to test the performance of this struct. In particular, I wanted to see if it could get me the individual bytes just as quickly as regular bitwise arithmetic: (i >> 8) & 0xFF (to get the 3rd byte, for example).

Below you will see a benchmark I came up with:

public unsafe class MyBenchmark {

  const int count = 50000;

  [Benchmark(Baseline = true)]
  public static void Direct() {
    var j = 0;
    for (int i = 0; i < count; i++) {
      //var b1 = i.Byte1();
      //var b2 = i.Byte2();
      var b3 = i.Byte3();
      //var b4 = i.Byte4();
      j += b3;
    }
  }


  [Benchmark]
  public static void ViaStructPointer() {
    var j = 0;
    int i = 0;
    var s = (Mask32*)&i;
    for (; i < count; i++) {
      //var b1 = s->Byte1;
      //var b2 = s->Byte2;
      var b3 = s->Byte3;
      //var b4 = s->Byte4;
      j += b3;
    }
  }

  [Benchmark]
  public static void ViaStructPointer2() {
    var j = 0;
    int i = 0;
    for (; i < count; i++) {
      var s = *(Mask32*)&i;
      //var b1 = s.Byte1;
      //var b2 = s.Byte2;
      var b3 = s.Byte3;
      //var b4 = s.Byte4;
      j += b3;
    }
  }

  [Benchmark]
  public static void ViaStructCast() {
    var j = 0;
    for (int i = 0; i < count; i++) {
      Mask32 m = i;
      //var b1 = m.Byte1;
      //var b2 = m.Byte2;
      var b3 = m.Byte3;
      //var b4 = m.Byte4;
      j += b3;
    }
  }

  [Benchmark]
  public static void ViaUnsafeAs() {
    var j = 0;
    for (int i = 0; i < count; i++) {
      var m = Unsafe.As<int, Mask32>(ref i);
      //var b1 = m.Byte1;
      //var b2 = m.Byte2;
      var b3 = m.Byte3;
      //var b4 = m.Byte4;
      j += b3;
    }
  }

}

Byte1(), Byte2(), Byte3(), and Byte4() are just extension methods that do get inlined; each simply extracts the n-th byte via bitwise operations and a cast:

[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte1(this int it) => (byte)(it >> 24);
[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte2(this int it) => (byte)((it >> 16) & 0xFF);
[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte3(this int it) => (byte)((it >> 8) & 0xFF);
[DebuggerStepThrough, MethodImpl(MethodImplOptions.AggressiveInlining)]
public static byte Byte4(this int it) => (byte)it;
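For reference, a quick sketch of what those helpers return (the numbering runs from the most significant byte down, matching the FieldOffset layout of Mask32 above):

```csharp
using System;

public static class ByteExtensions
{
    public static byte Byte1(this int it) => (byte)(it >> 24);
    public static byte Byte2(this int it) => (byte)((it >> 16) & 0xFF);
    public static byte Byte3(this int it) => (byte)((it >> 8) & 0xFF);
    public static byte Byte4(this int it) => (byte)it;
}

public static class Demo
{
    public static void Main()
    {
        int i = 0x11223344;
        // Byte1 is the most significant byte, Byte4 the least significant:
        Console.WriteLine($"{i.Byte1():X2} {i.Byte2():X2} {i.Byte3():X2} {i.Byte4():X2}"); // 11 22 33 44
    }
}
```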

EDIT: Fixed the code to make sure variables are actually used. Also commented out 3 of 4 variables to really test struct casting / member access rather than actually using the variables.

The Results

I ran these in the Release build with optimizations on x64.

Intel Core i7-3770K CPU 3.50GHz (Ivy Bridge), 1 CPU, 8 logical cores and 4 physical cores
Frequency=3410223 Hz, Resolution=293.2360 ns, Timer=TSC
  [Host]     : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.6.1086.0
  DefaultJob : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.6.1086.0


            Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
------------------ |----------:|----------:|----------:|-------:|---------:|
            Direct |  14.47 us | 0.3314 us | 0.2938 us |   1.00 |     0.00 |
  ViaStructPointer | 111.32 us | 0.6481 us | 0.6062 us |   7.70 |     0.15 |
 ViaStructPointer2 | 102.31 us | 0.7632 us | 0.7139 us |   7.07 |     0.14 |
     ViaStructCast |  29.00 us | 0.3159 us | 0.2800 us |   2.01 |     0.04 |
       ViaUnsafeAs |  14.32 us | 0.0955 us | 0.0894 us |   0.99 |     0.02 |

EDIT: New results after fixing the code:

            Method |      Mean |     Error |    StdDev | Scaled | ScaledSD |
------------------ |----------:|----------:|----------:|-------:|---------:|
            Direct |  57.51 us | 1.1070 us | 1.0355 us |   1.00 |     0.00 |
  ViaStructPointer | 203.20 us | 3.9830 us | 3.5308 us |   3.53 |     0.08 |
 ViaStructPointer2 | 198.08 us | 1.8411 us | 1.6321 us |   3.45 |     0.06 |
     ViaStructCast |  79.68 us | 1.5478 us | 1.7824 us |   1.39 |     0.04 |
       ViaUnsafeAs |  57.01 us | 0.8266 us | 0.6902 us |   0.99 |     0.02 |

Questions

The benchmark results were surprising to me, and that's why I have a few questions:

EDIT: Fewer questions remain after altering the code so that the variables actually get used.

  1. Why is the pointer stuff so slow?
  2. Why is the cast taking twice as long as the baseline case? Aren't implicit/explicit operators inlined?
  3. How come the new System.Runtime.CompilerServices.Unsafe package (v. 4.5.0) is so fast? I thought it would at least involve a method call...
  4. More generally, how can I make essentially a zero-cost struct that would simply act as a "window" onto some memory or a biggish primitive type like UInt64 so that I can more effectively manipulate / read that memory? What's the best practice here?
Sepulture answered 15/6, 2018 at 7:27 Comment(16)
Just to check: are you running these tests as a release build outside the debugger? – Knowable
how can I make essentially a zero-cost struct that would simply act as a "window" The new Span<> and Memory<> should be for this, I think – Samuella
@Samuella Those are too general-purpose. I meant specific ones, like say PixelRGBA, just as an example. – Sepulture
@MatthewWatson Yes, release build x64. – Sepulture
It would be great if you could post a compilable console app. As it is, this stuff won't compile, and I imagine most people aren't going to spend the time trying to fix it up... – Knowable
I tried a couple of your test methods, and the compiler is removing the code from the loop, since the variables in the loop are not used. Therefore I suspect that your tests are meaningless. – Knowable
I was going to write exactly what Matthew Watson wrote... Normally I would add an if like if (b1 + b2 + b3 + b4 == int.MaxValue) { throw new Exception(); } or something similar – Samuella
@FitDev Thanks for pointing this out. I will check this out. Though even if that's the case, the questions still remain, since in all 5 tests none of the variables are used, so by that token they should all produce the same results. Working on the console app as you suggested, though I'm not sure how to post it here on SO. – Sepulture
@FitDev Some side effects that take time remain after optimization... others are entirely removed. – Samuella
@Claies Because sometimes you are programming using a language and you really, really need to do something else, but you still want to use the language you used for 99% of the code. – Samuella
@Claies C# is extremely efficient at it, and it's very nice to be able to do stuff at the same speed as an unmanaged program. This is the entire reason that the new Memory<T>, Span<T> and the like were added to the language. – Knowable
@MatthewWatson Thank you for your comments! I made some changes to the code and re-ran the benchmark. The results are still somewhat strange. I am still not sure why the pointer-like casting is a lot slower and how it is that Unsafe.As is fast. – Sepulture
@Samuella Thank you for the comments; I did change the code to make sure the variables are used. The results are still somewhat strange. I am still not sure why the pointer-like casting is a lot slower and how it is that Unsafe.As is fast. – Sepulture
@FitDev You are still doing operations without side effects. Simply modifying a local variable doesn't cause side effects. An intelligent compiler could remove it all. It works this way: if a local variable is written but not read, then it is useless and can be removed. You need to do an if with a side effect (like the throw), or add the value to a static variable or something similar. – Samuella
@Samuella Thanks for pointing this out. I guess I should introduce branching. But I am loath to use throws because that, for one thing, is known to prevent inlining. – Sepulture
@FitDev Another solution is doing what Matthew Watson did in his code: return the local variable with the value. The code then can't simply be removed (though technically the JIT could inline aggressively and remove everything, because even the caller isn't using the return value). The static variable is the best way: there is no simple way for the compiler or the JIT to remove a write to a static variable (proving that no one is using it is very, very difficult). – Samuella

The answer to this appears to be that the JIT compiler can make certain optimisations better when you are using Unsafe.As().

Unsafe.As() is implemented very simply like this:

public static ref TTo As<TFrom, TTo>(ref TFrom source)
{
    return ref source;
}

That's it!
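In other words, it just reinterprets the reference without copying or converting anything. Here is a small sketch of that aliasing behaviour (using Unsafe.As<TFrom, TTo> from the System.Runtime.CompilerServices.Unsafe package):

```csharp
using System;
using System.Runtime.CompilerServices;

public static class AliasDemo
{
    public static void Main()
    {
        int i = 0x11223344;

        // The returned ref is an alias for the same 4 bytes; no copy is made...
        ref uint u = ref Unsafe.As<int, uint>(ref i);
        Console.WriteLine($"{u:X8}"); // 11223344

        // ...so writing through it changes the original local:
        u = 0xFFFFFFFF;
        Console.WriteLine(i); // -1
    }
}
```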

Here's a test program I wrote to compare that with casting:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

namespace Demo
{
    [StructLayout(LayoutKind.Explicit, Pack = 1, Size = 4)]
    public struct Mask32
    {
        [FieldOffset(3)]
        public byte Byte1;
        [FieldOffset(2)]
        public ushort UShort1;
        [FieldOffset(2)]
        public byte Byte2;
        [FieldOffset(1)]
        public byte Byte3;
        [FieldOffset(0)]
        public ushort UShort2;
        [FieldOffset(0)]
        public byte Byte4;
    }

    public static unsafe class Program
    {
        static int count = 50000000;

        public static int ViaStructPointer()
        {
            int total = 0;

            for (int i = 0; i < count; i++)
            {
                var s = (Mask32*)&i;
                total += s->Byte1;
            }

            return total;
        }

        public static int ViaUnsafeAs()
        {
            int total = 0;

            for (int i = 0; i < count; i++)
            {
                var m = Unsafe.As<int, Mask32>(ref i);
                total += m.Byte1;
            }

            return total;
        }

        public static void Main(string[] args)
        {
            var sw = new Stopwatch();

            sw.Restart();
            ViaStructPointer();
            Console.WriteLine("ViaStructPointer took " + sw.Elapsed);

            sw.Restart();
            ViaUnsafeAs();
            Console.WriteLine("ViaUnsafeAs took " + sw.Elapsed);
        }
    }
}

The results I get on my PC (x64 release build) are as follows:

ViaStructPointer took 00:00:00.1314279
ViaUnsafeAs took 00:00:00.0249446

As you can see, ViaUnsafeAs is indeed much quicker.

So let's look at what the compiler has generated:

public static unsafe int ViaStructPointer()
{
    int total = 0;
    for (int i = 0; i < Program.count; i++)
    {
        total += (*(Mask32*)(&i)).Byte1;
    }
    return total;
}

public static int ViaUnsafeAs()
{
    int total = 0;
    for (int i = 0; i < Program.count; i++)
    {
        total += (Unsafe.As<int, Mask32>(ref i)).Byte1;
    }
    return total;
}   

OK, there's nothing obvious there. But what about the IL?

.method public hidebysig static int32 ViaStructPointer () cil managed 
{
    .locals init (
        [0] int32 total,
        [1] int32 i,
        [2] valuetype Demo.Mask32* s
    )

    IL_0000: ldc.i4.0
    IL_0001: stloc.0
    IL_0002: ldc.i4.0
    IL_0003: stloc.1
    IL_0004: br.s IL_0017
    .loop
    {
        IL_0006: ldloca.s i
        IL_0008: conv.u
        IL_0009: stloc.2
        IL_000a: ldloc.0
        IL_000b: ldloc.2
        IL_000c: ldfld uint8 Demo.Mask32::Byte1
        IL_0011: add
        IL_0012: stloc.0
        IL_0013: ldloc.1
        IL_0014: ldc.i4.1
        IL_0015: add
        IL_0016: stloc.1

        IL_0017: ldloc.1
        IL_0018: ldsfld int32 Demo.Program::count
        IL_001d: blt.s IL_0006
    }

    IL_001f: ldloc.0
    IL_0020: ret
}

.method public hidebysig static int32 ViaUnsafeAs () cil managed 
{
    .locals init (
        [0] int32 total,
        [1] int32 i,
        [2] valuetype Demo.Mask32 m
    )

    IL_0000: ldc.i4.0
    IL_0001: stloc.0
    IL_0002: ldc.i4.0
    IL_0003: stloc.1
    IL_0004: br.s IL_0020
    .loop
    {
        IL_0006: ldloca.s i
        IL_0008: call valuetype Demo.Mask32& [System.Runtime.CompilerServices.Unsafe]System.Runtime.CompilerServices.Unsafe::As<int32, valuetype Demo.Mask32>(!!0&)
        IL_000d: ldobj Demo.Mask32
        IL_0012: stloc.2
        IL_0013: ldloc.0
        IL_0014: ldloc.2
        IL_0015: ldfld uint8 Demo.Mask32::Byte1
        IL_001a: add
        IL_001b: stloc.0
        IL_001c: ldloc.1
        IL_001d: ldc.i4.1
        IL_001e: add
        IL_001f: stloc.1

        IL_0020: ldloc.1
        IL_0021: ldsfld int32 Demo.Program::count
        IL_0026: blt.s IL_0006
    }

    IL_0028: ldloc.0
    IL_0029: ret
}

Aha! The only difference here is this:

ViaStructPointer: conv.u
ViaUnsafeAs:      call valuetype Demo.Mask32& [System.Runtime.CompilerServices.Unsafe]System.Runtime.CompilerServices.Unsafe::As<int32, valuetype Demo.Mask32>(!!0&)
                  ldobj Demo.Mask32

On the face of it, you would expect conv.u to be faster than the two instructions used for Unsafe.As. However, it seems that the JIT compiler is able to optimise those two instructions much better than the single conv.u.

It's reasonable to ask why that is; unfortunately I don't have an answer to that yet! I'm almost certain that the call to Unsafe::As<>() is being inlined by the JIT, and that it is then being optimised further.

There is some information about the Unsafe class' optimisations here.

Note that the IL generated for Unsafe.As<> is simply this:

.method public hidebysig static !!TTo& As<TFrom, TTo> (
        !!TFrom& source
    ) cil managed aggressiveinlining 
{
    .custom instance void System.Runtime.Versioning.NonVersionableAttribute::.ctor() = (
        01 00 00 00
    )
    IL_0000: ldarg.0
    IL_0001: ret
}

Now I think it becomes clearer as to why the JIT can optimise that so well.

Knowable answered 15/6, 2018 at 8:21 Comment(6)
Thank you so much for your thorough answer and suggestions! So would you say then that it is a safe bet (in regards to both .NET Framework 4.6.1+ and .NET Core 2+) to rely on Unsafe.As<> for performance reasons and generally try to avoid casting using pointers? – Sepulture
@FitDev Yes, I think so; they've designed it exactly for that sort of purpose! – Knowable
@MatthewWatson I looked into the source code and found this: Unsafe.cs,88. Why do they throw instead of return? – Metritis
@Metritis Because it's not supported on that platform. – Knowable
@MatthewWatson Isn't it because they are intrinsics (see the attribute on each method)? The body of the method would be replaced by the JIT / execution environment (or at least that's how I read it). See the comment at the top of that link. – Sofer
The function public static ref TTo As<TFrom, TTo>(ref TFrom source) is an intrinsic and cannot be implemented like you suggest. The return type is wrong. The function is implemented in the CLR. – Aurum

When you take the address of a local, the JIT generally has to keep that local on the stack. That's the case here: in the ViaStructPointer version, i itself is kept on the stack. In the ViaUnsafeAs version, i is copied to a temp and the temp is kept on the stack. The former is slower because i is also used to control the iteration of the loop.

You can get pretty close to the ViaUnsafeAs performance with the following code, where you explicitly make the copy yourself:

    public static int ViaStructPointer2()
    {
        int total = 0;

        for (int i = 0; i < count; i++)
        {
            int j = i;
            var s = (Mask32*)&j;
            total += s->Byte1;
        }

        return total;
    }

ViaStructPointer  took 00:00:00.1147793
ViaUnsafeAs       took 00:00:00.0282828
ViaStructPointer2 took 00:00:00.0257589
Silicic answered 21/6, 2018 at 8:39 Comment(0)
