Why is my operator ++ more than twice as fast as its equivalent instance method?
Asked Answered
M

2

5

I'm running BenchmarkDotNet against the following code on .NET 8:

using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[StructLayout(LayoutKind.Explicit)]
public readonly struct Counter
{
    [FieldOffset(0)] public readonly int Value;
    [FieldOffset(1)] public readonly byte MidByte;

    public Counter(int value)
        : this()
    {
        this.Value = value;
    }

    public Counter Increment(int count) // 0 <= count <= 32
    {
        int value = this.Value + count;
        int lowByte = (byte)value;
        if (lowByte > 32)
            value = value - 32 + (this.MidByte < 16 ? 0x100 : 0xF100);
        return new Counter(value);
    }

    public static Counter operator ++(Counter counter) => counter.Increment(1);
}

public class Benchmarks
{
    [Benchmark]
    public Counter Benchmark1()
    {
        Counter c = new(0x10101);
        for (int i = 0; i < 10000; ++i) {
            c = c.Increment(1);
        }
        return c;
    }

    [Benchmark]
    public Counter Benchmark2()
    {
        Counter c = new(0x10101);
        for (int i = 0; i < 10000; ++i) {
            ++c;
        }
        return c;
    }
}

The only difference between Benchmark1 and Benchmark2 is that Benchmark2 calls operator ++ (which in turn calls Increment(1)), whereas Benchmark1 just calls Increment(1) directly.

Because the JIT compiler will likely inline operator ++, I expected these two benchmarks to perform the same. To my great surprise and bafflement, ++c absolutely demolishes c = c.Increment(1):

BenchmarkDotNet v0.13.8, Windows 10 (10.0.19045.4651/22H2/2022Update)
11th Gen Intel Core i7-11800H 2.30GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.303
 [Host] : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
 DefaultJob : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2

Method Mean Error StdDev Code Size
Benchmark1 18.376 us 0.1999 us 0.1870 us 68 B
Benchmark2 6.564 us 0.0436 us 0.0408 us 81 B

Why is ++c so much faster than c.Increment(1) when the former simply calls the latter?


UPDATE #1: A couple commenters questioned whether ++c simply discards the result instead of updating c. I verified that both Benchmark1 and Benchmark2 return the same Counter value (0x00140911), which proves that both benchmarks perform the same 10000 calculations.


UPDATE #2: If I replace the reference to this.MidByte in Counter.Increment with the equivalent expression (byte)(value >> 8), then the performance gap between Benchmark1 and Benchmark2 vanishes:

Method Mean Error StdDev
Benchmark1 4.662 us 0.0489 us 0.0433 us
Benchmark2 4.595 us 0.0399 us 0.0373 us
Maricruzmaridel answered 29/7 at 14:18 Comment(7)
It does generate different IL code. See sharplab.io - put your code to the left, set output to IL on release. Generated Link is too long for a comment.Extractive
I get Benchmark2 50% slower than Benchmark1 somewhat expectedly (.net6 and .net7)Undersized
@Fildor It being a struct or class is irrelevant here, this is just how the ++ operator works.Eyelash
@Eyelash ... reading the specs right now. I totally thought that wouldn't work in a ref type ... I was wrong ... :oPapyrus
Like Ivan Petrov, I cannot reproduce the results either. 8.5 us for Benchmark 1 and 17.3 for 2. I'm running on .NET 8, MacBook Pro, M3 chip.Kerbela
@IvanPetrov and @Sweeper: That's interesting. On .NET 6, I get 26.834 us for Benchmark1 and 7.281 us for Benchmark2. In your case, 50% or more in the other direction is also a large difference. I wonder if you'd get the same result if you applied [MethodImpl(MethodImplOptions.AggressiveInlining)] to operator ++.Maricruzmaridel
No repro. I get 3.4 and 3.7 us. Core i7-1260PLitigant
U
6

There is definitely a difference in the JITted code between .NET 6 and .NET 8. And I think it's due to

  1. Struct copy semantics with temp stack variables being optimized away better in .NET8. There is hint at this in this github comment from 2019:

There are some correctness concerns related to exception semantics and self-modification that require staging some struct operations through temps. The copies (and temps) can sometimes be optimized away later after global (method-wide) analysis. The jit does not do this analysis yet, but hopefully will soon -- see the plans for first-class structs.

and also this issue and the comments/linked issues.

  1. Having the struct's layout set as explicit with a method that uses the explicitly offset field (MidByte). This github issue from 2023 seems very relevant to our case. Especially the two comments:

Explicit layouts, especially those with overlapping fields, have historically been pessimized in their codegen quality.

In .NET 8 we made improvements in this area, so the codegen might already be improved for you. In particular the JIT is able to reason about structs with overlapping fields if you aren't mixing accesses within the same function.

  1. (My observation/speculation from sharplab x86 assembly code analysis for our case to follow presently). There is a difference in how the JITter inlines static (the operator overload) vs instance methods (Increment).

First of all, since there is a lot of inlining going on, I've simplified the Increment method (still preserving the differences in emitted code) (full sharplab):

public Counter Increment(int count)
{
    int value = this.Value + count;    
    value = value + this.MidByte;
    return new Counter(value);
}

EDIT: The following only holds true for our very special case of having an explicit struct layout AND using the MidByte explitly laid out field in the method. If you experiment in the sharplab I linked you'd find that the JIT compiler produces identical code for Benchmark1 and Benchmark2 if we have default sequential layout (if you manage to find a corner case too there please comment)

There is no difference between .NET 6 and .NET 8, for how Counter.Increment and Counter.op_Increment are jitted:

Counter.Increment(Int32)
    L0000: push eax
    L0001: add edx, [ecx]
    L0003: movzx eax, byte ptr [ecx+1]
    L0007: add eax, edx
    L0009: mov [esp], eax
    L000c: pop ecx
    L000d: ret

Counter.op_Increment(Counter)
    L0000: push eax
    L0001: mov eax, [esp+8]
    L0005: movzx edx, byte ptr [esp+9]
    L000a: lea eax, [eax+edx+1]
    L000e: mov [esp], eax
    L0011: pop ecx
    L0012: ret 4

Even at this point the call to Counter_Increment is inlined in the op_Increment method.

The difference between the methods lies in the fact that the Counter.Increment is an instance method and relies on this being in [ecx] whereas the Counter.op_Increment(Counter) as a static method that has a reference to the the counter instance on the stack [esp+8]. Logic is the same otherwise as the instance method is really inlined in the static method.

There definitely seems to be an optimization in Benchmark1 (that makes calls to Counter.Increment(Int32)) between .NET 6 and .NET 8 (stripping the prologue/epilogue/loop code):

.NET 6 instance
L0006 mov dword ptr [ebp-4], 0x10101 
L000d xor eax, eax 
L0010 mov edx, [ebp-4] // inlinining directly on this (see Counter.Increment we have [ecx] = [ebp-4])
L0013 movzx ecx, byte ptr [ebp-3] 
L0017 lea edx, [edx+ecx+1] 
L001b mov [ebp-8], edx // struct copy back result semantics
L001e mov edx, [ebp-8] //
L0021 mov [ebp-4], edx // 


.NET 8 instance
L0004: mov dword ptr [ebp-4], 0x10101
L000b: xor eax, eax
L0010: mov edx, [ebp-4] // inlining on this
L0013: movzx ecx, byte ptr [ebp-3]
L0017: lea edx, [edx+ecx+1]
// struct copy semantics gone
// we are in place
L001b: mov [ebp-4], edx // still save the result in ebp-4

As you can see the unnecessary struct copying ceremony is optimized away.

Now to the case of Benchmark2:

.NET 6 static
L0006 mov dword ptr [ebp-4], 0x10101 
L000d xor eax, eax 
L000f mov edx, [ebp-4] 
L0012 mov [ebp-8], edx  // simulating the static call with copy semantics
L0015 mov edx, [ebp-8]  // Inside op_Increment
L0018 movzx ecx, byte ptr [ebp-7] 
L001c lea edx, [edx+ecx+1] 
L0020 mov [ebp-0xc], edx // copy semantic "return"
L0023 mov edx, [ebp-0xc] 
L0026 mov [ebp-4], edx // assigning back to the local variable


.NET 8
L0006: mov eax, 0x10101
L000b: xor edx, edx
L0010: mov [ebp-8], eax // simulating the static call
L0013: movzx ecx, byte ptr [ebp-7] // Inside op_Increment
L0017: lea eax, [eax+ecx+1]
// struct copy semantics gone

All the temp variables unnecessary copying is gone.

The interesting thing is that here we don't have the initial assignment to a local variable:

L0006 mov dword ptr [ebp-4], 0x10101 

instead we do:

L0006: mov eax, 0x10101

My speculation is that it's because we inline a static method which relies on mutation of a local variable and not this.

In Benchmark1 this was optimized to be [ebp-4] which was already declared, so we reused it.

In Benchmark2 we don't mutate this but we need to share a local variable from our method (Benchmark2) as a local variable for the static inlined method (Counter.op_Increment(Counter)). The static method also needs an argument. It normally got it from the stack, but here we are inlined, so the best thing is to receive it from a register eax. In the spirit of removing ceremony we omitted all the temp stack variables, so we can use only the register for holding counter.

Looking at the assembly of Benchmark1 vs Benchmark2 shows why the latter is faster:

Benchmark1()
L0004: mov dword ptr [ebp-4], 0x10101
L0010: mov edx, [ebp-4] // inlining on this
L0013: movzx ecx, byte ptr [ebp-3]
L0017: lea edx, [edx+ecx+1]
L001b: mov [ebp-4], edx 

Benchmark2()
L0006: mov eax, 0x10101
L0010: mov [ebp-8], eax // simulating the static call
L0013: movzx ecx, byte ptr [ebp-7] // Inside op_Increment
L0017: lea eax, [eax+ecx+1]

we have one more instruction in Benchmark1 because our "outer" variable is on the stack and not in a register.

Needless to say, this observed behavior can and probably would change in future versions of .NET as there are more plans to optimize structs.

Undersized answered 29/7 at 19:39 Comment(3)
+1 for the assembly analysis. Question: In a comment, you wrote that Benchmark2 is 50% slower than Benchmark1. Are you seeing different results in .NET 8?Maricruzmaridel
@MichaelLiu yes. I'm just updating my answer to reflect that you managed to find a very very special case most likely due to the explicit layout you set for the struct. If you play with the sharplab demo and don't have explicit layout, things are identical for Benchmark1 and Benchmark2, or at least I haven't managed to find a way to make them different.Undersized
Great answer. So a progression rather than a regression, but still some corner cases in newer versions of .NETSeduction
S
4

I can confirm there is definitely a difference in performance, at least when testing in .NET 7 using BenchmarkDotNet.

  • Increment is an instance method, so this is passed as a ref. Calling c = c.Increment(1); is effectively the following C code

    c = Increment(&c, 1);
    

    which means that the address of c is loaded onto the stack, to pass to the Increment function. I suspect this is preventing the CPU's (or the JIT's) optimizations, because it can't factor out the write back to c because it's been aliased.

  • Whereas, because the ++ operator uses a by-value parameter, calling ++c would be equivalent (after inlining) to the following code

    Counter c1 = c;
    c = Increment(&c1, 1);
    

    which means the c variable is not aliased within the Increment function.

    An equivalently slow version of the ++ operator would use Counter operator ++(in Counter counter) so that the pointer aliasing is passed through.

If there are differences across different .NET versions then it may be a regression in the JITter, so maybe file a bug report in that case.

Seduction answered 29/7 at 16:45 Comment(5)
+1 for the point about possible aliasing. I'd like a definitive explanation though. Also, it's strange (and worrisome) that other commentators saw ++c perform substantially worse than Increment(1).Maricruzmaridel
c = c++ is a no op which the JIT optimizes away, that's why it's so fast - it essentially only runs the loop without ever invoking IncrementUndersized
@MichaelLiu It may be CPU and .NET version dependent. I strongly suspect the aliasing issue, because an in parameter gave the same slow result.Seduction
It looks like the reference to this.MidByte in Counter.Increment has something to do with this. If I replace it with the equivalent (byte)(value >> 8), then both versions are equally fast.Maricruzmaridel
@MichaelLiu yeah, this is the main cause. Using the explicitly offset field triggers different JIT inlining.Undersized

© 2022 - 2024 — McMap. All rights reserved.