To start with, this is not the same as "Why is Func<> created from Expression<Func<>> slower than Func<> declared directly?"; surprisingly, it is just the opposite. Additionally, every link and question that I found while researching this issue dates from the 2010-2012 period, so I have decided to open a new question here to see whether there is some discussion to be had around the current state of delegate behavior in the .NET ecosystem.
That said, I am using .NET Core 2.0 and .NET Framework 4.7.1 and am seeing some curious performance metrics for delegates that are created from a compiled expression versus delegates that are declared directly in code.
For some context on how I stumbled upon this issue, I was doing a test involving a selection of data in arrays of 1,000 and 10,000 objects, and noticed that if I used a compiled expression it was getting faster results across the board. I managed to boil this down to a very simple project that reproduces this issue which you can find here:
https://github.com/Mike-EEE/StackOverflow.Performance.Delegates
For the testing, I use two sets of benchmarks, each pairing a compiled delegate with a declared delegate, for four core benchmarks in total.
The first set consists of an empty delegate that returns a null string. The second set consists of a delegate with a simple expression in its body. I wanted to demonstrate that this issue occurs with the simplest of delegates as well as with ones that have a defined body.
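As a minimal sketch of the two kinds of delegate being compared (this is not code from the linked repository, and the class name `DelegateKinds` is made up for illustration), the same lambda text can be bound either as a `Func<>` that the C# compiler emits directly, or as an `Expression<Func<>>` that is turned into a delegate at run time with `Compile()`:

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical illustration class; not part of the benchmark project.
static class DelegateKinds
{
    static void Main()
    {
        // "Declared": the C# compiler emits a method for the lambda at build time.
        Func<string, int> declared = x => x.Length;

        // "Compiled": the lambda is captured as an expression tree and
        // turned into a delegate at run time.
        Expression<Func<string, int>> expression = x => x.Length;
        Func<string, int> compiled = expression.Compile();

        // Both delegates have the same type and compute the same result.
        Console.WriteLine(declared("Hello World!"));
        Console.WriteLine(compiled("Hello World!"));

        // What each delegate is bound to can differ between runtimes,
        // so the values are printed rather than assumed here.
        Console.WriteLine(declared.Target?.GetType());
        Console.WriteLine(compiled.Target?.GetType());
    }
}
```

From the caller's point of view both values are plain `Func<string, int>` instances, which is why a consistent timing difference between them is surprising.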
These tests are then run on the CLR runtime and the .NET Core runtime via the excellent BenchmarkDotNet library, resulting in eight total benchmarks. Additionally, I make use of the just-as-excellent BenchmarkDotNet disassembly diagnoser to emit the disassembly produced during the JIT of the benchmark methods. I share the results of this below.
Here is the code that runs the benchmarks. You can see that it is very straightforward:
[CoreJob, ClrJob, DisassemblyDiagnoser(true, printSource: true)]
public class Delegates
{
    readonly DelegatePair<string, string> _empty;
    readonly DelegatePair<string, int> _expression;
    readonly string _message;

    public Delegates() : this(new DelegatePair<string, string>(_ => default, _ => default),
                              new DelegatePair<string, int>(x => x.Length, x => x.Length)) {}

    public Delegates(DelegatePair<string, string> empty, DelegatePair<string, int> expression,
                     string message = "Hello World!")
    {
        _empty = empty;
        _expression = expression;
        _message = message;
        EmptyDeclared();
        EmptyCompiled();
        ExpressionDeclared();
        ExpressionCompiled();
    }

    [Benchmark]
    public void EmptyDeclared() => _empty.Declared(default);

    [Benchmark]
    public void EmptyCompiled() => _empty.Compiled(default);

    [Benchmark]
    public void ExpressionDeclared() => _expression.Declared(_message);

    [Benchmark]
    public void ExpressionCompiled() => _expression.Compiled(_message);
}
These are the results I see in BenchmarkDotNet:
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-4820K CPU 3.70GHz (Haswell), 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=2.1.300-preview2-008533
[Host] : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT
Clr : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0
Core : .NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT
Method | Job | Runtime | Mean | Error | StdDev |
------------------- |----- |-------- |----------:|----------:|----------:|
EmptyDeclared | Clr | Clr | 1.3691 ns | 0.0302 ns | 0.0282 ns |
EmptyCompiled | Clr | Clr | 1.1851 ns | 0.0381 ns | 0.0357 ns |
ExpressionDeclared | Clr | Clr | 1.3805 ns | 0.0314 ns | 0.0294 ns |
ExpressionCompiled | Clr | Clr | 1.1431 ns | 0.0396 ns | 0.0371 ns |
EmptyDeclared | Core | Core | 1.5733 ns | 0.0329 ns | 0.0308 ns |
EmptyCompiled | Core | Core | 0.9326 ns | 0.0275 ns | 0.0244 ns |
ExpressionDeclared | Core | Core | 1.6040 ns | 0.0394 ns | 0.0368 ns |
ExpressionCompiled | Core | Core | 0.9380 ns | 0.0485 ns | 0.0631 ns |
Do note that the benchmarks that make use of a compiled delegate are consistently faster: by roughly 13-17% on the CLR and by about 41% on .NET Core, going by the mean times above.
Finally, here are the results of the disassembly encountered for each benchmark:
<table>
<thead>
<tr><th colspan="2">Delegates.EmptyDeclared</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8f0ea0 StackOverflow.Performance.Delegates.Delegates.EmptyDeclared()
public void EmptyDeclared() => _empty.Declared(default);
^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8f0ea4 4883c110 add rcx,10h
00007ffd`4f8f0ea8 488b01 mov rax,qword ptr [rcx]
00007ffd`4f8f0eab 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`4f8f0eaf 33d2 xor edx,edx
00007ffd`4f8f0eb1 ff5018 call qword ptr [rax+18h]
00007ffd`4f8f0eb4 90 nop
</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c8d8b0 StackOverflow.Performance.Delegates.Delegates.EmptyDeclared()
public void EmptyDeclared() => _empty.Declared(default);
^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c8d8b4 4883c110 add rcx,10h
00007ffd`39c8d8b8 488b01 mov rax,qword ptr [rcx]
00007ffd`39c8d8bb 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`39c8d8bf 33d2 xor edx,edx
00007ffd`39c8d8c1 ff5018 call qword ptr [rax+18h]
00007ffd`39c8d8c4 90 nop
</code></pre></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="2">Delegates.EmptyCompiled</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8e0ef0 StackOverflow.Performance.Delegates.Delegates.EmptyCompiled()
public void EmptyCompiled() => _empty.Compiled(default);
^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8e0ef4 4883c110 add rcx,10h
00007ffd`4f8e0ef8 488b4108 mov rax,qword ptr [rcx+8]
00007ffd`4f8e0efc 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`4f8e0f00 33d2 xor edx,edx
00007ffd`4f8e0f02 ff5018 call qword ptr [rax+18h]
00007ffd`4f8e0f05 90 nop
</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c8d900 StackOverflow.Performance.Delegates.Delegates.EmptyCompiled()
public void EmptyCompiled() => _empty.Compiled(default);
^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c8d904 4883c110 add rcx,10h
00007ffd`39c8d908 488b4108 mov rax,qword ptr [rcx+8]
00007ffd`39c8d90c 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`39c8d910 33d2 xor edx,edx
00007ffd`39c8d912 ff5018 call qword ptr [rax+18h]
00007ffd`39c8d915 90 nop
</code></pre></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="2">Delegates.ExpressionDeclared</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8e0f20 StackOverflow.Performance.Delegates.Delegates.ExpressionDeclared()
public void ExpressionDeclared() => _expression.Declared(_message);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8e0f24 488d5120 lea rdx,[rcx+20h]
00007ffd`4f8e0f28 488b02 mov rax,qword ptr [rdx]
00007ffd`4f8e0f2b 488b5108 mov rdx,qword ptr [rcx+8]
00007ffd`4f8e0f2f 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`4f8e0f33 ff5018 call qword ptr [rax+18h]
00007ffd`4f8e0f36 90 nop
</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c9d930 StackOverflow.Performance.Delegates.Delegates.ExpressionDeclared()
public void ExpressionDeclared() => _expression.Declared(_message);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c9d934 488d5120 lea rdx,[rcx+20h]
00007ffd`39c9d938 488b02 mov rax,qword ptr [rdx]
00007ffd`39c9d93b 488b5108 mov rdx,qword ptr [rcx+8]
00007ffd`39c9d93f 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`39c9d943 ff5018 call qword ptr [rax+18h]
00007ffd`39c9d946 90 nop
</code></pre></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="2">Delegates.ExpressionCompiled</th></tr>
<tr>
<th>.NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2633.0</th>
<th>.NET Core 2.0.7 (CoreCLR 4.6.26328.01, CoreFX 4.6.26403.03), 64bit RyuJIT</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align:top;"><pre><code>
00007ffd`4f8f0f70 StackOverflow.Performance.Delegates.Delegates.ExpressionCompiled()
public void ExpressionCompiled() => _expression.Compiled(_message);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`4f8f0f74 488d5120 lea rdx,[rcx+20h]
00007ffd`4f8f0f78 488b4208 mov rax,qword ptr [rdx+8]
00007ffd`4f8f0f7c 488b5108 mov rdx,qword ptr [rcx+8]
00007ffd`4f8f0f80 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`4f8f0f84 ff5018 call qword ptr [rax+18h]
00007ffd`4f8f0f87 90 nop
</code></pre></td>
<td style="vertical-align:top;"><pre><code>
00007ffd`39c9d980 StackOverflow.Performance.Delegates.Delegates.ExpressionCompiled()
public void ExpressionCompiled() => _expression.Compiled(_message);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
00007ffd`39c9d984 488d5120 lea rdx,[rcx+20h]
00007ffd`39c9d988 488b4208 mov rax,qword ptr [rdx+8]
00007ffd`39c9d98c 488b5108 mov rdx,qword ptr [rcx+8]
00007ffd`39c9d990 488b4808 mov rcx,qword ptr [rax+8]
00007ffd`39c9d994 ff5018 call qword ptr [rax+18h]
00007ffd`39c9d997 90 nop
</code></pre></td>
</tr>
</tbody>
</table>
It would seem that the only difference between the declared and compiled delegate disassembly is the rcx for declared vs. the rcx+8 for compiled used within their respective first mov operations. I am not yet well-versed in disassembly, so any context around this would be greatly appreciated. At first glance, it would not seem that this would cause the difference/improvement, and if it does, the natively declared delegate should feature it as well (in other words, this would be a bug).
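One plausible reading of the listings, offered as a hedged interpretation rather than a confirmed answer: `rcx` and `rcx+8` look like the positions of the two delegate references inside the `DelegatePair` struct, while the later loads from `[rax+8]` and `[rax+18h]` read the delegate object itself (its target and its method pointer). If that is right, the interesting difference sits behind the `call`, which the diagnoser output does not show. The sketch below is one way to peek at that. It reads the private `_methodPtr` and `_methodPtrAux` fields of `System.Delegate` via reflection; those field names are runtime implementation details assumed here, not a supported API, and the class name `DelegateInternals` is hypothetical.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical diagnostic helper; relies on undocumented runtime internals.
static class DelegateInternals
{
    // Reads a private IntPtr field of System.Delegate by name.
    // Returns null if this runtime does not have a field with that name.
    static IntPtr? ReadPointer(Delegate d, string fieldName)
    {
        FieldInfo field = typeof(Delegate).GetField(
            fieldName, BindingFlags.Instance | BindingFlags.NonPublic);
        return field?.GetValue(d) as IntPtr?;
    }

    static string Format(IntPtr? pointer)
    {
        return pointer.HasValue ? pointer.Value.ToString("x") : "n/a";
    }

    static void Describe(string name, Delegate d)
    {
        string target = d.Target == null ? "null" : d.Target.GetType().FullName;
        Console.WriteLine(name + ": target=" + target
            + ", _methodPtr=" + Format(ReadPointer(d, "_methodPtr"))
            + ", _methodPtrAux=" + Format(ReadPointer(d, "_methodPtrAux")));
    }

    static void Main()
    {
        Func<string, int> declared = x => x.Length;
        Expression<Func<string, int>> expression = x => x.Length;
        Func<string, int> compiled = expression.Compile();

        // Invoke both once before inspecting them.
        declared("warm-up");
        compiled("warm-up");

        Describe("declared", declared);
        Describe("compiled", compiled);
    }
}
```

Attaching a native debugger and disassembling the addresses this prints would show what each `call qword ptr [rax+18h]` actually lands on, which could then be compared between the declared and compiled delegates.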
With all of this stated, the obvious questions to me are:
- Is this a known issue and/or bug?
- Am I doing something entirely off-base here? (Guess this should be the first question. :))
- Is the guidance then to use compiled delegates always wherever possible? As I mentioned earlier, it would seem that the magic that happens in compiled delegates would already be baked into declared delegates, so this is a bit confusing.
For completeness, here is all of the code used in the sample here in its entirety:
sealed class Program
{
    static void Main()
    {
        BenchmarkRunner.Run<Delegates>();
    }
}

[CoreJob, ClrJob, DisassemblyDiagnoser(true, printSource: true)]
public class Delegates
{
    readonly DelegatePair<string, string> _empty;
    readonly DelegatePair<string, int> _expression;
    readonly string _message;

    public Delegates() : this(new DelegatePair<string, string>(_ => default, _ => default),
                              new DelegatePair<string, int>(x => x.Length, x => x.Length)) {}

    public Delegates(DelegatePair<string, string> empty, DelegatePair<string, int> expression,
                     string message = "Hello World!")
    {
        _empty = empty;
        _expression = expression;
        _message = message;
        EmptyDeclared();
        EmptyCompiled();
        ExpressionDeclared();
        ExpressionCompiled();
    }

    [Benchmark]
    public void EmptyDeclared() => _empty.Declared(default);

    [Benchmark]
    public void EmptyCompiled() => _empty.Compiled(default);

    [Benchmark]
    public void ExpressionDeclared() => _expression.Declared(_message);

    [Benchmark]
    public void ExpressionCompiled() => _expression.Compiled(_message);
}

public struct DelegatePair<TFrom, TTo>
{
    DelegatePair(Func<TFrom, TTo> declared, Func<TFrom, TTo> compiled)
    {
        Declared = declared;
        Compiled = compiled;
    }

    public DelegatePair(Func<TFrom, TTo> declared, Expression<Func<TFrom, TTo>> expression) :
        this(declared, expression.Compile()) {}

    public Func<TFrom, TTo> Declared { get; }
    public Func<TFrom, TTo> Compiled { get; }
}
Thank you in advance for any assistance that you can provide!
expression.Compile() returns a delegate that is allocated a more convenient location of memory than the one allocated for declared so that it took lesser time to load that delegate into stack and invoke – Goetz
LambdaExpression.Compile method and the only thing I could find is that there is an extern method call to Delegate.InternalAlloc which returns a MulticastDelegate. There's no way of knowing how that value is stored externally as it is extern, so you might be onto something there. I have never heard of a preferred heap, however. Resources/links around this are welcomed. :) – Promenade