Curiosity: Why does Expression<...> when compiled run faster than a minimal DynamicMethod?

Asked 18/8, 2009 at 21:39 Answered 18/8, 2009 at 23:17

Solved c#profiling reflection.emit expression dynamicmethod

I'm currently doing some last-measure optimizations, mostly for fun and learning, and discovered something that left me with a couple of questions.

First, the questions:

When I construct a method in-memory through the use of DynamicMethod, and use the debugger, is there any way for me to step into the generated assembly code, when vieweing the code in the disassembler view? The debugger seems to just step over the whole method for me
Or, if that's not possible, is it possible for me to somehow save the generated IL code to disk as an assembly, so that I can inspect it with Reflector?
Why does the Expression<...> version of my simple addition method (Int32+Int32 => Int32) run faster than a minimal DynamicMethod version?

Here's a short and complete program that demonstrates. On my system, the output is:

DynamicMethod: 887 ms
Lambda: 1878 ms
Method: 1969 ms
Expression: 681 ms

I expected the lambda and method calls to have higher values, but the DynamicMethod version is consistently about 30-50% slower (variations probably due to Windows and other programs). Anyone know the reason?

Here's the program:

using System;
using System.Linq.Expressions;
using System.Reflection.Emit;
using System.Diagnostics;

namespace Sandbox
{
    public class Program
    {
        public static void Main(String[] args)
        {
            DynamicMethod method = new DynamicMethod("TestMethod",
                typeof(Int32), new Type[] { typeof(Int32), typeof(Int32) });
            var il = method.GetILGenerator();

            il.Emit(OpCodes.Ldarg_0);
            il.Emit(OpCodes.Ldarg_1);
            il.Emit(OpCodes.Add);
            il.Emit(OpCodes.Ret);

            Func<Int32, Int32, Int32> f1 =
                (Func<Int32, Int32, Int32>)method.CreateDelegate(
                    typeof(Func<Int32, Int32, Int32>));
            Func<Int32, Int32, Int32> f2 = (Int32 a, Int32 b) => a + b;
            Func<Int32, Int32, Int32> f3 = Sum;
            Expression<Func<Int32, Int32, Int32>> f4x = (a, b) => a + b;
            Func<Int32, Int32, Int32> f4 = f4x.Compile();
            for (Int32 pass = 1; pass <= 2; pass++)
            {
                // Pass 1 just runs all the code without writing out anything
                // to avoid JIT overhead influencing the results
                Time(f1, "DynamicMethod", pass);
                Time(f2, "Lambda", pass);
                Time(f3, "Method", pass);
                Time(f4, "Expression", pass);
            }
        }

        private static void Time(Func<Int32, Int32, Int32> fn,
            String name, Int32 pass)
        {
            Stopwatch sw = new Stopwatch();
            sw.Start();
            for (Int32 index = 0; index <= 100000000; index++)
            {
                Int32 result = fn(index, 1);
            }
            sw.Stop();
            if (pass == 2)
                Debug.WriteLine(name + ": " + sw.ElapsedMilliseconds + " ms");
        }

        private static Int32 Sum(Int32 a, Int32 b)
        {
            return a + b;
        }
    }
}

Caespitose answered 18/8, 2009 at 21:39 Comment(6)

Really good question. First, for this type of profiling, I'd use a release/Console - so the Debug.WriteLine looks out of place; but even with Console.WriteLine my stats are similar: DynamicMethod: 630 ms Lambda: 561 ms Method: 553 ms Expression: 360 ms I'm still looking... – Northampton 18/8, 2009 at 21:49

Interesting question. These kinds of things can be solved using WinDebug and SOS. I posted a step by step of a similar analysis I did many moons ago in my blog, blog.barrkel.com/2006/05/clr-tailcall-optimization-or-lack.html – Constitute 18/8, 2009 at 23:25

I thought I should ping you - I found out how to force JIT without having to call the method once. Use the restrictedSkipVisibility DynamicMethod constructor argument. Depending on context (code security), it might not be available though. – Constitute 19/8, 2009 at 13:43

Good questions! Regarding #1, I don't think that DynamicMethods are debuggable (in opposite to emitting an assembly). However you can dumb and analyze the DynamicMethod's body. I use ILVisualizer - fancy and convenient enough. Regards, Vadim – Tinfoil 16/8, 2011 at 12:56

#11024493 Tony THONG – Cathedral 7/7, 2014 at 10:9

Also wouldn't not making use of result in Int32 result = fn(index, 1); affects the result? Dead code optimization should kick in right? I tried printing result as well, and I get DynamicMethod: ~1000 ms Lambda: ~400 ms Method: ~1000 ms Expression: ~400 ms. For me, results of (DynamicMethod and Method) and (Lambda and Expression) always aligned, unlike results here which align (DynamicMethod and Expression) and (Lambda and Method). Strange.. using C# 6.0, .NET 4.6, Any CPU on a 64 bit machine. – Grendel 1/7, 2016 at 9:25

The method created via DynamicMethod goes through two thunks, while the method created via Expression<> doesn't go through any.

Here's how it works. Here's the calling sequence for invoking fn(0, 1) in the Time method (I hard-coded the arguments to 0 and 1 for ease of debugging):

00cc032c 6a01            push    1           // 1 argument
00cc032e 8bcf            mov     ecx,edi
00cc0330 33d2            xor     edx,edx     // 0 argument
00cc0332 8b410c          mov     eax,dword ptr [ecx+0Ch]
00cc0335 8b4904          mov     ecx,dword ptr [ecx+4]
00cc0338 ffd0            call    eax // 1 arg on stack, two in edx, ecx

For the first invocation I investigated, DynamicMethod, the call eax line comes up like so:

00cc0338 ffd0            call    eax {003c2084}
0:000> !u 003c2084
Unmanaged code
003c2084 51              push    ecx
003c2085 8bca            mov     ecx,edx
003c2087 8b542408        mov     edx,dword ptr [esp+8]
003c208b 8b442404        mov     eax,dword ptr [esp+4]
003c208f 89442408        mov     dword ptr [esp+8],eax
003c2093 58              pop     eax
003c2094 83c404          add     esp,4
003c2097 83c010          add     eax,10h
003c209a ff20            jmp     dword ptr [eax]

This appears to be doing some stack swizzling to rearrange arguments. I speculate that it's owing to the difference between delegates that use the implicit 'this' argument and those that don't.

That jump at the end resolves like so:

003c209a ff20            jmp     dword ptr [eax]      ds:0023:012f7edc=0098c098
0098c098 e963403500      jmp     00ce0100

The remainder of the code at 0098c098 looks like a JIT thunk, whose start got rewritten with a jmp after the JIT. It's only after this jump that we get to real code:

0:000> !u eip
Normal JIT generated code
DynamicClass.TestMethod(Int32, Int32)
Begin 00ce0100, size 5
>>> 00ce0100 03ca            add     ecx,edx
00ce0102 8bc1            mov     eax,ecx
00ce0104 c3              ret

The invocation sequence for the method created via Expression<> is different - it's missing the stack swizzling code. Here it is, from the first jump via eax:

00cc0338 ffd0            call    eax {00ce00a8}

0:000> !u eip
Normal JIT generated code
DynamicClass.lambda_method(System.Runtime.CompilerServices.ExecutionScope, Int32, Int32)
Begin 00ce00a8, size b
>>> 00ce00a8 8b442404        mov     eax,dword ptr [esp+4]
00ce00ac 03d0            add     edx,eax
00ce00ae 8bc2            mov     eax,edx
00ce00b0 c20400          ret     4

Now, how did things get like this?

Stack swizzling wasn't necessary (the implicit first argument from the delegate is actually used, i.e. not like a delegate bound to a static method)
The JIT must have been forced by LINQ compilation logic so that the delegate held the real destination address rather than a fake one.

I don't know how the LINQ forced the JIT, but I do know how to force a JIT myself - by calling the function at least once. UPDATE: I found another way to force a JIT: use the restrictedSkipVisibility argumetn to the constructor and pass true. So, here's modified code that eliminates stack swizzling by using the implicit 'this' parameter, and uses the alternate constructor to pre-compile so that the bound address is the real address, rather than the thunk:

using System;
using System.Linq.Expressions;
using System.Reflection.Emit;
using System.Diagnostics;

namespace Sandbox
{
    public class Program
    {
        public static void Main(String[] args)
        {
            DynamicMethod method = new DynamicMethod("TestMethod",
                typeof(Int32), new Type[] { typeof(object), typeof(Int32),
                typeof(Int32) }, true);
            var il = method.GetILGenerator();

            il.Emit(OpCodes.Ldarg_1);
            il.Emit(OpCodes.Ldarg_2);
            il.Emit(OpCodes.Add);
            il.Emit(OpCodes.Ret);

            Func<Int32, Int32, Int32> f1 =
                (Func<Int32, Int32, Int32>)method.CreateDelegate(
                    typeof(Func<Int32, Int32, Int32>), null);
            Func<Int32, Int32, Int32> f2 = (Int32 a, Int32 b) => a + b;
            Func<Int32, Int32, Int32> f3 = Sum;
            Expression<Func<Int32, Int32, Int32>> f4x = (a, b) => a + b;
            Func<Int32, Int32, Int32> f4 = f4x.Compile();
            for (Int32 pass = 1; pass <= 2; pass++)
            {
                // Pass 1 just runs all the code without writing out anything
                // to avoid JIT overhead influencing the results
                Time(f1, "DynamicMethod", pass);
                Time(f2, "Lambda", pass);
                Time(f3, "Method", pass);
                Time(f4, "Expression", pass);
            }
        }

        private static void Time(Func<Int32, Int32, Int32> fn,
            String name, Int32 pass)
        {
            Stopwatch sw = new Stopwatch();
            sw.Start();
            for (Int32 index = 0; index <= 100000000; index++)
            {
                Int32 result = fn(index, 1);
            }
            sw.Stop();
            if (pass == 2)
                Console.WriteLine(name + ": " + sw.ElapsedMilliseconds + " ms");
        }

        private static Int32 Sum(Int32 a, Int32 b)
        {
            return a + b;
        }
    }
}

Here's the runtimes on my system:

DynamicMethod: 312 ms
Lambda: 417 ms
Method: 417 ms
Expression: 312 ms

UPDATED TO ADD:

I tried running this code on my new system, which is an Core i7 920 running Windows 7 x64 with .NET 4 beta 2 installed (mscoree.dll ver. 4.0.30902), and the results are, well, variable.

csc 3.5, /platform:x86, runtime v2.0.50727 (via .config)

Run #1
DynamicMethod: 214 ms
Lambda: 571 ms
Method: 570 ms
Expression: 249 ms

Run #2
DynamicMethod: 463 ms
Lambda: 392 ms
Method: 392 ms
Expression: 463 ms

Run #3
DynamicMethod: 463 ms
Lambda: 570 ms
Method: 570 ms
Expression: 463 ms

Perhaps this is Intel SpeedStep affecting results, or possibly Turbo Boost. In any case, it's very annoying.

csc 3.5, /platform:x64, runtime v2.0.50727 (via .config)
DynamicMethod: 428 ms
Lambda: 392 ms
Method: 392 ms
Expression: 428 ms

csc 3.5, /platform:x64, runtime v4
DynamicMethod: 428 ms
Lambda: 356 ms
Method: 356 ms
Expression: 428 ms

csc 4, /platform:x64, runtime v4
DynamicMethod: 428 ms
Lambda: 356 ms
Method: 356 ms
Expression: 428 ms

csc 4, /platform:x86, runtime v4
DynamicMethod: 463 ms
Lambda: 570 ms
Method: 570 ms
Expression: 463 ms

csc 3.5, /platform:x86, runtime v4
DynamicMethod: 214 ms
Lambda: 570 ms
Method: 571 ms
Expression: 249 ms

Many of these results will be accidents of timing, whatever it is that is causing the random speedups in the C# 3.5 / runtime v2.0 scenario. I'll have to reboot to see if SpeedStep or Turbo Boost is responsible for these effects.

Constitute answered 18/8, 2009 at 23:17 Comment(10)

So that means I need to add a way to invoke my method safely, just to get that performance boost? I can certainly do that. – Caespitose 19/8, 2009 at 8:24

What I mean is... the methods I'm creating won't actually be summing two numbers, but be responsible for constructing and resolving services in an IoC implementation. In this case, I don't really want the full method to execute and construct a service, just to get that minor performance boost. Seeing as some services will be used a lot, and the actual service is miniscule and lightweight, I am putting some effort into the actual resolution code as well. Besides, it's a fun learning project for reflection.emit. Really appreciate the work you put into your answer! – Caespitose 19/8, 2009 at 8:27

Nice article! It is a "known fact" that using a DynamicMethod to be invoked "statically" generates this additional thunking. – Allgood 2/6, 2010 at 14:5

Btw I have thought about it and I believe the extra thunk for the DynamicMethod is this: The thunk's result is to decrease the size of the activation record above the dynamic call by 1 word. The thunk also results in a removal of the implicit this pointer in the function call. You'll notice that after the thunk, there are no calls, just jmp instructions. So any stack walks that occur in the dynamic method will see the current activation record as if it were a static method, even though it had been called with a this pointer. – Nagari 11/7, 2012 at 9:20

Why does the CLR always treat dynamic methods as nonstatic at JIT-time, but then insert a thunk depending on the type of delegate created from the CreateDelegate call? I am pretty certain that this is for memory management. The memory which stores the delegate object itself, the binding between the dynamic method and an object (and its delegate type), and also the emitted IL (I think) are all stored on the garbage collected heap. This is managed memory. The JITed code is stored in the processes working set, but is not managed by the garbage collector. – Nagari 11/7, 2012 at 9:33

So rather than re JIT a method every time CreateDelegate is called, the CLR just JITs a DynamicMethod once. Calls to CreateDelegate sometimes create a static delegate, and sometimes create an instance delegate. However, the JIT cannot possibly know which one will be used at the time it compiles the DynamicMethod, unless it compiled it once for every call to CreateDelegate!! So there are conflicting goals: 1. JIT as little code as possible, and 2. Allow a single DynamicMethod to be both static and instance. – Nagari 11/7, 2012 at 9:39

The only way I can imagine to accomplish both of these two goals is to insert the thunk that the CLR inserts. I tried to think of creative ways to lay out the DynamicMethod... Maybe have two entrypoints into the JITed code, and just pretend the thunk to the beginning of the DynamicMethod. That way a call to CreateDelegate on a given DynamicMethod would always use the same JITed function, but when static would add an offset equal to the size of the Thunk to the function pointer in the delegate created. But then the thunks might be repeated in a lot of places (waste of memory...) – Nagari 11/7, 2012 at 9:46

Alternatively the JIT could insert the thunks as they are now, but only do so for instance type delegates. Apparently the CLR can't do that though. My guess is that there is special code for the implementation of DynamicMethod that created native code in reclaimable address space, and for whatever reason it is either not possible for a delegate to carry a function pointer to code in that address space (CLR won't allow for security), or it is not possible for the JIT to emit a call to the address space used by DynamicMethod (maybe the doesn't know or the special code is a function). – Nagari 11/7, 2012 at 9:55

Hell, maybe the native code from DynamicMethod is stored in garbage collected heap space after all, somehow. – Nagari 11/7, 2012 at 9:55

Dead end in SSCLI. The internal call used by CreateDelegate does not exist in Rotor. – Nagari 11/7, 2012 at 10:17

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags