Performance of compiled-to-delegate Expression

I'm generating an expression tree that maps properties from a source object to a destination object, that is then compiled to a Func<TSource, TDestination, TDestination> and executed.

This is the debug view of the resulting LambdaExpression:

.Lambda #Lambda1<System.Func`3[MemberMapper.Benchmarks.Program+ComplexSourceType,MemberMapper.Benchmarks.Program+ComplexDestinationType,MemberMapper.Benchmarks.Program+ComplexDestinationType]>(
    MemberMapper.Benchmarks.Program+ComplexSourceType $right,
    MemberMapper.Benchmarks.Program+ComplexDestinationType $left) {
    .Block(
        MemberMapper.Benchmarks.Program+NestedSourceType $Complex$955332131,
        MemberMapper.Benchmarks.Program+NestedDestinationType $Complex$2105709326) {
        $left.ID = $right.ID;
        $Complex$955332131 = $right.Complex;
        $Complex$2105709326 = .New MemberMapper.Benchmarks.Program+NestedDestinationType();
        $Complex$2105709326.ID = $Complex$955332131.ID;
        $Complex$2105709326.Name = $Complex$955332131.Name;
        $left.Complex = $Complex$2105709326;
        $left
    }
}

Cleaned up it would be:

(right, left) =>
{
    left.ID = right.ID;
    var complexSource = right.Complex;
    var complexDestination = new NestedDestinationType();
    complexDestination.ID = complexSource.ID;
    complexDestination.Name = complexSource.Name;
    left.Complex = complexDestination;
    return left;
}

That's the code that maps the properties on these types:

public class NestedSourceType
{
  public int ID { get; set; }
  public string Name { get; set; }
}

public class ComplexSourceType
{
  public int ID { get; set; }
  public NestedSourceType Complex { get; set; }
}

public class NestedDestinationType
{
  public int ID { get; set; }
  public string Name { get; set; }
}

public class ComplexDestinationType
{
  public int ID { get; set; }
  public NestedDestinationType Complex { get; set; }
}

The manual code to do this is:

var destination = new ComplexDestinationType
{
  ID = source.ID,
  Complex = new NestedDestinationType
  {
    ID = source.Complex.ID,
    Name = source.Complex.Name
  }
};

The problem is that when I compile the LambdaExpression and benchmark the resulting delegate, it is about 10x slower than the manual version. I have no idea why. And the whole point of this is maximum performance without the tedium of manual mapping.

When I take code by Bart de Smet from his blog post on this topic and benchmark the manual version of calculating prime numbers versus the compiled expression tree, they are completely identical in performance.

What can cause this huge difference when the debug view of the LambdaExpression looks like what you would expect?
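
For reference, the kind of build-and-compile step being benchmarked can be sketched with the Expression API like this (a minimal single-property map, not the actual MemberMapper code; the type names here are simplified stand-ins):

```csharp
using System;
using System.Linq.Expressions;

public class SourceType      { public int ID { get; set; } }
public class DestinationType { public int ID { get; set; } }

public static class SketchProgram
{
    public static void Main()
    {
        var source = Expression.Parameter(typeof(SourceType), "source");
        var dest   = Expression.Parameter(typeof(DestinationType), "dest");

        // Body: dest.ID = source.ID; return dest;
        var body = Expression.Block(
            Expression.Assign(
                Expression.Property(dest, "ID"),
                Expression.Property(source, "ID")),
            dest);

        var lambda = Expression.Lambda<Func<SourceType, DestinationType, DestinationType>>(
            body, source, dest);

        // Compile to a delegate, as in the question.
        var map = lambda.Compile();

        var result = map(new SourceType { ID = 5 }, new DestinationType());
        Console.WriteLine(result.ID); // 5
    }
}
```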

EDIT

As requested I added the benchmark I used:

public static ComplexDestinationType Foo;

static void Benchmark()
{

  var mapper = new DefaultMemberMapper();

  var map = mapper.CreateMap(typeof(ComplexSourceType),
                             typeof(ComplexDestinationType)).FinalizeMap();

  var source = new ComplexSourceType
  {
    ID = 5,
    Complex = new NestedSourceType
    {
      ID = 10,
      Name = "test"
    }
  };

  var sw = Stopwatch.StartNew();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = new ComplexDestinationType
    {
      ID = source.ID + i,
      Complex = new NestedDestinationType
      {
        ID = source.Complex.ID + i,
        Name = source.Complex.Name
      }
    };
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);

  sw.Restart();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = mapper.Map<ComplexSourceType, ComplexDestinationType>(source);
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);

  var func = (Func<ComplexSourceType, ComplexDestinationType, ComplexDestinationType>)
             map.MappingFunction;

  var destination = new ComplexDestinationType();

  sw.Restart();

  for (int i = 0; i < 1000000; i++)
  {
    Foo = func(source, new ComplexDestinationType());
  }

  sw.Stop();

  Console.WriteLine(sw.Elapsed);
}

The second one is understandably slower than doing it manually, as it involves a dictionary lookup and a few object instantiations. But the third one should be just as fast: it invokes the raw delegate, and the cast from Delegate to Func happens outside the loop.

I tried wrapping the manual code in a function as well, but I recall that it didn't make a noticeable difference. Either way, a function call shouldn't add an order of magnitude of overhead.

I also do the benchmark twice to make sure the JIT isn't interfering.

EDIT

You can get the code for this project here:

https://github.com/JulianR/MemberMapper/

I used the SOS (Son of Strike) debugger extension as described in that blog post by Bart de Smet to dump the generated IL of the dynamic method:

IL_0000: ldarg.2 
IL_0001: ldarg.1 
IL_0002: callvirt 6000003 ComplexSourceType.get_ID()
IL_0007: callvirt 6000004 ComplexDestinationType.set_ID(Int32)
IL_000c: ldarg.1 
IL_000d: callvirt 6000005 ComplexSourceType.get_Complex()
IL_0012: brfalse IL_0043
IL_0017: ldarg.1 
IL_0018: callvirt 6000006 ComplexSourceType.get_Complex()
IL_001d: stloc.0 
IL_001e: newobj 6000007 NestedDestinationType..ctor()
IL_0023: stloc.1 
IL_0024: ldloc.1 
IL_0025: ldloc.0 
IL_0026: callvirt 6000008 NestedSourceType.get_ID()
IL_002b: callvirt 6000009 NestedDestinationType.set_ID(Int32)
IL_0030: ldloc.1 
IL_0031: ldloc.0 
IL_0032: callvirt 600000a NestedSourceType.get_Name()
IL_0037: callvirt 600000b NestedDestinationType.set_Name(System.String)
IL_003c: ldarg.2 
IL_003d: ldloc.1 
IL_003e: callvirt 600000c ComplexDestinationType.set_Complex(NestedDestinationType)
IL_0043: ldarg.2 
IL_0044: ret 

I'm no expert at IL, but this seems pretty straightforward and exactly what you would expect, no? Then why is it so slow? No weird boxing operations, no hidden instantiations, nothing. It's not exactly the same as the expression tree above, as there's now also a null check on right.Complex.

This is the code for the manual version (obtained through Reflector):

L_0000: ldarg.1 
L_0001: ldarg.0 
L_0002: callvirt instance int32 ComplexSourceType::get_ID()
L_0007: callvirt instance void ComplexDestinationType::set_ID(int32)
L_000c: ldarg.0 
L_000d: callvirt instance class NestedSourceType ComplexSourceType::get_Complex()
L_0012: brfalse.s L_0040
L_0014: ldarg.0 
L_0015: callvirt instance class NestedSourceType ComplexSourceType::get_Complex()
L_001a: stloc.0 
L_001b: newobj instance void NestedDestinationType::.ctor()
L_0020: stloc.1 
L_0021: ldloc.1 
L_0022: ldloc.0 
L_0023: callvirt instance int32 NestedSourceType::get_ID()
L_0028: callvirt instance void NestedDestinationType::set_ID(int32)
L_002d: ldloc.1 
L_002e: ldloc.0 
L_002f: callvirt instance string NestedSourceType::get_Name()
L_0034: callvirt instance void NestedDestinationType::set_Name(string)
L_0039: ldarg.1 
L_003a: ldloc.1 
L_003b: callvirt instance void ComplexDestinationType::set_Complex(class NestedDestinationType)
L_0040: ldarg.1 
L_0041: ret 

Looks identical to me..

EDIT

I followed the link in Michael B's answer about this topic. I tried implementing the trick in the accepted answer and it worked! To summarize the trick: it creates a dynamic assembly and compiles the expression tree into a static method in that assembly, and for some reason that's 10x faster. A downside is that my benchmark classes were internal (actually, public classes nested in an internal one), and it threw an exception when I tried to access them because they weren't accessible. There doesn't seem to be a workaround for that, but I can simply detect whether the referenced types are internal or not and decide which approach to compilation to use.
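
A minimal sketch of that trick (not the MemberMapper implementation; LambdaExpression.CompileToMethod and AppDomain.DefineDynamicAssembly exist only on the .NET Framework, and the expression may only reference public types, which is the limitation mentioned above):

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;
using System.Reflection.Emit;

public static class StaticCompiler
{
    // Compiles the lambda into a static method on a type in a dynamic
    // assembly, instead of the anonymous DynamicMethod that Compile() uses.
    public static TDelegate CompileToStaticMethod<TDelegate>(Expression<TDelegate> expression)
    {
        var assembly = AppDomain.CurrentDomain.DefineDynamicAssembly(
            new AssemblyName("MapperAssembly"), AssemblyBuilderAccess.Run);
        var module = assembly.DefineDynamicModule("MapperModule");
        var type = module.DefineType("Mapper", TypeAttributes.Public);
        var method = type.DefineMethod(
            "Map", MethodAttributes.Public | MethodAttributes.Static);

        // CompileToMethod sets the method's signature from the lambda.
        expression.CompileToMethod(method);

        var baked = type.CreateType();
        return (TDelegate)(object)Delegate.CreateDelegate(
            typeof(TDelegate), baked.GetMethod("Map"));
    }

    public static void Main()
    {
        Expression<Func<int, int>> doubler = x => x * 2;
        var f = CompileToStaticMethod(doubler);
        Console.WriteLine(f(21)); // 42
    }
}
```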

What still bugs me though is why that prime numbers method is identical in performance to the compiled expression tree.

And again, I welcome anyone to run the code at that GitHub repository to confirm my measurements and to make sure I'm not crazy :)

Tutor answered 19/2, 2011 at 19:30 Comment(11)
Which are the exact areas you are benchmarking, and how are you benchmarking them?Unlookedfor
would need to see the full usage, I think. For example, how are you invoking the delegate? (that matters lots)Bronwyn
Did you wrap the manual code in a delegate and call it the same way as your generated code?Extremity
Why does your code create a map variable but never use it? It only uses mapper.Map.Pipe
When I run it on my machine (.NET 4), both are comparable. There must be something behind the scenes that slows it down.Ywis
@Pipe - It does use it, but you have to scroll to the right to see it. It uses map.MappingFunction (which is of type Delegate) and casts it to a Func to test the raw generated delegate.Tutor
@Euphoric - How did you run the code if I may ask, considering you don't have the source code (presumably, as it is on GitHub actually)? Which code did you run? I'm beginning to think it's some sort of (un)boxing operation that I'm missing, but I would need to see the IL of the delegate for that. Is that possible?Tutor
I've found how to dump the IL for this, see my edit. It seems fine though, which wasn't what I was hoping for..Tutor
Is there a performance difference btw running in release mode or debug mode (or with debugger connected vs not connected)?Kolinsky
@Marc Gravell, once I have an Action, what can I do besides Invoke() that is faster?Hiawatha
@Hiawatha when I made that comment, there was no code... I've seen people use DynamicInvoke before and expect it to be fast; this doesn't apply in your case.Bronwyn

This is pretty strange for such a huge overhead. There are a few things to take into account. First, the VS-compiled code has different attributes applied to it that might influence the JIT to optimize differently.

Are you including the first execution of the compiled delegate in these results? You shouldn't; ignore the first execution of either code path. You should also turn the normal code into a delegate, as delegate invocation is slightly slower than invoking an instance method, which in turn is slower than invoking a static method.

As for other differences, there is the fact that the compiled delegate has a closure object. It isn't used here, but it means this is a targeted delegate, which might perform a bit slower. You'll notice the compiled delegate has a target object, and all the arguments are shifted down by one.

Also, methods generated by LCG (lightweight code generation) are considered static, and those tend to be slower when bound to delegates than instance methods because of register-switching business. (Duffy said that the "this" pointer has a reserved register in the CLR, and when you have a delegate for a static method it has to be shifted to a different register, incurring a slight overhead.) Finally, code generated at runtime seems to run slightly slower than code generated by VS. Code generated at runtime seems to have extra sandboxing and is launched from a different assembly (try using something like the ldftn or calli opcode if you don't believe me; those Reflection.Emit-ed delegates will compile but won't let you actually execute them), which incurs a minimal overhead.
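
The closure-target point can be observed directly. A small probe (assuming .NET-Framework-era Compile behavior; a live reference-type constant can't be embedded in IL, which forces the compiler to hoist it into a closure bound as the delegate's target):

```csharp
using System;
using System.Linq.Expressions;

public static class ClosureProbe
{
    public static void Main()
    {
        // An object constant forces a closure in the compiled delegate.
        var tag = new object();
        var x = Expression.Parameter(typeof(object), "x");
        var lambda = Expression.Lambda<Func<object, object>>(
            Expression.Constant(tag), x);

        var del = lambda.Compile();

        // Typically prints True (the delegate has a closure target) and a
        // parameter count that includes the hidden closure argument ahead of "x".
        Console.WriteLine(del.Target != null);
        Console.WriteLine(del.Method.GetParameters().Length);
    }
}
```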

Also, you are running in release mode, right? There was a similar topic where we looked at this problem here: Why is Func<> created from Expression<Func<>> slower than Func<> declared directly?

Edit: Also see my answer here: DynamicMethod is much slower than compiled IL function

The main takeaway is that you should add the following code to the assembly where you plan to create and invoke run-time generated code.

[assembly: AllowPartiallyTrustedCallers]
[assembly: SecurityTransparent]
[assembly: SecurityRules(SecurityRuleSet.Level2,SkipVerificationInFullTrust=true)]

And to always use a built-in delegate type or one from an assembly with those flags.

The reason is that anonymous dynamic code is hosted in an assembly that is always marked as partial trust. By allowing partially trusted callers, you can skip part of the handshake. The transparency means that your code is not going to raise the security level (i.e. slow behavior), and finally the real trick is to invoke a delegate type hosted in an assembly that is marked as skip-verification. Func<int,int>#Invoke is fully trusted, so no verification is needed. This will give you the performance of code generated by the VS compiler. By not using these attributes you are looking at an overhead in .NET 4. You might think that SecurityRuleSet.Level1 would be a good way to avoid this overhead, but switching security models is also expensive.

In short: add those attributes, and then your micro-loop performance test will run about the same.

Hotheaded answered 1/3, 2011 at 21:23 Comment(1)
Thanks for your answer. I run the benchmark twice to rule out the JIT overhead. The thing that weirds me out the most is that the pretty complex prime numbers expression-tree from that blog post is identical in performance to the hand-written one when compiled. I followed the link in your answer, it was very helpful, see the edit of my question :)Tutor

It sounds like you're running into invocation overhead. Regardless of the source, though, if your method runs faster when loaded from a compiled assembly, simply compile it into an assembly and load it! See my answer at Why is Func<> created from Expression<Func<>> slower than Func<> declared directly? for more details on how.

Pipe answered 4/3, 2011 at 13:54 Comment(3)
Yes, that's the compromise I've settled on. It doesn't work on non-public types or generic types though (a generic type under the hood uses System.__Canon which is internal), which is a downside, but I simply detect those types and use the slower version of compilation. And I could accept that I'm running into some sort of overhead with simply calling Compile on the expression, if it weren't for that prime number function that's equally fast. And sorry but I'm gonna award the bounty to Michael B because I found the answer a little sooner through him, but thanks :)Tutor
@JulianR: You're not calling Compile each time you run the expression, are you?Pipe
No :) It would be much slower than it is.Tutor

You may compile an expression tree manually via Reflection.Emit. It will generally provide much faster compilation (in my case below, ~30 times faster) and will allow you to tune the performance of the emitted result. And it's not so hard to do, especially if your expressions are a limited, known subset.

The idea is to use an ExpressionVisitor to traverse the expression and emit the IL for the corresponding expression type. It's also "quite" simple to write your own visitor to handle a known subset of expressions and fall back to the normal Expression.Compile for expression types not yet supported.
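
A toy illustration of the emit side of this approach (a plain recursive walk rather than a full ExpressionVisitor; only int parameters, int constants, and Add are handled, and everything else throws where a real implementation would fall back to Compile):

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection.Emit;

public static class TinyEmitter
{
    public static Func<int, int> Emit(Expression<Func<int, int>> lambda)
    {
        // Emit directly into a DynamicMethod, bypassing Expression.Compile.
        var dm = new DynamicMethod("emitted", typeof(int), new[] { typeof(int) });
        var il = dm.GetILGenerator();
        EmitNode(lambda.Body, lambda.Parameters[0], il);
        il.Emit(OpCodes.Ret);
        return (Func<int, int>)dm.CreateDelegate(typeof(Func<int, int>));
    }

    static void EmitNode(Expression node, ParameterExpression param, ILGenerator il)
    {
        switch (node)
        {
            case ConstantExpression c when c.Type == typeof(int):
                il.Emit(OpCodes.Ldc_I4, (int)c.Value);
                break;
            case ParameterExpression p when p == param:
                il.Emit(OpCodes.Ldarg_0);
                break;
            case BinaryExpression b when b.NodeType == ExpressionType.Add:
                EmitNode(b.Left, param, il);
                EmitNode(b.Right, param, il);
                il.Emit(OpCodes.Add);
                break;
            default:
                // A real implementation would fall back to Expression.Compile here.
                throw new NotSupportedException(node.NodeType.ToString());
        }
    }

    public static void Main()
    {
        Expression<Func<int, int>> e = x => x + 40;
        var f = Emit(e);
        Console.WriteLine(f(2)); // 42
    }
}
```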

In my case I am generating the delegate:

Func<object[], object> createA = state =>
    new A(
        new B(), 
        (string)state[11], 
        new ID[2] { new D1(), new D2() }) { 
        Prop = new P(new B()), Bop = new B() 
    };

The test creates the corresponding expression tree and compares its Expression.Compile against visiting it, emitting the IL, and then creating a delegate from a DynamicMethod.

The results:

Compile Expression 3000 times: 814
Invoke Compiled Expression 5000000 times: 724
Emit from Expression 3000 times: 36
Run Emitted Expression 5000000 times: 722

36 vs 814 when compiling manually.

Here is the full code.

Obligor answered 24/11, 2015 at 11:41 Comment(0)

Check these links to see what happens when you compile your LambdaExpression (and yes, it is done using Reflection)

  1. http://msdn.microsoft.com/en-us/magazine/cc163759.aspx#S3
  2. http://blogs.msdn.com/b/ericgu/archive/2004/03/19/92911.aspx
Mentality answered 4/3, 2011 at 12:43 Comment(1)
Interesting reads, thanks. But I'm not sure what you mean by "and yes, it is done using reflection". I know the compilation process uses type metadata somehow, but I'm not measuring the overhead of that, I'm measuring the result, which is just plain IL as you can see in my question.Tutor

I think that's the impact of reflection here. The second method uses reflection to get and set the values. As far as I can see, it's not the delegate but the reflection that costs the time.

As for the third solution: lambda expressions also need to be evaluated at runtime, which costs time as well. And that's not a little...

So you'll never get the second and third solutions as fast as the manual copying.

Have a look at my code samples here. I think that is probably the fastest solution you can take if you don't want manual coding: http://jachman.wordpress.com/2006/08/22/2000-faster-using-dynamic-method-calls/

Scamander answered 19/2, 2011 at 21:32 Comment(3)
But I'm not using reflection when invoking the delegate. The expression tree is built using reflection, but it's compiled to a delegate which should be JIT compiled to produce fast code. I could've accepted that the JIT compiler doesn't spend much time on optimizing it or something, but that much more complex prime numbers code using expressions from Bart de Smet is just as fast as the normal version. So it can be just as fast, but why isn't mine?Tutor
Sure, not at the third solution. But at that point, the JIT compiler has to evaluate the lambda expression. As you already pointed out, that is the overhead of evaluating the expression tree. It's really that much. I already implemented the IQueryable interface for some other object-mapping issues, and there is a truly unbelievable number of calls that you don't see when calling the lambda from your code.Scamander
No, the third benchmark uses the compiled delegate. Besides, the overhead is 'only' 10x, which would be much, much more if it was pure reflection. For example, the AutoMapper library which does use reflection for its mapping I believe, is 400x slower than manual mapping from my tests.Tutor
