Is the 32-byte IL limit for inline functions too small?
I have a very small C# method marked for inlining, but it is not inlined. I noticed that the longer overload generates more than 32 bytes of IL code. Isn't the 32-byte limit too short?

// inlined
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool INL_IsInRange (this byte pValue, byte pMin) {
  return pValue >= pMin;
}

// NOT inlined
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool INL_IsInRange (this byte pValue, byte pMin, byte pMax) {
  return pValue >= pMin && pValue <= pMax;
}

Is it possible to change that limit?

Marentic answered 11/12, 2016 at 22:37 Comment(7)
"Dont work" is not very helpful. Please provide details about what happens. If there is an exception, please provide the exception details. – Stilt
@DeanOC: Based on the title and content of the question, I assume the problem is that the function isn't inlined. – Stirling
I mean it does not run as an inline function: the first function runs with inlined code, but the second one runs as a call. I assumed this is a question about the 32-byte limit the JIT uses as a restriction in the heuristic that determines whether to inline. Am I correct? Is this the problem? – Marentic
@MichaelLiu I gave up assuming what posters mean a long time ago. I go down fewer rabbit holes that way. ;) – Stilt
On my machine, neither method generates more than 13 bytes of IL code, so that's not the problem. As I understand it, AggressiveInlining does nothing here, since it only bypasses the size limit, not any other criteria for not inlining a function. It seems more likely that the branch in the second function's IL plays a role. But this is pure speculation; you'd have to look at the code of the jitter to tell for sure. Note that there are big differences between the 32-bit and the 64-bit jitter, in all versions. – Royal
Thank you all. True, the IL code is less than 32 bytes in both (maybe because of tiredness, I counted the assembly bytes instead... sorry :-). I have not found a definitive answer, nor will I look for one for now... but I have found that the JIT inlines a function (or not) based on other factors related to the parent function that is calling it: apparently, when called from long functions (like the functions of "TestOfClassX"), these methods are not inlined, but when called from shorter functions they are! If anyone knows about these kinds of limitations, I would be grateful if they would tell me. – Marentic
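Following up on the IL-size discussion in the comments, here is a minimal, self-contained sketch (the class and method names are my own, mirroring the question's code) that prints the actual IL body size of each overload via reflection, so the "32 bytes of IL" figure can be checked directly:

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

static class RangeExtensions
{
    // Same shape as the question's one-argument overload.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool INL_IsInRange(this byte pValue, byte pMin)
        => pValue >= pMin;

    // Same shape as the question's two-argument overload.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool INL_IsInRange(this byte pValue, byte pMin, byte pMax)
        => pValue >= pMin && pValue <= pMax;
}

static class Program
{
    static void Main()
    {
        // GetILAsByteArray() returns the raw IL body; its length is the
        // number the 32-byte inlining heuristic looks at.
        foreach (MethodInfo m in typeof(RangeExtensions)
                     .GetMethods(BindingFlags.Public | BindingFlags.Static))
        {
            int ilBytes = m.GetMethodBody().GetILAsByteArray().Length;
            Console.WriteLine($"{m.Name}/{m.GetParameters().Length}: {ilBytes} bytes of IL");
        }
    }
}
```

On a typical build, both overloads come in well under 32 bytes, which matches Royal's observation in the comments.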
I am also looking for the inlining criteria. In your case, I believe that JIT optimization timed out before it could reach the decision to inline your second function. Inlining is not a priority for the JIT, so it was busy analyzing your long code. However, if you place your calls inside tight loops, the JIT will probably inline them, as inner calls gain inlining priority. If you really care about this kind of micro-optimization, it's time to switch to C++. It's a brave new world out there for you to explore and exploit!

I noticed that the question was edited right after this answer was posted, which suggests a high level of interactivity. Well, I don't know why the limit is 32 bytes, but, conservatively speaking, that seems to be exactly the size of a CPU cache block. What a coincidence! In any case, code optimization must be done for a particular hardware configuration, and the result is better saved in an extra file alongside its assembly. The timeout policy is misguided, because optimization is not supposed to happen at run time, competing against precious execution time. Optimization should happen at application load time, only the first time the application runs on a machine, once and for all; it can be triggered again when a hardware configuration change is detected. Again, if you really need performance, just go with C/C++. C# is not designed for performance and will never make performance its top priority. Like Java, C# is designed for safety, with a much stronger caution against possible negative performance impacts.

Burgle answered 27/2, 2018 at 23:15 Comment(0)
Beyond the "32 bytes of IL" limit, there are a number of other factors that affect whether a method will be inlined. At least a couple of articles describe these factors.

One article explains that a scoring heuristic is used to adjust an initial guess about the relative size of the code when inlined vs. not inlined (i.e., whether the call site is larger or smaller than the inlined code itself):

  1. If inlining makes code smaller than the call it replaces, it is ALWAYS good. Note that we are talking about the NATIVE code size, not the IL code size (which can be quite different).

  2. The more a particular call site is executed, the more it will benefit from inlining. Thus code in loops deserves to be inlined more than code that is not in loops.

  3. If inlining exposes important optimizations, then inlining is more desirable. In particular, methods with value type arguments benefit more than normal from optimizations like this, and thus having a bias toward inlining these methods is good.

Thus, the heuristic the x86 JIT compiler uses, given an inline candidate, is:

  1. Estimate the size of the call site if the method were not inlined.

  2. Estimate the size of the call site if the method were inlined (this is an estimate based on the IL; we employ a simple state machine (a Markov model), created using lots of real data, to form this estimator logic).

  3. Compute a multiplier. By default, it is 1.

  4. Increase the multiplier if the code is in a loop (the current heuristic bumps it to 5 in a loop).

  5. Increase the multiplier if it looks like struct optimizations will kick in.

  6. If InlineSize <= NonInlineSize * Multiplier, do the inlining.
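The six steps above can be sketched as a toy scoring function. This is an illustrative reconstruction, not the jitter's actual code (the real estimator works on internal compiler state, and the struct-optimization bump factor below is an assumed placeholder):

```csharp
using System;

static class InlineHeuristicSketch
{
    // Steps 3-6 of the article's heuristic. 'inlineSize' and 'nonInlineSize'
    // stand in for the size estimates produced in steps 1-2.
    public static bool ShouldInline(int inlineSize, int nonInlineSize,
                                    bool inLoop, bool structOptsLikely)
    {
        double multiplier = 1.0;                 // step 3: default multiplier
        if (inLoop) multiplier *= 5.0;           // step 4: loop bump (per the article)
        if (structOptsLikely) multiplier *= 2.0; // step 5: assumed factor, not documented
        return inlineSize <= nonInlineSize * multiplier; // step 6
    }

    static void Main()
    {
        // A candidate too big to inline at a cold call site...
        Console.WriteLine(ShouldInline(40, 10, inLoop: false, structOptsLikely: false)); // False
        // ...can still win once the loop multiplier kicks in (40 <= 10 * 5).
        Console.WriteLine(ShouldInline(40, 10, inLoop: true, structOptsLikely: false));  // True
    }
}
```

This also matches the behavior Marentic reported in the comments: the same small method can be inlined at one call site and not at another, because the decision depends on the surroundings of the call, not just on the callee.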

Another article explains several conditions that, merely by being present, will prevent a method from being inlined (including the "32 bytes of IL" limit):

These are some of the reasons for which we won't inline a method:

  • Method is marked as not inline with the CompilerServices.MethodImpl attribute.

  • Size of inlinee is limited to 32 bytes of IL: This is a heuristic; the rationale behind it is that usually, when you have methods bigger than that, the overhead of the call will not be as significant compared to the work the method does. Of course, as a heuristic, it fails in some situations. There have been suggestions that we add an attribute to control this threshold. For Whidbey, that attribute has not been added (it has some very bad properties: it's x86-JIT specific, and its long-term value, as compilers get smarter, is dubious).

  • Virtual calls: We don't inline across virtual calls. The reason is that we don't know the final target of the call. We could potentially do better here (for example, if 99% of calls end up at the same target, you can generate code that checks the method table of the object the virtual call will execute on; if it's not the 99% case, you do a call, else you just execute the inlined code), but unlike the J language, most of the calls in the primary languages we support are not virtual, so we're not forced to be so aggressive about optimizing this case.

  • Value types: We have several limitations regarding value types and inlining. We take the blame here; this is a limitation of our JIT, and we could do better and we know it. Unfortunately, when stack-ranked against other features of Whidbey, given some statistics on how frequently methods cannot be inlined for this reason and the cost of making this area of the JIT significantly better, we decided that it made more sense for our customers to spend our time working on other optimizations or CLR features. Whidbey is better than previous versions in one case: value types that only have a pointer-sized int as a member; this was (relatively) inexpensive to improve, and it helped a lot with common value types such as pointer wrappers (IntPtr, etc.).

  • MarshalByRef: Call targets in MarshalByRef classes won't be inlined (the call has to be intercepted and dispatched). We've gotten better in Whidbey for this scenario.

  • VM restrictions: These are mostly security checks; the JIT must ask the VM for permission to inline a method (see CEEInfo::canInline in the Rotor source to get an idea of what kinds of things the VM checks for).

  • Complicated flowgraph: We don't inline loops, methods with exception handling regions, etc...

  • If the basic block that contains the call is deemed unlikely to execute frequently (for example, a basic block that contains a throw, or a static class constructor), inlining is much less aggressive (as the only real win we can get is code size).

  • Other: Exotic IL instructions, security checks that need a method frame, etc...
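One practical way to observe some of these rules is a stack-trace probe: in an optimized (Release) build, a successfully inlined method's frame disappears from the stack, while a method marked NoInlining always keeps its own frame. The names below are my own, and frame elision depends on build configuration and jitter version (creating the StackTrace may itself influence the decision), so treat this as a diagnostic sketch, not a guarantee:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

static class InlineProbe
{
    // If this method gets inlined, GetFrame(0) reports the caller's frame instead.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static string AggressiveName()
        => new StackTrace().GetFrame(0).GetMethod().Name;

    // NoInlining guarantees this method keeps its own frame.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static string NeverInlinedName()
        => new StackTrace().GetFrame(0).GetMethod().Name;

    static void Main()
    {
        Console.WriteLine($"aggressive: {AggressiveName()}");   // caller's name if inlined
        Console.WriteLine($"never:      {NeverInlinedName()}"); // always "NeverInlinedName"
    }
}
```

Running this under both Debug and Release builds makes the difference visible: the NoInlining probe is stable, while the AggressiveInlining probe changes depending on whether the jitter actually inlined it at that call site.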

Pitt answered 26/9, 2019 at 3:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.