LuaJIT FFI callback performance

The LuaJIT FFI docs mention that calling from C back into Lua code is relatively slow and recommend avoiding it where possible:

Do not use callbacks for performance-sensitive work: e.g. consider a numerical integration routine which takes a user-defined function to integrate over. It's a bad idea to call a user-defined Lua function from C code millions of times. The callback overhead will be absolutely detrimental for performance.

For new designs avoid push-style APIs (C function repeatedly calling a callback for each result). Instead use pull-style APIs (call a C function repeatedly to get a new result). Calls from Lua to C via the FFI are much faster than the other way round. Most well-designed libraries already use pull-style APIs (read/write, get/put).

However, they don't give any sense of how much slower callbacks from C are. If I have some code that I want to speed up that uses callbacks, roughly how much of a speedup could I expect if I rewrote it to use a pull-style API? Does anyone have any benchmarks comparing implementations of equivalent functionality using each style of API?
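
For concreteness, here is a minimal sketch of the two shapes the docs describe, against a made-up C library (the mylib names and signatures are purely illustrative, not a real API):

local ffi = require("ffi")

ffi.cdef[[
// Hypothetical C API, for illustration only.
// Push style: C drives the loop and calls back into Lua for every point.
void mylib_integrate_push(double a, double b, int n, double (*f)(double x));
// Pull style: Lua drives the loop and asks C for the next point.
typedef struct mylib_iter mylib_iter;
mylib_iter *mylib_iter_new(double a, double b, int n);
bool mylib_iter_next(mylib_iter *it, double *x_out);
]]

-- Push style: every evaluation is a C -> Lua transition (the slow direction).
local cb = ffi.cast("double (*)(double)", function(x) return x * x end)
-- mylib.mylib_integrate_push(0, 1, 1e6, cb)
cb:free()  -- FFI callbacks are a limited resource and must be freed

-- Pull style: Lua calls into C repeatedly (the fast direction, JIT-compilable).
-- local it, x = mylib.mylib_iter_new(0, 1, 1e6), ffi.new("double[1]")
-- local sum = 0
-- while mylib.mylib_iter_next(it, x) do sum = sum + x[0] * x[0] end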

Ascent asked 8/9, 2012 at 8:9 Comment(0)

On my computer, a function call from LuaJIT into C has an overhead of 5 clock cycles (notably, just as fast as calling a function via a function pointer in plain C), whereas calling from C back into Lua has a 135-cycle overhead, about 27x slower. That said, a program that makes a million calls from C into Lua would only add ~100 ms of overhead to its total runtime; so while it might be worth avoiding FFI callbacks in a tight loop that operates on mostly in-cache data, the overhead of a callback that is invoked, say, once per I/O operation is probably not going to be noticeable compared to the cost of the I/O itself.

$ luajit-2.0.0-beta10 callback-bench.lua   
C into C          3.344 nsec/call
Lua into C        3.345 nsec/call
C into Lua       75.386 nsec/call
Lua into Lua      0.557 nsec/call
C empty loop      0.557 nsec/call
Lua empty loop    0.557 nsec/call

$ sysctl -n machdep.cpu.brand_string         
Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz

Benchmark code: https://gist.github.com/3726661
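
The gist contains the full harness; as a rough, stripped-down sketch of the Lua-into-C half of the measurement (the bench_noop function, library name, and compile line below are illustrative, not the gist's actual code):

-- Assumes a trivial C function compiled into a shared library, e.g.:
--   double bench_noop(double x) { return x; }
--   gcc -O2 -shared -fPIC -o libbench.so bench.c
local ffi = require("ffi")
ffi.cdef[[ double bench_noop(double x); ]]
local lib = ffi.load("./libbench.so")

local N = 1e8
local t0 = os.clock()
local acc = 0
for i = 1, N do
  acc = acc + lib.bench_noop(i)  -- Lua -> C call; the JIT compiles this to a direct call
end
print(string.format("%.3f nsec/call (acc=%g)", (os.clock() - t0) / N * 1e9, acc))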

Ascent answered 15/9, 2012 at 6:59 Comment(1)
Awesome gist, though I always appreciate it when there are some relatively standard compiler instructions in there too (e.g. gcc -shared ...) – Lucia

Since this issue (and LJ in general) has been the source of great pain for me, I'd like to toss some extra information into the ring, in hopes that it may assist someone out there in the future.

'Callbacks' are Not Always Slow

The LuaJIT FFI documentation, when it says 'callbacks are slow,' is referring very specifically to a callback created by LuaJIT and passed through the FFI to a C function that expects a function pointer. This is a completely different mechanism from calling a standard lua_CFunction that uses the Lua API to invoke a callback, and it has entirely different performance characteristics.
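
To make that distinction concrete, the 'slow' case the docs are talking about looks like this: a plain Lua function is cast to a C function pointer and then invoked by C code. The classic illustration is sorting with libc's qsort (this sketch is mine, not code from any of the answers; the lua_CFunction path through the classic C API is a separate mechanism and isn't shown here):

local ffi = require("ffi")
ffi.cdef[[
void qsort(void *base, size_t nmemb, size_t size,
           int (*compar)(const void *, const void *));
]]

local arr = ffi.new("int[5]", {42, 7, 19, 3, 25})

-- ffi.cast turns a Lua function into a C-callable function pointer; every
-- comparison below is a C -> Lua transition, i.e. the expensive direction.
local cmp = ffi.cast("int (*)(const void *, const void *)", function(pa, pb)
  local a = ffi.cast("const int *", pa)[0]
  local b = ffi.cast("const int *", pb)[0]
  if a < b then return -1 elseif a > b then return 1 else return 0 end
end)

ffi.C.qsort(arr, 5, ffi.sizeof("int"), cmp)
cmp:free()  -- callbacks are a limited resource; free them when done

for i = 0, 4 do io.write(arr[i], " ") end
print()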

With that said, the real question becomes: when should we use the Lua C API to implement logic that involves pcall et al., versus keeping everything in Lua? As always with performance, and especially in the case of a tracing JIT, one must profile (-jp) to know the answer. Period.

I have seen structurally similar situations fall on opposite ends of the performance spectrum: I have encountered code (not toy code, but production code from a high-performance game engine) that performs better when kept entirely in Lua, as well as code that performs better after introducing a language boundary, i.e. calling a lua_CFunction that uses luaL_ref to hold handles to callbacks and callback arguments.

Optimizing for LuaJIT without Measurement is a Fool's Errand

Tracing JITs are already hard to reason about, even if you're an expert in static language perf analysis. They take everything you thought you knew about performance and shatter it to pieces. If the concept of compiling recorded IR rather than compiling functions doesn't already annihilate one's ability to reason about LuaJIT performance, then the fact that calling into C via the FFI is more-or-less free when successfully JITed, yet potentially an order-of-magnitude more expensive than an equivalent lua_CFunction call when interpreted...well, this for sure pushes the situation over the edge.
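
One way to observe part of that effect directly is to time the same FFI-call loop first compiled and then interpreted (a rough sketch; strlen is just an arbitrary, convenient libc function to call, and this shows only the FFI side, not the lua_CFunction comparison):

local ffi = require("ffi")
ffi.cdef[[ size_t strlen(const char *s); ]]

local s = "hello, world"

local function bench(label)
  local N, n = 1e7, 0
  local t0 = os.clock()
  for i = 1, N do
    n = n + tonumber(ffi.C.strlen(s))  -- Lua -> C FFI call
  end
  print(string.format("%-12s %.3f nsec/call", label, (os.clock() - t0) / N * 1e9))
end

bench("JIT on")        -- the FFI call is compiled right into the trace: very cheap
jit.off(); jit.flush() -- disable the compiler and throw away existing traces
bench("interpreted")   -- the same call now goes through the slow interpreted FFI path
jit.on()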

Concretely, a system that you wrote last week, and that vastly outperformed a C equivalent, may tank this week because you introduced an NYI (a not-yet-implemented operation that aborts trace compilation) in trace proximity to said system; that NYI may well have come from a seemingly orthogonal region of code, and now your system is falling back to the interpreter and performance is obliterated. Even worse, perhaps you're well aware of what is and isn't an NYI, but you added just enough code near the trace that it exceeded the JIT's limits on recorded IR instructions, virtual registers, call depth, unroll factor, side traces, etc.

Also, note that, while 'empty' benchmarks can sometimes give a very general insight, it is even more important with LJ (for the aforementioned reasons) that code be profiled in context. It is very, very difficult to write representative performance benchmarks for LuaJIT, since traces are, by their nature, non-local. When using LJ in a large application, these non-local interactions become tremendously impactful.

TL;DR

There is exactly one person on this planet who really and truly understands the behavior of LuaJIT. His name is Mike Pall.

If you are not Mike Pall, do not assume anything about LJ behavior and performance. Use -jv (verbose mode; watch for NYIs and fallbacks), -jp (the profiler! Combine it with jit.zone for custom annotations; use -jp=vf to see what fraction of your time is being spent in the interpreter due to fallbacks), and, when you really need to know what's going on, -jdump (trace IR and ASM). Measure, measure, measure. Take generalizations about LJ performance characteristics with a grain of salt unless they come from the man himself or you've measured them in your specific use case (in which case, after all, it's not a generalization). And remember, the right solution might be all in Lua, it might be all in C, it might be Lua -> C through the FFI, it might be Lua -> lua_CFunction -> Lua... you get the idea.
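
For reference, the same tooling can also be attached from inside a script instead of on the command line (module names as shipped with LuaJIT 2.1; the option strings and file names here are just examples):

-- Programmatic equivalents of the -jv and -jp command-line flags (LuaJIT 2.1;
-- 2.0 has jit.v but no built-in profiler).
local jit_v = require("jit.v")
jit_v.on("verbose.log")            -- log trace starts, aborts, and NYI bailouts

local jit_p = require("jit.p")
jit_p.start("vf", "profile.log")   -- 'v' = show VM states, 'f' = profile by function

local zone = require("jit.zone")   -- custom zones the profiler can attribute time to
zone("physics-update")
-- ... the code you actually care about ...
zone()

jit_p.stop()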

Coming from someone who has been fooled time and time again into thinking that he understood LuaJIT, only to be proven wrong the following week, I sincerely hope this information helps someone out there :) Personally, I simply no longer make 'educated guesses' about LuaJIT. My engine outputs -jv and -jp logs for every run, and they are the 'word of God' for me with respect to optimization.

Precognition answered 29/11, 2017 at 19:52 Comment(0)

Two years later, I redid the benchmarks from Ascent's answer above for the following reasons:

  1. To see whether the numbers improved with newer hardware and newer LuaJIT versions.
  2. To add tests for functions with parameters and return values. The callback documentation mentions that, apart from the call overhead itself, parameter marshalling also matters (see the sketch after this list):

    [...] the C to Lua transition itself has an unavoidable cost, similar to a lua_call() or lua_pcall(). Argument and result marshalling add to that cost [...]

  3. To check the difference between PUSH style and PULL style.
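
As a sketch of what callbacks with marshalled parameters and results look like on the Lua side (the signatures below are guessed from the set_*/get_* names in the results table, not taken from the linked code):

local ffi = require("ffi")

-- Each ffi.cast creates a C-callable function pointer backed by a Lua function;
-- every invocation from C pays the C -> Lua transition plus the cost of
-- marshalling the argument and/or the return value.
local set_v = ffi.cast("void (*)(void)",   function() end)
local set_i = ffi.cast("void (*)(int)",    function(i) end)
local set_d = ffi.cast("void (*)(double)", function(d) end)
local get_i = ffi.cast("int (*)(void)",    function() return 42 end)
local get_d = ffi.cast("double (*)(void)", function() return 3.14 end)

-- These pointers would then be handed to a C driver loop that calls them
-- millions of times; that driver loop is what the linked benchmark code provides.

for _, cb in ipairs{ set_v, set_i, set_d, get_i, get_d } do cb:free() end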

My results, on Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz:

operation                  reps     time(s) nsec/call
C into Lua set_v          10000000  0.498    49.817
C into Lua set_i          10000000  0.662    66.249
C into Lua set_d          10000000  0.681    68.143
C into Lua get_i          10000000  0.633    63.272
C into Lua get_d          10000000  0.650    64.990
Lua into C call(void)    100000000  0.381     3.807
Lua into C call(int)     100000000  0.381     3.815
Lua into C call(double)  100000000  0.415     4.154
Lua into Lua             100000000  0.104     1.039
C empty loop            1000000000  0.695     0.695
Lua empty loop          1000000000  0.693     0.693

PUSH style               1000000    0.158   158.256
PULL style               1000000    0.207   207.297

The code for these results is here.

Conclusion: C callbacks into Lua have a really big overhead when used with parameters (which is what you almost always do), so they really shouldn't be used in performance-critical code. They are fine for I/O or user-input handling, though.

I am a bit surprised there is so little difference between PUSH/PULL styles, but maybe my implementation is not among the best.

Cab answered 27/11, 2014 at 18:8 Comment(0)

There is a significant performance difference, as shown by these results:

LuaJIT 2.0.0-beta10 (Windows x64)
JIT: ON CMOV SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
n          Push Time        Pull Time        Push Mem         Pull Mem
256        0.000333         0                68               64
4096       0.002999         0.001333         188              124
65536      0.037999         0.017333         2108             1084
1048576    0.588333         0.255            32828            16444
16777216   9.535666         4.282999         524348           262204

The code for this benchmark can be found here.

Mcgann answered 8/9, 2012 at 11:5 Comment(2)
Do you have any explanation/interpretation of these results? At first glance, it looks like calls from C into Lua are only twice as slow as the other direction, which is a far smaller difference than I would expect. But from looking at your benchmark, I suspect what you're actually comparing is two calls from C into Lua versus one; I don't think Lua functions cast to a ctype have performance comparable to actual C-implemented functions. – Ascent
Would you be able to provide sum_push and sum_pull as pure C functions? I have not been able to compile C properly on my dev machine as of late. – Mcgann
