Since this issue (and LJ in general) has been the source of great pain for me, I'd like to toss some extra information into the ring, in hopes that it may assist someone out there in the future.
'Callbacks' are Not Always Slow
The LuaJIT FFI documentation, when it says 'callbacks are slow,' is referring very specifically to the case of a callback created by LuaJIT and passed through FFI to a C function that expects a function pointer. This is completely different from other callback mechanisms; in particular, it has entirely different performance characteristics from a standard lua_CFunction that uses the Lua C API to invoke a callback.
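To make the 'slow' case concrete, here is a minimal sketch of the pattern the FFI docs are warning about: a Lua function turned into a C function pointer and handed to a C routine. It assumes LuaJIT (the ffi module does not exist in stock Lua) and uses the C library's qsort purely for illustration; the array and comparator are my own, not taken from any real codebase.

```lua
-- Requires LuaJIT: "ffi" is not available in stock Lua.
local ffi = require("ffi")

ffi.cdef[[
void qsort(void *base, size_t nmemb, size_t size,
           int (*compar)(const void *, const void *));
]]

-- Casting a Lua function to a C function pointer creates an FFI callback.
-- Every call qsort makes through this pointer goes C -> Lua, which is the
-- transition the 'callbacks are slow' remark is about.
local cmp = ffi.cast("int (*)(const void *, const void *)", function(a, b)
  local x = ffi.cast("const int *", a)[0]
  local y = ffi.cast("const int *", b)[0]
  return (x > y and 1 or 0) - (x < y and 1 or 0)
end)

local arr = ffi.new("int[5]", {5, 3, 1, 4, 2})
ffi.C.qsort(arr, 5, ffi.sizeof("int"), cmp)

-- Callback slots are a limited resource and are not garbage collected,
-- so release this one explicitly once it is no longer needed.
cmp:free()
```

Whether this is actually a problem in your code is, again, something to measure rather than assume.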
With that said, the real question is then: when do we use the Lua C API to implement logic that involves pcall et al, vs. keeping everything in Lua? As always with performance, but especially in the case of a tracing JIT, one must profile (-jp) to know the answer. Period.
I have seen situations that looked similar yet fell on opposite ends of the performance spectrum. That is, I have encountered code (not toy code, but rather production code in the context of writing a high-perf game engine) that performs better when structured as Lua-only, as well as code (seemingly structurally similar) that performs better upon introducing a language boundary by calling a lua_CFunction that uses luaL_ref to maintain handles to callbacks and callback arguments.
Optimizing for LuaJIT without Measurement is a Fool's Errand
Tracing JITs are already hard to reason about, even if you're an expert in static language perf analysis. They take everything you thought you knew about performance and shatter it to pieces. If the concept of compiling recorded IR rather than compiling functions doesn't already annihilate one's ability to reason about LuaJIT performance, then the fact that calling into C via the FFI is more-or-less free when successfully JITed, yet potentially an order-of-magnitude more expensive than an equivalent lua_CFunction call when interpreted...well, this for sure pushes the situation over the edge.
Concretely, a system that you wrote last week that vastly out-performed a C equivalent may tank this week because you introduced an NYI in trace-proximity to said system, which may well have come from a seemingly-orthogonal region of code, and now your system is falling back and obliterating performance. Even worse, perhaps you're well-aware of what is and isn't an NYI, but you added just enough code to the trace proximity that it exceeded the JIT's max recorded IR instructions, max virtual registers, call depth, unroll factor, side trace limit...etc.
Also, note that, while 'empty' benchmarks can sometimes give a very general insight, it is even more important with LJ (for the aforementioned reasons) that code be profiled in context. It is very, very difficult to write representative performance benchmarks for LuaJIT, since traces are, by their nature, non-local. When using LJ in a large application, these non-local interactions become tremendously impactful.
TL;DR
There is exactly one person on this planet who really and truly understands the behavior of LuaJIT. His name is Mike Pall.
If you are not Mike Pall, do not assume anything about LJ behavior and performance. Use -jv (verbose; watch for NYIs and fallbacks), -jp (profiler! Combine with jit.zone for custom annotations; use -jp=vf to see what % of your time is being spent in the interpreter due to fallbacks), and, when you really need to know what's going on, -jdump (trace IR & ASM). Measure, measure, measure. Take generalizations about LJ performance characteristics with a grain of salt unless they come from the man himself or you've measured them in your specific use case (in which case, after all, it's not a generalization). And remember, the right solution might be all in Lua, it might be all in C, it might be Lua -> C through FFI, it might be Lua -> lua_CFunction -> Lua, ...you get the idea.
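If you'd rather switch these tools on from inside the application than from the command line, the same modules can be loaded with require. The following is only a sketch, assuming LuaJIT 2.1 (where jit.v, jit.p and jit.zone ship with the runtime); the log file names, the 'physics' zone and the update_physics workload are all made up for illustration. The pcall guards just let the script run unchanged on a runtime that lacks those modules.

```lua
-- Assumes LuaJIT 2.1. On other runtimes the requires fail and the
-- script simply runs the workload without any instrumentation.
local ok_v, jv = pcall(require, "jit.v")
local ok_p, jp = pcall(require, "jit.p")
local ok_z, zone = pcall(require, "jit.zone")
if not ok_z then zone = function() end end

-- Equivalent in spirit to -jv and -jp=vf, with output sent to files
-- (hypothetical file names).
if ok_v then jv.on("jv.log") end
if ok_p then jp.start("vf", "jp.log") end

-- Hypothetical hot loop standing in for real engine code.
local function update_physics(n)
  local acc = 0
  for i = 1, n do
    acc = acc + i % 7
  end
  return acc
end

-- Push a named zone around the region of interest, then pop it.
-- Zones only show up in the report when the profiler mode includes 'z'
-- (e.g. "zf" instead of "vf").
zone("physics")
local result = update_physics(1e6)
zone()

if ok_p then jp.stop() end
if ok_v then jv.off() end
```

Afterwards, jv.log lists the traces that were started, completed or aborted, and jp.log holds the profiler report split by VM state.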
Coming from someone who has been fooled time-and-time-again into thinking that he has understood LuaJIT, only to be proven wrong the following week, I sincerely hope this information helps someone out there :) Personally, I simply no longer make 'educated guesses' about LuaJIT. My engine outputs jv and jp logs for every run, and they are the 'word of God' for me with respect to optimization.