Using line_profiler with numba jitted functions

Is it possible to use line_profiler with Numba?

Calling %lprun on a function decorated with @numba.jit returns an empty profile:

Timer unit: 1e-06 s

Total time: 0 s
File: <ipython-input-29-486f0a3cdf73>
Function: conv at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @numba.jit
     2                                           def conv(f, w):
     3                                               f_full = np.zeros(np.int(f.size + (2 * w.size) - 2), dtype=np.float64)
     4                                               for i in range(0, f_full.size):
     5                                                   if i >= w.size - 1 and i < w.size + f.size - 1:
     6                                                       f_full[i] = f[i - w.size + 1]
     7                                               w = w[::-1]
     8                                               g = np.zeros(f_full.size-w.size + 1, dtype=np.float64)
     9                                               for i in range(0, f_full.size - w.size):
    10                                                   g[i] = np.sum(np.multiply(f_full[i:i+w.size], w))
    11                                               return g

There's a workaround for Cython code, but I can't find anything for Numba.

Subsistence answered 6/2, 2019 at 1:34

TL;DR: Line-profiling a Numba function might not be (technically) possible, but even if it were possible, the results might not be accurate.

The problem with profilers and compiled/optimized languages

Using profilers with "compiled" languages is complicated (and this holds to some extent for non-compiled languages too, depending on what the runtime is allowed to do), because compilers are allowed to rewrite your code. To name just a few examples: constant folding, function inlining, loop unrolling (to take advantage of SIMD instructions), hoisting, and general reordering/rearranging of expressions (even across multiple lines). Generally the compiler is allowed to do anything as long as the result and the side effects are "as if" the function hadn't been "optimized".

Schematic:

+---------------+       +-------------+      +----------+
|  Source file  |   ->  |  Optimizer  |  ->  |  Result  |
+---------------+       +-------------+      +----------+

That's a problem, because a profiler needs to insert statements into the code. For example, a function profiler might insert a statement at the start and at the end of each function. That can work even if the code is optimized and the function is inlined, simply because the "profiler statements" are inlined as well. But what if the compiler decides not to inline a function because of the additional profiler statements? Then what you profile might actually behave differently from how the "real program" would perform.

For example, if you had (I use Python here even though it's not compiled; just assume I wrote such a program in C or similar):

 def give_me_ten():
     return 10

 def main():
     n = give_me_ten()
     ...

Then the optimizer could rewrite it as:

 def main():
     n = 10  # <-- inline the function

However if you insert profiler statements:

 def give_me_ten():
     profile_start('give_me_ten')
     n = 10
     profile_end('give_me_ten')
     return n

 def main():
     profile_start('main')
     n = give_me_ten()
     ...
     profile_end('main')

The optimizer might just emit the same code because it doesn't inline the function.

A line profiler inserts far more "profiler statements" into your code: at the start and at the end of every line. That can prevent a lot of compiler optimizations. I'm not too familiar with the "as-if" rule, but my guess would be that many optimizations become impossible then. So your compiled program with profiling will behave significantly differently from the compiled program without it.

For example if you had this program:

 def main():
     n = 1
     for _ in range(1000):
         n += 1
     ...

The optimizer could (not sure if any compiler would do that) rewrite it as:

 def main():
     n = 1001  # all statements are compile-time constants and no side-effects visible

However if you have line-profiling statements, then:

 def main():
     profile_start('main', line=1)
     n = 1
     profile_end('main', line=1)
     profile_start('main', line=2)
     for _ in range(1000):
         profile_end('main', line=2)
         profile_start('main', line=3)
         n += 1
         profile_end('main', line=3)
         profile_start('main', line=2)
     ...

Then by the "as-if" rule the loop has observable side effects and cannot be condensed into a single statement (the code could perhaps still be optimized, but not down to a single statement).

Note that these are simplistic examples; real compilers/optimizers are typically far more sophisticated and have many more possible optimizations.

Depending on the language, the compiler, and the profiler it may be possible to mitigate these effects. But it's unlikely that a Python-oriented profiler (such as line-profiler) targets C/C++ compilers.

Also note that this isn't really a problem with plain Python, because Python executes your program essentially step by step (not entirely true, but Python very, very rarely changes your "written code", and then only in minor ways).
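To make that concrete (a minimal sketch): CPython compiles source to bytecode without such rewrites, and every bytecode instruction keeps its original source line number, which is exactly what tracing-based tools like line_profiler rely on:

 import dis

 def main():
     n = 1
     for _ in range(1000):
         n += 1
     return n

 # The loop is not folded away; each instruction in the output is still
 # attributed to the source line it came from.
 dis.dis(main)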

How does this apply to Numba and Cython?

  • Cython translates your Python code into C (or C++) code and then uses a C (or C++) compiler to compile it. Schematic:

    +-------------+    +--------+    +----------+    +-----------+    +--------+
    | Source file | -> | Cython | -> | C source | -> | Optimizer | -> | Result |
    +-------------+    +--------+    +----------+    +-----------+    +--------+
    
  • Numba translates your Python code depending on the argument types and uses LLVM to compile the code. Schematic:

    +-------------+    +-------+    +------------------+    +--------+
    | Source file | -> | Numba | -> | LLVM / Optimizer | -> | Result |
    +-------------+    +-------+    +------------------+    +--------+
    

Both have a compiler that may perform extensive optimizations. Many of these optimizations would not be possible if the profiling statements were inserted into the code before compiling it. So even if it were possible to line-profile the code, the results might not be accurate (accurate in the sense that the real program would perform that way).
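You can see this with Numba's own introspection helpers: the dispatcher exposes the optimized LLVM IR it generated, which no longer has a simple line-by-line correspondence to the Python source. A small sketch (LLVM is often able to reduce a loop like this to a closed-form expression, exactly the kind of rewrite described above):

 import numba

 @numba.njit
 def total(n):
     s = 0
     for i in range(n):
         s += i
     return s

 total(10)  # the first call triggers compilation for int64

 # One IR dump per compiled signature; the IR is already optimized and does
 # not map line-by-line back to the Python source.
 for sig, ir in total.inspect_llvm().items():
     print(sig, "->", len(ir.splitlines()), "lines of LLVM IR")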

line_profiler was written for pure Python, so I wouldn't necessarily trust its output for Cython/Numba even if it worked. It may give some hints, but overall it may just be too inaccurate.

Numba in particular could be really tricky, because the Numba translator would need to support the profiling statements (otherwise you would end up with an object-mode Numba function, which would yield totally inaccurate results), and your jitted function isn't just one function anymore. It's actually a dispatcher that delegates to a "hidden" function depending on the types of the arguments. So when you call the same "dispatcher" with an int or a float, it could execute a totally different compiled function. Interesting fact: profiling with a function profiler already imposes significant overhead, because the Numba developers wanted to make at least that work (see cProfile adds significant overhead when calling numba jit functions).
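The dispatcher behaviour is easy to observe (a minimal sketch; the function name is made up for illustration):

 import numba

 @numba.njit
 def add_one(x):
     return x + 1

 add_one(1)    # compiles a specialization for int64
 add_one(1.0)  # compiles a second, independent specialization for float64

 # `add_one` is a dispatcher, not a single function; it now holds two
 # separately compiled versions:
 print(add_one.signatures)  # e.g. [(int64,), (float64,)]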

Okay, how to profile them?

You should probably profile with a profiler that works together with the compiler on the translated code. Such profilers can (probably) produce more accurate results than a profiler written for Python code. It will be more complicated, because these profilers report results for the translated code, which then have to be mapped back to the original source manually. It might not even be possible: typically Cython/Numba manage the translation, compilation, and execution of the result, so you need to check whether they provide hooks for an additional profiler. I have no experience there.
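If function-level (rather than line-level) granularity is enough, one practical step is to remove at least the compilation bias by triggering the JIT compilation outside the measured region. A minimal sketch using the conv function from the question (assuming conv, f, and w are defined at the top level of the script; cProfile.run executes its statement in the __main__ namespace):

 import cProfile
 import numpy as np

 f = np.random.rand(10_000)
 w = np.random.rand(32)

 conv(f, w)  # warm-up call: compilation happens here, outside the measurement

 # Function-level timings of the already-compiled dispatcher; note that, as
 # mentioned above, cProfile itself adds significant overhead around jitted calls.
 cProfile.run("conv(f, w)")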

And as a general rule: if optimizers are involved, always treat profiling results as a "guide", not necessarily as "fact". And prefer profilers that are designed for the compiler/optimizer in question, otherwise you'll lose a lot of reliability and/or accuracy.

Spellman answered 7/2, 2019 at 10:49 Comment(3)
If I don't want to line-profile the jitted function itself, but I want to get rid of the compilation/cache-loading bias, is there a way to preload the compilation caches to RAM, or run the profiler twice, so that the kernproffed script doesn't have to reload the caches? – Liquor
@Liquor If you don't want to line-profile the numba function, what do you want to profile? Eliminating the compilation overhead of numba functions is as easy as running the function once before you start profiling (if both happen in the same Python process). There is also a cache argument for numba functions that stores the compiled result on disk, so later runs can load it instead of recompiling. – Spellman
I want to profile other Python functions that may or may not call already-numbized functions. Cache doesn't help, because while it gets rid of the compile time, it still takes significant time to read the cache from the filesystem. – Liquor

I've implemented a line profiler for Numba: https://github.com/pythonspeed/profila

It does come with caveats, since optimization does indeed make mapping results back to the source code tricky, but it can be very helpful in at least some cases.

Southernmost answered 2/2 at 18:44 Comment(0)
