RPATH propagation failing for Python bindings
I am building a library (Ubuntu 22) that uses onnxruntime under the hood. In turn, onnxruntime uses CUDA, dynamically loading some dedicated "backend". I build the whole code stack except the CUDA libraries, and none of the libraries have their RPATH or RUNPATH set (double-checked with readelf -d).
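The readelf -d check mentioned above can be scripted when many libraries are involved. The following is a minimal sketch, assuming binutils is installed and readelf prints its usual English output; the helper names parse_dynamic_paths and read_dynamic_paths are hypothetical and not part of any library.

```python
import re
import subprocess

# Matches `readelf -d` lines such as:
#  0x000000000000000f (RPATH)    Library rpath: [$ORIGIN/../lib]
#  0x000000000000001d (RUNPATH)  Library runpath: [$ORIGIN/../lib]
_ENTRY = re.compile(r"\((RPATH|RUNPATH)\)\s+Library r(?:un)?path:\s+\[(.*)\]")


def parse_dynamic_paths(readelf_output):
    """Return {'RPATH': [...], 'RUNPATH': [...]} from `readelf -d` text."""
    found = {"RPATH": [], "RUNPATH": []}
    for line in readelf_output.splitlines():
        match = _ENTRY.search(line)
        if match:
            tag, value = match.groups()
            # One entry may hold several colon-separated directories.
            found[tag].extend(p for p in value.split(":") if p)
    return found


def read_dynamic_paths(library_path):
    """Run `readelf -d` on a file and parse the result (hypothetical helper)."""
    result = subprocess.run(
        ["readelf", "-d", library_path],
        check=True, capture_output=True, text=True,
    )
    return parse_dynamic_paths(result.stdout)
```

Both lists come back empty for a library with neither attribute set, which is the situation described for every library in this stack.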

I build two apps. The first is a C++ app that links directly to my library. It has its RPATH set and everything works fine. If I run it with LD_DEBUG=libs I see output like this (note that the paths are edited and I'm showing only a tiny fraction of the debug output):

    158834:     calling init: .../install/bin/../lib/libonnxruntime_providers_cuda.so
    158834:
    158834:     find library=libcudnn_ops_infer.so.8 [0]; searching
    158834:      search path=.../install/bin/../lib         (RPATH from file .../install/bin/test)
    158834:       trying file=.../install/bin/../lib/libcudnn_ops_infer.so.8
    158834:
    158834:
    158834:     calling init: .../install/bin/../lib/libcudnn_ops_infer.so.8
    158834:

This is what I expect, so I'm happy.

However, I also need to use the very same library through Python bindings that link against it. To make this work, I need to set the RPATH of the Python bindings (which, in my understanding at least, are just a shared library that gets loaded at runtime). Note that the Python executable has neither RPATH nor RUNPATH set. This works only in part: RPATH propagation seems to work while walking down the dependency tree until the loader starts searching for the CUDA libraries, at which point it stops working. This runs exactly the same onnxruntime API in the same way, from the same build, with the same files in the same folder as above. The only difference is the Python extension layer. The LD_DEBUG output looks like this:

    159602:     find library=libonnxruntime.so.1.15.1 [0]; searching
    159602:      search path=.../install/lib/../lib         (RPATH from file .../install/lib/pyext.cpython-310-x86_64-linux-gnu.so)
    159602:       trying file=.../install/lib/../lib/libonnxruntime.so.1.15.1

[...]

    159602:     calling init: .../install/lib/pyext.cpython-310-x86_64-linux-gnu.so
    159602:
    159602:     find library=libonnxruntime_providers_shared.so [0]; searching
    159602:      search path=.../install/lib/../lib         (RPATH from file .../install/lib/pyext.cpython-310-x86_64-linux-gnu.so)
    159602:       trying file=.../install/lib/../lib/libonnxruntime_providers_shared.so
    159602:
    159602:
    159602:     calling init: .../install/lib/../lib/libonnxruntime_providers_shared.so
    159602:
    159602:     find library=libonnxruntime_providers_cuda.so [0]; searching
    159602:      search path=.../install/lib/../lib         (RPATH from file .../install/lib/pyext.cpython-310-x86_64-linux-gnu.so)
    159602:       trying file=.../install/lib/../lib/libonnxruntime_providers_cuda.so
    159602:
    159602:     find library=libcublas.so.11 [0]; searching
    159602:      search cache=/etc/ld.so.cache
    159602:      search path=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3:/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2:/lib/x86_64-linux-gnu/tls/haswell/x86_64:/lib/x86_64-linux-gnu/tls/haswell:/lib/x86_64-linux-gnu/tls/x86_64:/lib/x86_64-linux-gnu/tls:/lib/x86_64-linux-gnu/haswell/x86_64:/lib/x86_64-linux-gnu/haswell:/lib/x86_64-linux-gnu/x86_64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3:/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2:/usr/lib/x86_64-linux-gnu/tls/haswell/x86_64:/usr/lib/x86_64-linux-gnu/tls/haswell:/usr/lib/x86_64-linux-gnu/tls/x86_64:/usr/lib/x86_64-linux-gnu/tls:/usr/lib/x86_64-linux-gnu/haswell/x86_64:/usr/lib/x86_64-linux-gnu/haswell:/usr/lib/x86_64-linux-gnu/x86_64:/usr/lib/x86_64-linux-gnu:/lib/glibc-hwcaps/x86-64-v3:/lib/glibc-hwcaps/x86-64-v2:/lib/tls/haswell/x86_64:/lib/tls/haswell:/lib/tls/x86_64:/lib/tls:/lib/haswell/x86_64:/lib/haswell:/lib/x86_64:/lib:/usr/lib/glibc-hwcaps/x86-64-v3:/usr/lib/glibc-hwcaps/x86-64-v2:/usr/lib/tls/haswell/x86_64:/usr/lib/tls/haswell:/usr/lib/tls/x86_64:/usr/lib/tls:/usr/lib/haswell/x86_64:/usr/lib/haswell:/usr/lib/x86_64:/usr/lib            (system search path)
    159602:       trying file=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libcublas.so.11
    159602:       trying file=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libcublas.so.11
    159602:       trying file=/lib/x86_64-linux-gnu/tls/haswell/x86_64/libcublas.so.11

 [...]

    159602:     calling fini: .../install/lib/../lib/libonnxruntime_providers_shared.so [0]

So basically libcublas is not found (nor any other of the CUDA libs), triggering a fallback mechanism in onnxruntime that avoids using CUDA.
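Since the full LD_DEBUG=libs output is long, a small script can condense it into "which search origin was tried for which library". This is a sketch that assumes the line layout shown in the excerpts above (a pid prefix, then "find library=", "search path=" or "search cache="); the function name summarize_searches is hypothetical.

```python
import re


def summarize_searches(log_text):
    """Map each requested library to the search origins tried for it."""
    searches = {}
    current = None
    for raw in log_text.splitlines():
        # Drop the "<pid>:" prefix that precedes every debug line.
        line = raw.split(":", 1)[-1].strip()
        if line.startswith("find library="):
            current = line[len("find library="):].split()[0]
            searches.setdefault(current, [])
        elif current and line.startswith("search cache="):
            searches[current].append("ld.so.cache")
        elif current and line.startswith("search path="):
            # The origin is the trailing parenthesised note, e.g.
            # "(RPATH from file ...)" or "(system search path)".
            match = re.search(r"\(([^()]*)\)\s*$", line)
            searches[current].append(match.group(1) if match else "unknown")
    return searches
```

Applied to the Python-session log, libonnxruntime_providers_cuda.so would be reported with an "RPATH from file ..." origin, while libcublas.so.11 would only show the cache and the system search path.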

Why does RPATH propagation work for the C++ app but not for the Python extension? Is there something silly I'm missing, or is it something deeper, related to how libraries are loaded in a Python session? Could it be a weird manifestation of a bug in onnxruntime, maybe doing something wrong with dlopen?

Note that the same issue seems to be present in the Python version of onnxruntime itself: Their setup.py makes sure that all dependencies are pre-loaded, using ctypes.CDLL with RTLD_GLOBAL.
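The pre-loading approach with ctypes.CDLL and RTLD_GLOBAL can be sketched as follows. This is not onnxruntime's actual code: the function name preload_libraries, its loader parameter (there only so the function can be exercised without real CUDA libraries), and the directory in the usage note are assumptions for illustration.

```python
import ctypes
import os


def preload_libraries(lib_dir, names, loader=ctypes.CDLL):
    """Load each named library from lib_dir with RTLD_GLOBAL, in order.

    Returns the handles so the caller can keep them referenced.
    """
    handles = []
    for name in names:
        path = os.path.join(lib_dir, name)
        # RTLD_GLOBAL makes the library's symbols available to
        # libraries that are loaded afterwards.
        handles.append(loader(path, mode=ctypes.RTLD_GLOBAL))
    return handles
```

A caller would run something like preload_libraries("/path/to/install/lib", ["libcublas.so.11", "libcudnn_ops_infer.so.8"]) before importing the extension module, listing each library after the ones it depends on. The intent is that the libraries are already in the process by the time onnxruntime asks for them, so no path search is needed for them.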

Deadly answered 22/6, 2023 at 12:19 Comment(1)
Possibly related onnxruntime issue: github.com/microsoft/onnxruntime/issues/9309 - Deadly

According to https://wiki.debian.org/RpathIssue, the dynamic linker (ld.so) looks for a matching library in the following locations, in this order:

  1. the DT_RPATH dynamic section attribute of the library causing the lookup
  2. the DT_RPATH dynamic section attribute of the executable
  3. the LD_LIBRARY_PATH environment variable, unless the executable is setuid/setgid.
  4. the DT_RUNPATH dynamic section attribute of the executable
  5. /etc/ld.so.cache
  6. base library directories (/lib and /usr/lib)

So in your case:

  • When you use the C++ app, which has its RPATH set, the lookup succeeds because rule 2 applies: for every library, sub-library, and sub-sub-library, the dynamic linker uses the app's RPATH.
  • When you use the Python interpreter (which has no RPATH) and the dynamic linker loads libonnxruntime on behalf of your binding library (which has an RPATH), the lookup succeeds because rule 1 applies.
  • When the dynamic linker loads another library (say libcublas) on behalf of libonnxruntime (which has no RPATH), the lookup fails because none of the rules points to the directory that contains it.

So to make libonnxruntime load libcublas, you must set RPATH on libonnxruntime too (so that rule 1 applies).
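The three cases can be replayed with a toy model of the six rules as quoted above. It only mirrors that list and is not the dynamic linker's actual implementation; the function name search_order and the "<install>/lib" placeholder directory are hypothetical.

```python
def search_order(requester_rpath, executable_rpath, ld_library_path,
                 executable_runpath):
    """Return (rule, location) pairs in the order of the quoted list.

    Each argument is a list of directories; pass [] when the attribute
    or variable is unset. Rules 5 and 6 are always appended.
    """
    order = []
    order += [(1, d) for d in requester_rpath]
    order += [(2, d) for d in executable_rpath]
    order += [(3, d) for d in ld_library_path]
    order += [(4, d) for d in executable_runpath]
    order += [(5, "/etc/ld.so.cache"), (6, "/lib"), (6, "/usr/lib")]
    return order


# Case 1: C++ app with RPATH, requesting library without one.
cpp_case = search_order([], ["<install>/lib"], [], [])
# Case 2: Python (no RPATH), requester is the binding with an RPATH.
binding_case = search_order(["<install>/lib"], [], [], [])
# Case 3: Python (no RPATH), requester is a library without an RPATH.
cuda_case = search_order([], [], [], [])
```

In this model cpp_case starts at rule 2 and binding_case at rule 1, while cuda_case has nothing before the cache and the base directories, which matches the libcublas lookup in the question's log.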

To help debug this, you can use the lddtree tool (apt install pax-utils) to get a hierarchical view of the library dependencies.

Lightness answered 22/9, 2023 at 20:57 Comment(1)
You might be right, I need to verify this. Thanks for your input, I'll come back to your answer as soon as possible. - Deadly
