libtorch and CUDA stub library loaded at runtime problem
I am trying to compile a .cpp application that depends on LibTorch, the C++ version of PyTorch (https://pytorch.org/), on an HPC server.

I have loaded CUDA 11.8 via a module load.

nvcc -V outputs

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

With or without the CUDA module loaded, nvidia-smi outputs:

Tue May 23 22:12:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   27C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I have loaded CMake via a module load. Version 3.23.1.

I have loaded GCC-12.2.0 via a module load.

I downloaded LibTorch from the official website and unzipped the archive. It is the latest release, libtorch-shared-with-deps-2.0.1+cu118.zip.

I created a CMakeLists.txt file just as the LibTorch documentation recommends.

I use

$ cmake -DCMAKE_PREFIX_PATH=path_to_libtorch_folder ..

The CMakeLists.txt is:

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(example-app)

set(CMAKE_C_COMPILER "gcc")
set(CMAKE_CXX_COMPILER "g++")

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -pedantic -Wall")


find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

find_package(OpenMP)

add_executable(example-app example-app.cpp)

target_include_directories(example-app PUBLIC "/work4/clf/ouatu/trial_Murakami_CPP_SHTNS_PMD_LibTorch/shtns_install_omp_GNU/include")
target_link_directories(example-app PUBLIC "/work4/clf/ouatu/trial_Murakami_CPP_SHTNS_PMD_LibTorch/shtns_install_omp_GNU/lib")

target_link_directories(example-app PUBLIC "/opt/NVIDIA/NVIDIA-Linux-x86_64-460.73.01/")

if(OpenMP_CXX_FOUND)
target_link_libraries(example-app PUBLIC "${TORCH_LIBRARIES}" OpenMP::OpenMP_CXX fftw3_omp fftw3 m shtns_omp)
endif()

set_property(TARGET example-app PROPERTY CXX_STANDARD 17)

with its output:

-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /apps20/sw/amd/GCCcore/11.3.0/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /apps20/sw/amd/GCCcore/11.3.0/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Warning (dev) at libtorch/share/cmake/Caffe2/public/cuda.cmake:29 (find_package):
  Policy CMP0074 is not set: find_package uses <PackageName>_ROOT variables.
  Run "cmake --help-policy CMP0074" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.

  Environment variable CUDA_ROOT is set to:

    /apps20/sw/amd/CUDA/11.8.0

  For compatibility, CMake is ignoring the variable.
Call Stack (most recent call first):
  libtorch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:10 (find_package)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found CUDA: /apps20/sw/amd/CUDA/11.8.0 (found version "11.8")
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /apps20/sw/amd/CUDA/11.8.0/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Caffe2: CUDA detected: 11.8
-- Caffe2: CUDA nvcc is: /apps20/sw/amd/CUDA/11.8.0/bin/nvcc
-- Caffe2: CUDA toolkit directory: /apps20/sw/amd/CUDA/11.8.0
-- Caffe2: Header version is: 11.8
-- /apps20/sw/amd/CUDA/11.8.0/lib/libnvrtc.so shorthash is 672ee683
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Autodetected CUDA architecture(s):  8.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
-- Found Torch: /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch.so
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Configuring done
CMake Warning at CMakeLists.txt:15 (add_executable):
  Cannot generate a safe runtime search path for target example-app because
  files in some directories may conflict with libraries in implicit
  directories:

    runtime library [libnvrtc.so.11.2] in /apps20/sw/amd/CUDA/11.8.0/lib may be hidden by files in:
      /apps20/sw/amd/CUDA/11.8.0/lib/stubs
    runtime library [libcufft.so.10] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
      /apps20/sw/amd/CUDA/11.8.0/lib/stubs
    runtime library [libcurand.so.10] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
      /apps20/sw/amd/CUDA/11.8.0/lib/stubs
    runtime library [libcublas.so.11] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
      /apps20/sw/amd/CUDA/11.8.0/lib/stubs
    runtime library [libcublasLt.so.11] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
      /apps20/sw/amd/CUDA/11.8.0/lib/stubs

  Some of these libraries may not be found correctly.

then I do:

$ cmake --build . --config Release

with its output:

Consolidate compiler generated dependencies of target example-app
[ 50%] Linking CXX executable example-app
[100%] Built target example-app

I then run $ ./example-app, and the output of std::cout << torch::cuda::is_available() << std::endl; is 0, so the GPU is not recognised. A warning is also printed to the screen:

[W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (function operator())

From searching on the internet, it seems that at runtime the loader finds a stub library rather than the driver library.

I do not know how to solve this.

In the directory tree that $ module load CUDA/11.8.0 points to, the stubs folder is a subfolder of /apps20/sw/amd/CUDA/11.8.0/lib/.

But LD_LIBRARY_PATH is not searched recursively, is it? Thus option 2) presented in CMake cannot resolve runtime directory path is of no use to me.

Anyhow, the output of $ echo $LD_LIBRARY_PATH is:

/apps20/sw/amd/CUDA/11.8.0/nvvm/lib64:/apps20/sw/amd/CUDA/11.8.0/extras/CUPTI/lib64:/apps20/sw/amd/CUDA/11.8.0/lib:/apps20/sw/amd/libarchive/3.6.1-GCCcore-11.3.0/lib:/apps20/sw/amd/XZ/5.2.5-GCCcore-11.3.0/lib:/apps20/sw/amd/cURL/7.83.0-GCCcore-11.3.0/lib:/apps20/sw/amd/OpenSSL/1.1/lib:/apps20/sw/amd/bzip2/1.0.8-GCCcore-11.3.0/lib:/apps20/sw/amd/zlib/1.2.12-GCCcore-11.3.0/lib:/apps20/sw/amd/ncurses/6.3-GCCcore-11.3.0/lib:/apps20/sw/amd/GCCcore/11.3.0/lib64:/apps20/sw/amd/binutils/2.39-GCCcore-12.2.0/lib

For completeness, the output of echo $PATH is:

/apps20/sw/amd/CUDA/11.8.0/nvvm/bin:/apps20/sw/amd/CUDA/11.8.0/bin:/apps20/sw/amd/CMake/3.23.1-GCCcore-11.3.0/bin:/apps20/sw/amd/libarchive/3.6.1-GCCcore-11.3.0/bin:/apps20/sw/amd/XZ/5.2.5-GCCcore-11.3.0/bin:/apps20/sw/amd/cURL/7.83.0-GCCcore-11.3.0/bin:/apps20/sw/amd/OpenSSL/1.1/bin:/apps20/sw/amd/bzip2/1.0.8-GCCcore-11.3.0/bin:/apps20/sw/amd/ncurses/6.3-GCCcore-11.3.0/bin:/apps20/sw/amd/GCCcore/11.3.0/bin:/apps20/sw/amd/binutils/2.39-GCCcore-12.2.0/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/fluka/bin:/opt/ibm/platform_mpi/bin:/home/vol02/scarf1032/.local/bin:/home/vol02/scarf1032/bin

And the output of echo $CUDA_HOME is:

/apps20/sw/amd/CUDA/11.8.0

Similarly, option 1) is of no use to me: I cannot delete anything on the cluster. I have tried $ module unload CUDA/11.8.0 before running the compiled app, but then the app no longer runs, failing with ./example-app: error while loading shared libraries: libnvToolsExt.so.1: cannot open shared object file: No such file or directory.

How could I run my compiled C++ app with it seeing the correct CUDA-driver libraries and not stub libraries?

I believe the driver libraries are at /opt/NVIDIA/NVIDIA-Linux-x86_64-460.73.01/32/, a folder with the contents:

libEGL.so.1.1.0                   libGLX_nvidia.so.460.73.01        libnvidia-compiler.so.460.73.01   libnvidia-ml.so.460.73.01
libEGL_nvidia.so.460.73.01        libGLdispatch.so.0                libnvidia-eglcore.so.460.73.01    libnvidia-opencl.so.460.73.01
libGL.so.1.7.0                    libOpenCL.so.1.0.0                libnvidia-encode.so.460.73.01     libnvidia-opticalflow.so.460.73.01
libGLESv1_CM.so.1.2.0             libOpenGL.so.0                    libnvidia-fbc.so.460.73.01        libnvidia-ptxjitcompiler.so.460.73.01
libGLESv1_CM_nvidia.so.460.73.01  libcuda.so.460.73.01              libnvidia-glcore.so.460.73.01     libnvidia-tls.so.460.73.01
libGLESv2.so.2.1.0                libglvnd_install_checker          libnvidia-glsi.so.460.73.01       libvdpau_nvidia.so.460.73.01
libGLESv2_nvidia.so.460.73.01     libnvcuvid.so.460.73.01           libnvidia-glvkspirv.so.460.73.01
libGLX.so.0                       libnvidia-allocator.so.460.73.01  libnvidia-ifr.so.460.73.01
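One hedged way to check where a real, driver-installed copy of the library lives is to query the dynamic loader's cache with ldconfig; on the system above this would confirm whether the driver's libcuda.so is registered in a system directory (the path /usr/lib64 is an assumption until verified):

```shell
# Print the dynamic loader's cache and filter for libcuda; a driver-installed
# copy normally shows up under a system directory such as /usr/lib64, while
# the CUDA toolkit's lib/stubs copy should not appear in the cache at all.
ldconfig -p | grep 'libcuda\.so'
```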

EDIT: I checked with $ ldd example-app and indeed stubs appear (for example: libcuda.so.1 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libcuda.so.1):

linux-vdso.so.1 =>  (0x00007ffc29795000)
libtorch.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch.so (0x00002ba944773000)
libc10.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libc10.so (0x00002ba944792000)
libcuda.so.1 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libcuda.so.1 (0x00002ba944973000)
libnvrtc.so.11.2 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libnvrtc.so.11.2 (0x00002ba944b82000)
libnvToolsExt.so.1 => /apps20/sw/amd/CUDA/11.8.0/lib/libnvToolsExt.so.1 (0x00002ba944d84000)
libcudart.so.11.0 => /apps20/sw/amd/CUDA/11.8.0/lib/libcudart.so.11.0 (0x00002ba944f8e000)
libc10_cuda.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libc10_cuda.so (0x00002ba94483e000)
libfftw3_omp.so.3 => /lib64/libfftw3_omp.so.3 (0x00002ba945235000)
libfftw3.so.3 => /lib64/libfftw3.so.3 (0x00002ba94543c000)
libtorch_cpu.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch_cpu.so (0x00002ba9457c1000)
libtorch_cuda.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch_cuda.so (0x00002ba95ed16000)
libcublas.so.11 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libcublas.so.11 (0x00002ba9ad126000)
libgomp.so.1 => /apps20/sw/amd/GCCcore/11.3.0/lib64/libgomp.so.1 (0x00002ba9ad334000)
libstdc++.so.6 => /apps20/sw/amd/GCCcore/11.3.0/lib64/libstdc++.so.6 (0x00002ba9ad37a000)
libm.so.6 => /lib64/libm.so.6 (0x00002ba9ad58e000)
libgcc_s.so.1 => /apps20/sw/amd/GCCcore/11.3.0/lib64/libgcc_s.so.1 (0x00002ba9ad890000)
libc.so.6 => /lib64/libc.so.6 (0x00002ba9ad8aa000)
/lib64/ld-linux-x86-64.so.2 (0x00002ba94474f000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ba9adc78000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ba9ade7c000)
librt.so.1 => /lib64/librt.so.1 (0x00002ba9ae098000)
libgomp-a34b3233.so.1 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libgomp-a34b3233.so.1 (0x00002ba9ae2a0000)
libcudart-d0da41ae.so.11.0 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcudart-d0da41ae.so.11.0 (0x00002ba9ae4ca000)
libnvToolsExt-847d78f2.so.1 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libnvToolsExt-847d78f2.so.1 (0x00002ba9ae775000)
libcudnn.so.8 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcudnn.so.8 (0x00002ba9ae980000)
libcublas-3b81d170.so.11 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcublas-3b81d170.so.11 (0x00002ba9aeba6000)
libcublasLt-b6d14a74.so.11 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcublasLt-b6d14a74.so.11 (0x00002ba9b4808000)
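Notably, LD_LIBRARY_PATH above lists /apps20/sw/amd/CUDA/11.8.0/lib but not its stubs subfolder, so the stubs are presumably being resolved through an RPATH/RUNPATH that CMake embedded in the binary at link time (the "Cannot generate a safe runtime search path" warning hints at exactly this). A hedged way to verify, assuming binutils is available on the cluster:

```shell
# Dump the dynamic section of the executable and look for an embedded
# RPATH/RUNPATH entry; a .../lib/stubs path here would explain why the
# loader resolves stub libraries even though LD_LIBRARY_PATH never
# mentions the stubs directory.
readelf -d example-app | grep -E 'R(UN)?PATH'
```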
Replace answered 23/5, 2023 at 21:13 Comment(8)
Note that setting a compiler after project() is plain wrong: https://mcmap.net/q/56681/-how-to-specify-a-compiler-in-cmake. "From searching on the internet, it seems that at runtime the loader finds a stub library and not the driver library." You could easily check your guesses about the libraries found by the loader with ldd example-app. (Soutine)
@Tsyvarev, indeed, the libraries found by the loader are stubs! I will update my question with this new information. I will also read about CMake and change where I define the compilers; please excuse my plain inability to use it correctly. It's the first time I have written a CMakeLists.txt, following the LibTorch online docs. (Replace)
Having changed where I define the compilers (so that the correct GCC 11.3.0 is found and not the system-wide version 4.5.0 or similar), the same problem appears: at runtime the loader finds the NVIDIA stubs and not the driver libraries. (Replace)
Plainly, you need to set LD_LIBRARY_PATH correctly for the system you are running this code on. (Thagard)
@talonmies, thank you. I tried $ unset LD_LIBRARY_PATH, then ran the compiled application, which of course failed. Then I tried manually removing only the first 3 entries of LD_LIBRARY_PATH, i.e. /apps20/sw/amd/CUDA/11.8.0/nvvm/lib64:/apps20/sw/amd/CUDA/11.8.0/extras/CUPTI/lib64:/apps20/sw/amd/CUDA/11.8.0/lib:, then ran the app. It complains that ./example-app: error while loading shared libraries: libnvToolsExt.so.1: cannot open shared object file: No such file or directory. That library is located in /apps20/sw/amd/CUDA/11.8.0/lib, which has a subfolder called stubs ... (Replace)
@talonmies, I then tried leaving only /apps20/sw/amd/CUDA/11.8.0/lib at the end of LD_LIBRARY_PATH and removing only the other 2 entries. The compiled application runs, but it warns me that it doesn't see the GPU, and we are back to the initial problem ... (Replace)
If the runtime loader is finding the stub path, you have a problem. If you have ruled out errant LD_LIBRARY_PATH settings, then the runtime loader has its own set of "default" paths to check, independent of your LD_LIBRARY_PATH setting. If someone put the stub library paths in that (via ldconfig, as root), then I'm not sure you would be able to fix that unless you are root. (Unpromising)
@RobertCrovella, thank you. What would the solution be then, as root? (Replace)
The system administrator solved the problem.

In case anyone is having the problem I posted above, the solution was to use, when creating the cpp application using CMake, the following flags:

cmake -DCMAKE_PREFIX_PATH=<path_to_your_libtorch> -D CUDA_CUDA_LIB=/usr/lib64/libcuda.so ..

This forces linking against the NVIDIA driver's version of libcuda.so rather than the stub.
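For reference, the same override can be baked into the CMakeLists.txt instead of being passed on the command line. This is only a sketch, under the assumption that the driver's real libcuda.so is indeed at /usr/lib64/libcuda.so on the target system:

```cmake
# Point the cached CUDA_CUDA_LIB variable at the real driver library
# before find_package(Torch), so Caffe2's CUDA detection does not fall
# back to the toolkit's stub copy in .../lib/stubs.
set(CUDA_CUDA_LIB "/usr/lib64/libcuda.so" CACHE FILEPATH "driver libcuda" FORCE)
find_package(Torch REQUIRED)
```

Setting the cache variable with FORCE before find_package mirrors what -D does on the command line; on a different cluster the path would need to be adjusted to wherever the driver installs libcuda.so.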

After this, when I print std::cout << torch::cuda::is_available() << std::endl; inside my cpp application, it outputs 1 rather than 0, as it did before.

The warning also disappears.

Replace answered 25/5, 2023 at 9:39 Comment(1)
Thanks for this question and answer, and for enduring the entirely unwarranted toxicity from Stack Overflow devs. (Brothers)
