I am trying to compile a .cpp application that depends on LibTorch, the C++ distribution of PyTorch (https://pytorch.org/), on an HPC server.
I have loaded CUDA 11.8 via module load.
nvcc -V
outputs:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
With or without the CUDA module loaded, nvidia-smi
outputs:
Tue May 23 22:12:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:01:00.0 Off | 0 |
| N/A 27C P0 52W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I have loaded CMake 3.23.1 via module load.
I have loaded GCC 12.2.0 via module load.
I downloaded LibTorch from the official website and unzipped the archive: the latest release, libtorch-shared-with-deps-2.0.1+cu118.zip.
I created a CMakeLists.txt file just as recommended by the LibTorch documentation.
I use
$ cmake -DCMAKE_PREFIX_PATH=path_to_libtorch_folder ..
The CMakeLists.txt is:
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(example-app)
set(CMAKE_C_COMPILER "gcc")
set(CMAKE_CXX_COMPILER "g++")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -pedantic -Wall")
find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")
find_package(OpenMP)
add_executable(example-app example-app.cpp)
target_include_directories(example-app PUBLIC "/work4/clf/ouatu/trial_Murakami_CPP_SHTNS_PMD_LibTorch/shtns_install_omp_GNU/include")
target_link_directories(example-app PUBLIC "/work4/clf/ouatu/trial_Murakami_CPP_SHTNS_PMD_LibTorch/shtns_install_omp_GNU/lib")
target_link_directories(example-app PUBLIC "/opt/NVIDIA/NVIDIA-Linux-x86_64-460.73.01/")
if(OpenMP_CXX_FOUND)
  target_link_libraries(example-app PUBLIC "${TORCH_LIBRARIES}" OpenMP::OpenMP_CXX fftw3_omp fftw3 m shtns_omp)
endif()
set_property(TARGET example-app PROPERTY CXX_STANDARD 17)
with its output:
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /apps20/sw/amd/GCCcore/11.3.0/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /apps20/sw/amd/GCCcore/11.3.0/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Warning (dev) at libtorch/share/cmake/Caffe2/public/cuda.cmake:29 (find_package):
Policy CMP0074 is not set: find_package uses <PackageName>_ROOT variables.
Run "cmake --help-policy CMP0074" for policy details. Use the cmake_policy
command to set the policy and suppress this warning.
Environment variable CUDA_ROOT is set to:
/apps20/sw/amd/CUDA/11.8.0
For compatibility, CMake is ignoring the variable.
Call Stack (most recent call first):
libtorch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:10 (find_package)
This warning is for project developers. Use -Wno-dev to suppress it.
-- Found CUDA: /apps20/sw/amd/CUDA/11.8.0 (found version "11.8")
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /apps20/sw/amd/CUDA/11.8.0/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Caffe2: CUDA detected: 11.8
-- Caffe2: CUDA nvcc is: /apps20/sw/amd/CUDA/11.8.0/bin/nvcc
-- Caffe2: CUDA toolkit directory: /apps20/sw/amd/CUDA/11.8.0
-- Caffe2: Header version is: 11.8
-- /apps20/sw/amd/CUDA/11.8.0/lib/libnvrtc.so shorthash is 672ee683
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Autodetected CUDA architecture(s): 8.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
-- Found Torch: /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch.so
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Configuring done
CMake Warning at CMakeLists.txt:15 (add_executable):
Cannot generate a safe runtime search path for target example-app because
files in some directories may conflict with libraries in implicit
directories:
runtime library [libnvrtc.so.11.2] in /apps20/sw/amd/CUDA/11.8.0/lib may be hidden by files in:
/apps20/sw/amd/CUDA/11.8.0/lib/stubs
runtime library [libcufft.so.10] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
/apps20/sw/amd/CUDA/11.8.0/lib/stubs
runtime library [libcurand.so.10] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
/apps20/sw/amd/CUDA/11.8.0/lib/stubs
runtime library [libcublas.so.11] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
/apps20/sw/amd/CUDA/11.8.0/lib/stubs
runtime library [libcublasLt.so.11] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
/apps20/sw/amd/CUDA/11.8.0/lib/stubs
Some of these libraries may not be found correctly.
then I do:
$ cmake --build . --config Release
with its output:
Consolidate compiler generated dependencies of target example-app
[ 50%] Linking CXX executable example-app
[100%] Built target example-app
I then run $ ./example-app, and the output of std::cout << torch::cuda::is_available() << std::endl;
is 0, so the GPU is not recognised.
Also, a warning is output to the screen:
[W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (function operator())
From searching on the internet, it seems that at runtime the loader finds a stub library rather than the driver library. I do not know how to solve this.
In the directory tree that $ module load CUDA/11.8.0
points to, the stubs
folder is a subfolder of /apps20/sw/amd/CUDA/11.8.0/lib/
.
But LD_LIBRARY_PATH
is not searched recursively, is it? Thus option 2) presented in "CMake cannot resolve runtime directory path" is of no use to me.
Anyhow, the output of $ echo $LD_LIBRARY_PATH
is:
/apps20/sw/amd/CUDA/11.8.0/nvvm/lib64:/apps20/sw/amd/CUDA/11.8.0/extras/CUPTI/lib64:/apps20/sw/amd/CUDA/11.8.0/lib:/apps20/sw/amd/libarchive/3.6.1-GCCcore-11.3.0/lib:/apps20/sw/amd/XZ/5.2.5-GCCcore-11.3.0/lib:/apps20/sw/amd/cURL/7.83.0-GCCcore-11.3.0/lib:/apps20/sw/amd/OpenSSL/1.1/lib:/apps20/sw/amd/bzip2/1.0.8-GCCcore-11.3.0/lib:/apps20/sw/amd/zlib/1.2.12-GCCcore-11.3.0/lib:/apps20/sw/amd/ncurses/6.3-GCCcore-11.3.0/lib:/apps20/sw/amd/GCCcore/11.3.0/lib64:/apps20/sw/amd/binutils/2.39-GCCcore-12.2.0/lib
For completeness, the output of echo $PATH
is:
/apps20/sw/amd/CUDA/11.8.0/nvvm/bin:/apps20/sw/amd/CUDA/11.8.0/bin:/apps20/sw/amd/CMake/3.23.1-GCCcore-11.3.0/bin:/apps20/sw/amd/libarchive/3.6.1-GCCcore-11.3.0/bin:/apps20/sw/amd/XZ/5.2.5-GCCcore-11.3.0/bin:/apps20/sw/amd/cURL/7.83.0-GCCcore-11.3.0/bin:/apps20/sw/amd/OpenSSL/1.1/bin:/apps20/sw/amd/bzip2/1.0.8-GCCcore-11.3.0/bin:/apps20/sw/amd/ncurses/6.3-GCCcore-11.3.0/bin:/apps20/sw/amd/GCCcore/11.3.0/bin:/apps20/sw/amd/binutils/2.39-GCCcore-12.2.0/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/fluka/bin:/opt/ibm/platform_mpi/bin:/home/vol02/scarf1032/.local/bin:/home/vol02/scarf1032/bin
And the output of echo $CUDA_HOME
is:
/apps20/sw/amd/CUDA/11.8.0
Similarly, option 1) is of no use to me: I cannot delete anything on the cluster. I have tried $ module unload CUDA/11.8.0
before running the compiled app, but then the app no longer runs at all, failing with:
./example-app: error while loading shared libraries: libnvToolsExt.so.1: cannot open shared object file: No such file or directory
How could I run my compiled C++ app with it seeing the correct CUDA-driver libraries and not stub libraries?
I believe the driver libraries are at /opt/NVIDIA/NVIDIA-Linux-x86_64-460.73.01/32/
, a folder with the contents:
libEGL.so.1.1.0 libGLX_nvidia.so.460.73.01 libnvidia-compiler.so.460.73.01 libnvidia-ml.so.460.73.01
libEGL_nvidia.so.460.73.01 libGLdispatch.so.0 libnvidia-eglcore.so.460.73.01 libnvidia-opencl.so.460.73.01
libGL.so.1.7.0 libOpenCL.so.1.0.0 libnvidia-encode.so.460.73.01 libnvidia-opticalflow.so.460.73.01
libGLESv1_CM.so.1.2.0 libOpenGL.so.0 libnvidia-fbc.so.460.73.01 libnvidia-ptxjitcompiler.so.460.73.01
libGLESv1_CM_nvidia.so.460.73.01 libcuda.so.460.73.01 libnvidia-glcore.so.460.73.01 libnvidia-tls.so.460.73.01
libGLESv2.so.2.1.0 libglvnd_install_checker libnvidia-glsi.so.460.73.01 libvdpau_nvidia.so.460.73.01
libGLESv2_nvidia.so.460.73.01 libnvcuvid.so.460.73.01 libnvidia-glvkspirv.so.460.73.01
libGLX.so.0 libnvidia-allocator.so.460.73.01 libnvidia-ifr.so.460.73.01
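One way to test that hypothesis is to force the loader to use a specific driver library. A hedged sketch of that check, in which the paths are assumptions: nvidia-smi above reports driver 525.105.17, so a 460.73.01 tree may well be stale, and the 32/ subfolder holds 32-bit builds, which a 64-bit application cannot load, so the parent directory is the more plausible location:

```shell
# First locate the driver copies of libcuda that the dynamic linker knows
# about; the stub in the CUDA toolkit tree is normally NOT registered in
# the ldconfig cache, so whatever this prints is a good candidate:
ldconfig -p | grep libcuda

# Then force one copy ahead of everything else. The preloaded object's
# SONAME (libcuda.so.1) satisfies the app's libcuda.so.1 dependency, so
# the stub is never consulted. Path below is an assumption, see above.
export LD_PRELOAD=/opt/NVIDIA/NVIDIA-Linux-x86_64-460.73.01/libcuda.so.460.73.01
./example-app
```

If ldconfig lists a libcuda.so.1 in a system directory such as /usr/lib64, preloading that copy (or simply leaving it to the default search) is preferable to pointing at a versioned driver tree.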
EDIT: I checked with $ ldd example-app
and indeed the stubs appear (for example, line 4 of the output shows: libcuda.so.1 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libcuda.so.1):
linux-vdso.so.1 => (0x00007ffc29795000)
libtorch.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch.so (0x00002ba944773000)
libc10.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libc10.so (0x00002ba944792000)
libcuda.so.1 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libcuda.so.1 (0x00002ba944973000)
libnvrtc.so.11.2 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libnvrtc.so.11.2 (0x00002ba944b82000)
libnvToolsExt.so.1 => /apps20/sw/amd/CUDA/11.8.0/lib/libnvToolsExt.so.1 (0x00002ba944d84000)
libcudart.so.11.0 => /apps20/sw/amd/CUDA/11.8.0/lib/libcudart.so.11.0 (0x00002ba944f8e000)
libc10_cuda.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libc10_cuda.so (0x00002ba94483e000)
libfftw3_omp.so.3 => /lib64/libfftw3_omp.so.3 (0x00002ba945235000)
libfftw3.so.3 => /lib64/libfftw3.so.3 (0x00002ba94543c000)
libtorch_cpu.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch_cpu.so (0x00002ba9457c1000)
libtorch_cuda.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch_cuda.so (0x00002ba95ed16000)
libcublas.so.11 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libcublas.so.11 (0x00002ba9ad126000)
libgomp.so.1 => /apps20/sw/amd/GCCcore/11.3.0/lib64/libgomp.so.1 (0x00002ba9ad334000)
libstdc++.so.6 => /apps20/sw/amd/GCCcore/11.3.0/lib64/libstdc++.so.6 (0x00002ba9ad37a000)
libm.so.6 => /lib64/libm.so.6 (0x00002ba9ad58e000)
libgcc_s.so.1 => /apps20/sw/amd/GCCcore/11.3.0/lib64/libgcc_s.so.1 (0x00002ba9ad890000)
libc.so.6 => /lib64/libc.so.6 (0x00002ba9ad8aa000)
/lib64/ld-linux-x86-64.so.2 (0x00002ba94474f000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ba9adc78000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ba9ade7c000)
librt.so.1 => /lib64/librt.so.1 (0x00002ba9ae098000)
libgomp-a34b3233.so.1 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libgomp-a34b3233.so.1 (0x00002ba9ae2a0000)
libcudart-d0da41ae.so.11.0 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcudart-d0da41ae.so.11.0 (0x00002ba9ae4ca000)
libnvToolsExt-847d78f2.so.1 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libnvToolsExt-847d78f2.so.1 (0x00002ba9ae775000)
libcudnn.so.8 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcudnn.so.8 (0x00002ba9ae980000)
libcublas-3b81d170.so.11 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcublas-3b81d170.so.11 (0x00002ba9aeba6000)
libcublasLt-b6d14a74.so.11 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcublasLt-b6d14a74.so.11 (0x00002ba9b4808000)
Comments:
Setting the compiler after project() is plain wrong: https://mcmap.net/q/56681/-how-to-specify-a-compiler-in-cmake. "From searching on the internet, it seems that at runtime the loader finds a stub library and not the driver library." - You could easily check your guesses about libraries found by the loader with ldd example-app. – Soutine
I tried $ unset LD_LIBRARY_PATH, then ran the compiled application; of course it failed. I then manually removed only the first 3 entries of LD_LIBRARY_PATH, i.e. /apps20/sw/amd/CUDA/11.8.0/nvvm/lib64:/apps20/sw/amd/CUDA/11.8.0/extras/CUPTI/lib64:/apps20/sw/amd/CUDA/11.8.0/lib:, and ran the app; it complains that ./example-app: error while loading shared libraries: libnvToolsExt.so.1: cannot open shared object file: No such file or directory. That library is located in /apps20/sw/amd/CUDA/11.8.0/lib, where a subfolder is called stubs ...
If I instead keep /apps20/sw/amd/CUDA/11.8.0/lib at the end of LD_LIBRARY_PATH and remove only the other 2 entries, the compiled application runs, but it warns me that it does not see the GPU, and we are back to the initial problem ...
If you remove the LD_LIBRARY_PATH settings, the runtime loader still has its own set of "default" paths to check, independent of your LD_LIBRARY_PATH setting. If someone put the stub library paths in that (via ldconfig, as root), then I'm not sure you would be able to fix that unless you are root. – Unpromising
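Regarding the comment that setting the compiler after project() is wrong: CMake picks its compilers during the project() call, so set(CMAKE_C_COMPILER ...) afterwards has no reliable effect (note the configure output above reports GNU 11.3.0, not the loaded GCC 12.2.0). A sketch of the usual alternative, passing the compilers at configure time; path_to_libtorch_folder is the same placeholder used earlier:

```shell
# Select the compilers on the command line instead of inside CMakeLists.txt.
# Equivalent alternative: export CC=gcc CXX=g++ before the first cmake run.
cmake -DCMAKE_C_COMPILER=gcc \
      -DCMAKE_CXX_COMPILER=g++ \
      -DCMAKE_PREFIX_PATH=path_to_libtorch_folder ..
```

These cache variables must be set on the first configure of a fresh build directory; changing them later requires deleting CMakeCache.txt.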