tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

I am trying to use a GPU with TensorFlow. My TensorFlow version is 2.4.1 and I am using CUDA version 11.2. Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce MX110       Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P0    N/A /  N/A |    254MiB /  2004MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1151      G   /usr/lib/xorg/Xorg                 37MiB |
|    0   N/A  N/A      1654      G   /usr/lib/xorg/Xorg                136MiB |
|    0   N/A  N/A      1830      G   /usr/bin/gnome-shell               68MiB |
|    0   N/A  N/A      5443      G   /usr/lib/firefox/firefox            0MiB |
|    0   N/A  N/A      5659      G   /usr/lib/firefox/firefox            0MiB |
+-----------------------------------------------------------------------------+

I am facing a strange issue. Previously, when I listed all the physical devices using tf.config.list_physical_devices(), it identified one CPU and one GPU. After that I tried a simple matrix multiplication on the GPU. It failed with an error along the lines of "failed to synchronize cuda stream CUDA_LAUNCH_ERROR" (I forgot to note the exact error code). When I then tried the same thing from another terminal, it failed to recognise any GPU. This time, listing the physical devices produces this:

>>> tf.config.list_physical_devices()
2021-04-11 18:56:47.504776: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-11 18:56:47.507646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-11 18:56:47.534189: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-04-11 18:56:47.534233: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: debadri-HP-Laptop-15g-dr0xxx
2021-04-11 18:56:47.534244: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: debadri-HP-Laptop-15g-dr0xxx
2021-04-11 18:56:47.534356: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.39.0
2021-04-11 18:56:47.534393: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.39.0
2021-04-11 18:56:47.534404: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.39.0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

My OS is Ubuntu 20.04, my Python version is 3.8.5 and, as mentioned before, I am using TensorFlow 2.4.1 with CUDA version 11.2. I installed CUDA from these instructions. One additional piece of information: when I import tensorflow, it shows the following output:

import tensorflow as tf
2021-04-11 18:56:07.716683: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

What am I missing? Why is it failing to recognise the GPU even though it was recognising previously?
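
When comparing a working run against a failing one, it can help to pull the key facts out of the log mechanically. The following is a minimal sketch, not part of TensorFlow: the helper name summarize_cuda_log and its regular expressions are assumptions modelled only on the log lines quoted above.

```python
import re

def summarize_cuda_log(log_text):
    """Extract the cuInit error code and driver versions from TensorFlow log text.

    Hypothetical helper: the patterns are assumptions based on the log lines
    quoted in the question, not on any TensorFlow API.
    """
    error = re.search(r"failed call to cuInit: (CUDA_ERROR_\w+)", log_text)
    libcuda = re.search(r"libcuda reported version is: ([\d.]+)", log_text)
    kernel = re.search(r"kernel reported version is: ([\d.]+)", log_text)
    return {
        "cuinit_error": error.group(1) if error else None,
        "libcuda_version": libcuda.group(1) if libcuda else None,
        "kernel_version": kernel.group(1) if kernel else None,
    }

# Three lines copied from the failing run in the question.
sample = """\
2021-04-11 18:56:47.534189: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-04-11 18:56:47.534356: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.39.0
2021-04-11 18:56:47.534393: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.39.0
"""

summary = summarize_cuda_log(sample)
print(summary)
```

For the log above, both reported versions are 460.39.0, which agrees with TensorFlow's own "kernel version seems to match DSO" line, so the log itself does not point to a mismatch between the user-space driver library and the kernel module.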

Glenoid answered 11/4, 2021 at 13:37 Comment(5)
These are the required versions: tensorflow.org/install/source#gpu - Carmon
@Carmon Is my configuration not correct? I think I am using the versions mentioned in the link. - Glenoid
Install CUDA toolkit 11.0 and reboot after sudo apt-get install nvidia-modprobe. Thanks - Wrenn
I have TensorFlow 2.5 and CUDA 11.0 and get the same error: "failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error". What am I missing? - Calica
@Glenoid your nvidia-smi shows CUDA Version: 11.2 and import tensorflow shows libcudart.so.11.0. Why are these versions different? The versions of TensorFlow and CUDA should be compatible according to tensorflow.org/install/source#gpu - Calica
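
On the version question in the last comment: the "CUDA Version" in the nvidia-smi header is the highest CUDA version the installed driver supports, while libcudart.so.11.0 is the name of the runtime library that TensorFlow actually loaded, so the two numbers measure different things. A minimal sketch for extracting both from the text quoted in the question follows; the function names and patterns are illustrative assumptions, not an NVIDIA or TensorFlow API.

```python
import re

def driver_cuda_version(nvidia_smi_header):
    """Return the 'CUDA Version' shown in an nvidia-smi header line, or None."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", nvidia_smi_header)
    return match.group(1) if match else None

def loaded_cudart_suffix(log_line):
    """Return the version suffix of the libcudart library named in a log line, or None."""
    match = re.search(r"libcudart\.so\.([\d.]+)", log_line)
    return match.group(1) if match else None

# Both strings are copied from the question.
header = "| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |"
log_line = "Successfully opened dynamic library libcudart.so.11.0"

print("driver supports up to CUDA", driver_cuda_version(header))
print("runtime library suffix", loaded_cudart_suffix(log_line))
```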

tldr: Disable Secure Boot before installing the Nvidia Driver.

I had the exact same error, and I spent a ton of time trying to figure out whether I had installed the TensorFlow-related stuff incorrectly. After many hours of troubleshooting, I found that my NVIDIA driver was having problems because I never disabled Secure Boot in my BIOS when setting up Ubuntu 20.04. Here's what I suggest (I opted for using Docker w/ TensorFlow, which avoids having to install all the CUDA-related stuff) - I hope it works for you!

  1. Disable Secure Boot in your BIOS.
  2. Make a fresh install of Ubuntu 20.04.
  3. Install Docker according to nvidia-container-toolkit's page.
curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker
  4. Install nvidia-container-toolkit from the same page.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
  5. Test to make sure that's working with
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
  6. Finally, use TensorFlow with Docker w/ GPU support!
docker run --gpus all -u $(id -u):$(id -g) -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter jupyter notebook --ip=0.0.0.0
Arteritis answered 28/5, 2021 at 18:32 Comment(2)
Why is Docker needed? - Glenoid
It isn't necessary, but it helps make sure there are no version conflicts with CUDA, cuDNN, and the NVIDIA driver. The tldr for solving the error we had, at least for me, is to disable Secure Boot before installing the NVIDIA driver. - Arteritis

I just made an account to say that @Nate's answer worked for me. I have the exact same setup as you and had been trying for two days.

What I did in the end was:

Reboot - F10 to enter the settings - Security - BIOS Secure Boot (or something like that, I don't remember exactly) - Disabled

Then there were some extra confirmation steps, but it worked fine. I did not reinstall Ubuntu; that felt too technically risky for me.
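
To confirm from inside Ubuntu that the firmware change took effect, one option is to query the Secure Boot state before retrying TensorFlow. This is a minimal sketch under assumptions: it relies on the mokutil tool being installed and on its "--sb-state" output containing "SecureBoot enabled" or "SecureBoot disabled"; the function names are hypothetical, and the check returns None whenever it cannot tell.

```python
import shutil
import subprocess

def parse_sb_state(output):
    """Map mokutil-style output to True (enabled), False (disabled) or None (unknown)."""
    text = output.lower()
    if "secureboot enabled" in text:
        return True
    if "secureboot disabled" in text:
        return False
    return None

def secure_boot_enabled():
    """Run `mokutil --sb-state` if available; return None when the state is unknown."""
    if shutil.which("mokutil") is None:
        return None  # assumption: mokutil is not installed on this machine
    try:
        result = subprocess.run(
            ["mokutil", "--sb-state"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return parse_sb_state(result.stdout + result.stderr)

print("Secure Boot enabled:", secure_boot_enabled())
```

A result of False suggests Secure Boot is off; None only means the check could not determine the state, not that the setting is correct.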

Then I tried the tf.config line and I got this:

2021-06-14 17:12:19.546509: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1

2021-06-14 17:12:26.754680: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

2021-06-14 17:12:26.909679: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3593460000 Hz

2021-06-14 17:12:26.910016: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a8352501c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:

2021-06-14 17:12:26.910040: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version

2021-06-14 17:12:26.972350: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1

2021-06-14 17:12:27.074861: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

2021-06-14 17:12:27.075289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:0c:00.0 name: GeForce GTX 1650 computeCapability: 7.5
coreClock: 1.665GHz coreCount: 14 deviceMemorySize: 3.81GiB deviceMemoryBandwidth: 119.24GiB/s

There are more red lines about the device properties towards the end, but I got:

Default GPU Device: /device:GPU:0

I don't know why it works, but it works. Just change the Secure Boot setting.

I don't have enough experience points to upvote Nate's answer. I will come back later. But he/she really offers a good solution.

Maje answered 14/6, 2021 at 15:33 Comment(0)

Disabling Secure Boot solved the problem immediately. No need to reinstall anything.

> import tensorflow as tf
> tf.config.list_physical_devices("GPU")
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
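
Since the original failure in the question happened during a matrix multiplication rather than during device listing, it may be worth exercising the GPU as well once it is listed. The following is a rough sketch, assuming TensorFlow 2.x; the helper name gpu_matmul_check and the 256x256 matrix size are arbitrary choices for illustration.

```python
def gpu_matmul_check():
    """Return None if TensorFlow is missing, False if no GPU is listed,
    and True if a small matmul on the first GPU completes.

    Hypothetical helper; any CUDA error raised by the matmul is left to
    propagate so that the real error message stays visible.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    if not tf.config.list_physical_devices("GPU"):
        return False

    with tf.device("/GPU:0"):
        a = tf.random.uniform((256, 256))
        b = tf.random.uniform((256, 256))
        product = tf.matmul(a, b)

    # Copying the result to host memory forces the computation to finish.
    product.numpy()
    return True

print("GPU matmul check:", gpu_matmul_check())
```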
Gourd answered 12/3, 2022 at 17:14 Comment(0)

I solved it with this command (though I don't know why it works; comments are appreciated):

sudo systemctl restart display-manager
Efrem answered 29/3, 2024 at 13:18 Comment(0)
