Stopping and starting a Deep Learning Google Cloud VM instance causes TensorFlow to stop recognizing the GPU

I am using the pre-built deep learning VM instances offered by Google Cloud, with an NVIDIA Tesla K80 GPU attached. I chose to have TensorFlow 2.5 and CUDA 11.0 installed automatically. When I start the instance, everything works great - I can run:

import tensorflow as tf
tf.config.list_physical_devices()

This returns the CPU, the accelerated CPU, and the GPU. Similarly, if I run tf.test.is_gpu_available(), the call returns True.
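
For reference, the full check is roughly this (tf.test.is_gpu_available() is deprecated in TF 2.x in favour of tf.config.list_physical_devices("GPU"), but both agree on this instance; the expected-output comments assume the single K80):

import tensorflow as tf

print(tf.config.list_physical_devices())   # CPU, accelerated CPU, and GPU:0 when the instance is healthy
print(tf.test.is_gpu_available())          # True when healthy (deprecated; prefer list_physical_devices("GPU"))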

However, if I log out, stop the instance, and then restart it, running the exact same code sees only the CPU, and tf.test.is_gpu_available() returns False. I get an error that suggests the driver initialization is failing:

 E tensorflow/stream_executor/cuda/cuda_driver.cc:355] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

Running nvidia-smi shows that the machine still sees the GPU, but TensorFlow can no longer see it.

Does anyone know what could be causing this? I don't want to have to reinstall everything every time I restart the instance.

Darcydarda answered 24/6, 2021 at 16:32 Comment(7)
I have the same problem on this instance with PyTorch 1.8; after restarting I cannot get CUDA in PyTorch. Running import torch; torch.cuda.is_available() returns False with this warning: /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1614378098133/work/c10/cuda/CUDAFunctions.cpp:109.) return torch._C._cuda_getDeviceCount() > 0.Fullfaced
I have the same problem with TensorFlow 2.1 and CUDA 11.0, but it only cropped up in the last few days in both of my VMs. Do you know if Google changed anything about the Google Cloud configuration recently that may have led to this issue?Mackmackay
@DanDan0101 according to tensorflow.org/install/source#gpu you need CUDA 10.1 for TensorFlow 2.1Adsorbent
I'm facing the same problem even the first time I start the VM. tf.config.list_physical_devices() shows only the CPU and tf.test.is_gpu_available() returns False.Adsorbent
@Adsorbent Interesting, do you know if there is backwards compatibility? My script seems to be running fine with CUDA 11.0 nowadaysMackmackay
@DanDan0101 Version compatibility has always gotten me into trouble and wasted my time, and this list solved part of my problem. Six months ago TensorFlow 2.1 was not working with CUDA 11.0 on my local computer and I had to downgrade to CUDA 10.1. I don't know if Google/TensorFlow made some changes recently.Adsorbent
I'm sure the problem mentioned on this page is due to version compatibility as can also be seen here.Adsorbent

Some people (sadly not me) are able to resolve this by setting the following at the beginning of their script/main:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
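
Note that for CUDA_VISIBLE_DEVICES to matter at all it has to be set before TensorFlow (or PyTorch) initializes CUDA (the PyTorch warning quoted in the comments on the question says as much), so set it above the framework import. A minimal sketch, assuming a single GPU:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before CUDA is initialized

import tensorflow as tf  # import only after the variable is set
print(tf.config.list_physical_devices("GPU"))  # expect one GPU entry if the driver is healthy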

I had to reinstall the CUDA drivers, and from then on it worked even after restarting the instance. You can select your system configuration on NVIDIA's website and it will give you the commands to run to install CUDA. It also asks whether you want to uninstall the previous CUDA version (yes!). This is luckily also very fast.

Fullfaced answered 25/6, 2021 at 9:11 Comment(2)
Note: on the NVIDIA website you need to install with deb (network) instead of deb (local), and then it worked for me!Darcydarda
Oddly mine was working with local. But glad to have it resolved!Fullfaced

I fixed the same issue with the commands below, taken from https://issuetracker.google.com/issues/191612865?pli=1

# Download Google's patch script from the public bucket and run it
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh

# Restart the Jupyter service so the notebook environment picks up the fix
sudo service jupyter restart
Brufsky answered 17/7, 2021 at 19:40 Comment(0)

Option-1:
Upgrade the Notebooks instance's environment. Refer to the link to upgrade.
Notebooks instances that can be upgraded are dual-disk, with one boot disk and one data disk. The upgrade process upgrades the boot disk to a new image while preserving your data on the data disk.

Option-2:
Connect to the notebook VM via SSH and run the commands from the link.
After the commands have run, the CUDA version will be updated to 11.3 and the NVIDIA driver to version 465.19.01.
Restart the notebook VM.

Note: the issue has been fixed in the GPU images, and new notebooks will be created with image version M74. The new image version is not yet mentioned in the Google public issue tracker, but you can find image version M74 in the console.

Afire answered 29/6, 2021 at 13:20 Comment(0)
