TensorFlow 2.0 can't use GPU, something wrong with cuDNN? Failed to get convolution algorithm. This is probably because cuDNN failed to initialize
I am trying to understand and debug my code. I am trying to run prediction with a CNN model developed under tf2.0/tf.keras on a GPU, but I get the error messages below. Could someone help me fix it?

Here is my environment configuration:

Environment:
python 3.6.8
tensorflow-gpu 2.0.0-rc0
nvidia 418.x
CUDA 10.0
cuDNN 7.6+

And the log file:

2019-09-28 13:10:59.833892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-28 13:11:00.228025: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-28 13:11:00.957534: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-28 13:11:00.963310: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-28 13:11:00.963416: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node mobilenetv2_1.00_192/Conv1/Conv2D}}]]
mobilenetv2_1.00_192/block_15_expand_BN/cond/then/_630/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0=====>GPU Available:  True
=====> 4 Physical GPUs, 1 Logical GPUs

mobilenetv2_1.00_192/block_15_expand_BN/cond/then/_630/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_depthwise_BN/cond/then/_644/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_depthwise_BN/cond/then/_644/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_project_BN/cond/then/_658/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_project_BN/cond/then/_658/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_expand_BN/cond/then/_672/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_expand_BN/cond/then/_672/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_depthwise_BN/cond/then/_686/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_depthwise_BN/cond/then/_686/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_project_BN/cond/then/_700/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_project_BN/cond/then/_700/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/Conv_1_bn/cond/then/_714/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/Conv_1_bn/cond/then/_714/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Traceback (most recent call last):
  File "NSFW_Server.py", line 162, in <module>
    model.predict(initial_tensor)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 915, in predict
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 722, in predict
    callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 393, in model_iteration
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3625, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1081, in __call__
    return self._call_impl(args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1121, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node mobilenetv2_1.00_192/Conv1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_keras_scratch_graph_10727]

Function call stack:
keras_scratch_graph

The code

import numpy as np
import tensorflow as tf

# Assumed value: the original snippet did not show INPUT_SHAPE; 192 matches
# the model name mobilenetv2_1.00_192 in the log.
INPUT_SHAPE = 192

if __name__ == "__main__":

    print("=====>GPU Available: ", tf.test.is_gpu_available())
    tf.debugging.set_log_device_placement(True)

    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            # Currently, memory growth needs to be the same across GPUs

            tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
            tf.config.experimental.set_memory_growth(gpus[0], True)
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
            print("=====>", len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
        except RuntimeError as e:
            # Memory growth must be set before GPUs have been initialized
            print(e)

    paras_path = "./paras/{}".format(int(2011))
    model = tf.keras.experimental.load_from_saved_model(paras_path)
    initial_tensor = np.zeros((1, INPUT_SHAPE, INPUT_SHAPE, 3))
    model.predict(initial_tensor)
Colossal answered 28/9, 2019 at 5:29 Comment(0)
C
13

Check that you have matching versions of CUDA, cuDNN and TensorFlow, and that all three are actually installed.

A few examples of working configurations are listed below (UPDATED FOR THE LATEST VERSIONS OF TENSORFLOW):

  1. CUDA 11.3.1 + cuDNN 8.2.1.32 + TensorFlow 2.7.0

  2. CUDA 11.0 + cuDNN 8.0.4 + TensorFlow 2.4.0

  3. CUDA 10.1 + cuDNN 7.6.5 (normally > 7.6) + TensorFlow 2.2.0/TensorFlow 2.3.0 (TF >= 2.1 requires CUDA >= 10.1)

  4. CUDA 10.1 + cuDNN 7.6.5 (normally > 7.6) + TensorFlow 2.1.0 (TF >= 2.1 requires CUDA >= 10.1)

  5. CUDA 10.0 + cuDNN 7.6.3 + TensorFlow 1.13/1.14 or TensorFlow 2.0

  6. CUDA 9.0 + cuDNN 7.0.5 + TensorFlow 1.10

Usually this error appears when you have incompatible versions of TensorFlow and cuDNN installed. In my case, it appeared when I tried to use an older TensorFlow with a newer version of cuDNN.
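
As a rough aid, the configurations listed in this answer can be put into a small lookup table. This is only a sketch: the table holds nothing more than the combinations named above (it is not an official TensorFlow compatibility matrix), and `KNOWN_GOOD` and `expected_cuda_cudnn` are hypothetical names made up for the illustration.

```python
# Combinations transcribed from the answer's list; not exhaustive and not
# an official TensorFlow table.
KNOWN_GOOD = {
    (2, 7): ("11.3.1", "8.2.1.32"),
    (2, 4): ("11.0", "8.0.4"),
    (2, 3): ("10.1", "7.6.5"),
    (2, 2): ("10.1", "7.6.5"),
    (2, 1): ("10.1", "7.6.5"),
    (2, 0): ("10.0", "7.6.3"),
    (1, 14): ("10.0", "7.6.3"),
    (1, 13): ("10.0", "7.6.3"),
    (1, 10): ("9.0", "7.0.5"),
}


def expected_cuda_cudnn(tf_version):
    """Return (cuda, cudnn) for a TensorFlow version string, or None if unlisted."""
    parts = tf_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return KNOWN_GOOD.get((major, minor))


if __name__ == "__main__":
    # In a real session you would pass tf.__version__ instead of a literal.
    print(expected_cuda_cudnn("2.0.0-rc0"))
```

A result of `None` only means the version is not in this list, not that no working combination exists.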

If for some reason you get an error message like the following (and nothing happens afterwards):

Relying on the driver to perform ptx compilation

Solution: install the latest NVIDIA driver.
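
Before and after updating, it can help to see which driver version is actually installed. The sketch below assumes `nvidia-smi` is on the PATH and shells out to it; `installed_driver_versions` is a hypothetical helper name, and it returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def installed_driver_versions():
    """Return one driver version string per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print(installed_driver_versions())
```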

[SEEMS TO BE SOLVED IN TF >= 2.5.0] (see below):

Windows users only: some recent combinations of CUDA, cuDNN and TF may not work because of a bug (a .dll file that is named incorrectly). To handle that specific case, see: Tensorflow GPU Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found

Cumulation answered 28/9, 2019 at 5:58 Comment(11)
Thank you. To make sure that the CUDA/cuDNN/TF versions are right, I pulled the image "tensorflow/tensorflow:2.0.0rc0-gpu-py3" from Docker Hub and ran my code in a container, but it still doesn't work and the same error messages occur. - Colossal
Try to install them by hand on your own and then compare with the dependencies installed in the Docker image. There must be a slight difference that you are missing. - Cumulation
Thanks. Has anyone tried any of the above versions, like CUDA 10.1 + cuDNN 7.64? - Justinajustine
I have modified the answer to be clearer: CUDA 10.0, not just 10, because there is a difference between 10.0 and 10.1. - Cumulation
@Justinajustine I have updated the answer for the latest TensorFlow. - Cumulation
How about TensorFlow 2.2.0? Which CUDA and cuDNN versions are compatible? - Haland
@Haland I have to check it out. For the moment, on one of my PCs I have CUDA 10.0 + TF 1.14 and on my laptop CUDA 10.1 + TF 2.1, with the cuDNN versions mentioned above. Unfortunately I won't be able to try the other configuration until the beginning of July. - Cumulation
@mobinalhassan you will need CUDA 10.1. - Cumulation
@TimbusCalin I have installed CUDA version 10.2.89. How can I downgrade it? I'm running out of memory at run time. - Benedikta
You need to uninstall CUDA completely and then perform a clean installation of CUDA 10.1. - Cumulation
@VinSentTeZla did my answer solve your problem in the end? - Cumulation
G
0

For those facing the above error on Windows: I fixed it simply by installing a cuDNN version compatible with the CUDA version already installed on the system.

    • A suitable version can be downloaded from NVIDIA's developer portal (Download cuDNN). You might need an NVIDIA account, which is easy to create by providing an email address and filling in a questionnaire.
    • To check the CUDA version, run nvcc --version.
    • Once the suitable version is downloaded, extract the folder from the zip file.
    • Go to the bin folder of the extracted folder. Copy cudnn64_7.dll and paste it into CUDA's bin folder. In my case, CUDA is installed at C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin.
    • This will most probably solve the problem.
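
To confirm that the copy step worked, a minimal check along these lines may help. It assumes the DLL is named `cudnn64_7.dll` (the name cuDNN 7.x uses on Windows) and uses the install path quoted in this answer; `cudnn_dll_present` is a hypothetical helper name.

```python
from pathlib import Path


def cudnn_dll_present(cuda_bin_dir, dll_name="cudnn64_7.dll"):
    """Return True if dll_name exists directly inside cuda_bin_dir."""
    return (Path(cuda_bin_dir) / dll_name).is_file()


if __name__ == "__main__":
    # Path taken from the answer; adjust it to your own CUDA installation.
    cuda_bin = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin"
    print(cudnn_dll_present(cuda_bin))
```

The check only looks at the file's presence; it does not verify that the cuDNN build matches the installed CUDA version.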

My system details:

  1. Windows 10
  2. CUDA 10.0
  3. TensorFlow 2.0
  4. GPU- Nvidia GTX 1060

I also found the blog post "Installing TensorFlow with CUDA and GPU support on Windows 10" very useful.

Giesecke answered 9/2, 2021 at 8:4 Comment(0)
J
0

Previously I had CUDA 10.1 + cuDNN 8.0.5; switching to cuDNN 7.6 solved the problem.

Jallier answered 21/8, 2023 at 13:8 Comment(0)
N
-1

Check the instructions on the TensorFlow GPU instruction page for your OS. It resolved the issue for me on Ubuntu 16.04.6 LTS with TensorFlow 2.0.

Neile answered 18/10, 2019 at 18:27 Comment(0)
