I want to build a neural network in TensorFlow 2.x that trains on a GPU, and I want to set up all the necessary infrastructure inside a docker-compose network (assuming for now that this is actually possible). As far as I know, training a TensorFlow model on a GPU requires the CUDA toolkit and the NVIDIA driver. Installing these dependencies natively on my computer (OS: Ubuntu 18.04) is always quite a pain, as there are many version dependencies between TensorFlow, CUDA and the NVIDIA driver. So I was trying to create a docker-compose file that contains one service each for TensorFlow, CUDA and the NVIDIA driver, but I am getting the following error:
# Start the services
sudo docker-compose -f docker-compose-test.yml up --build
Starting vw_image_cls_nvidia-driver_1 ... done
Starting vw_image_cls_nvidia-cuda_1 ... done
Recreating vw_image_cls_tensorflow_1 ... error
ERROR: for vw_image_cls_tensorflow_1 Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: for tensorflow Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: Encountered errors while bringing up the project.
My docker-compose file looks as follows:
# version 2.3 is required for NVIDIA runtime
version: '2.3'
services:
  nvidia-driver:
    # NVIDIA GPU driver used by the CUDA Toolkit
    image: nvidia/driver:440.33.01-ubuntu18.04
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Do we need this volume to make the driver accessible by other containers in the network?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
    networks:
      - net
  nvidia-cuda:
    depends_on:
      - nvidia-driver
    image: nvidia/cuda:10.1-base-ubuntu18.04
    volumes:
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need to create an additional volume for this service to be accessible by the tensorflow service?
    devices:
      # Do we need to list the devices here, or only in the tensorflow service. Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
    networks:
      - net
  tensorflow:
    image: tensorflow/tensorflow:2.0.1-gpu # Does this ship with cuda10.0 installed or do I need a separate container for it?
    runtime: nvidia
    restart: always
    privileged: true
    depends_on:
      - nvidia-cuda
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Volumes related to source code and config files
      - ./src:/src
      - ./configs:/configs
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need an additional volume from the nvidia-cuda service?
    command: import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print("SUCCESS")
    devices:
      # Devices listed here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
      - /dev/nvidia-uvm-tools
    networks:
      - net
volumes:
  nvidia_driver:
networks:
  net:
    driver: bridge
And my `/etc/docker/daemon.json` file looks as follows:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
So, it seems like the error is somehow related to configuring the nvidia runtime, but more importantly, I am almost certain that I didn't set up my docker-compose file correctly. So, my questions are:
- Is it actually possible to do what I am trying to do?
- If yes, did I set up my docker-compose file correctly (see comments in `docker-compose.yml`)?
- How do I fix the error message I received above?
Thank you very much for your help, I highly appreciate it.
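For reference, the one-line check in `command:` can also be written as a small script under `./src` (mounted at `/src` in the tensorflow service). This is only a minimal sketch: the file name `gpu_check.py` and the function name `gpu_smoke_test` are hypothetical, and it assumes `tf.config.experimental.list_physical_devices`, which is how TensorFlow 2.0 exposes device listing. It returns `None` instead of failing when TensorFlow is not installed.

```python
# gpu_check.py (hypothetical file name, e.g. placed in ./src)
import importlib.util


def gpu_smoke_test():
    """Return the GPUs TensorFlow can see, or None if TensorFlow is missing."""
    if importlib.util.find_spec("tensorflow") is None:
        # TensorFlow is not installed in this environment; nothing to check.
        return None

    import tensorflow as tf

    # List the physical GPU devices visible to TensorFlow (TF 2.0-era API).
    gpus = tf.config.experimental.list_physical_devices("GPU")

    # Same computation as in the compose file's command: entry.
    print(tf.reduce_sum(tf.random.normal([1000, 1000])))
    return gpus


if __name__ == "__main__":
    found = gpu_smoke_test()
    if found is None:
        print("TensorFlow is not installed")
    else:
        print("GPUs visible to TensorFlow:", found)
```

Inside the container this would be run as `python /src/gpu_check.py`; an empty list would mean TensorFlow started but does not see a GPU.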
[...] `-gpu` flag on the docker image, see: hub.docker.com/r/tensorflow/tensorflow and NVIDIA Container Toolkit (github.com/NVIDIA/nvidia-docker/blob/master/README.md) – Elfreda
[...] `--gpu` flag, when executing `docker run ...`, but how would you do this when running `docker-compose up`. According to the documentation of docker-compose up, there is no `--gpu` ... – Entreaty
[...] `docker run ...` for you. You may provide arguments to a container in Compose using `command:` at the same level as `image:`, `environment:` etc. You would have `command:`. then below it `- --gpu`. NB That's a single hyphen to indicate an array item for `command` and then the double-hyphen preceding `gpu`. Alternatively (but messy) you can mix JSON w/ the YAML and write: `command: ["--gpu"]` – Elfreda
[...] `command: ["/bin/sh -c", "--gpus all python", "import tensorflow as tf", "print(tf.reduce_sum(tf.random.normal([1000, 1000])))"]`. Is this what you mean? Unfortunately, I cannot test this right now... – Entreaty
[...] `command: ["/bin/sh -c", "python", "import tensorflow as tf", "print(tf.reduce_sum(tf.random.normal([1000, 1000])))"`. You may be able to use `command: ["python","-c","import ...."]` – Elfreda
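The error message says the container tried to execute `import`, which fits how a plain-string `command:` is handled: the string is split into words and the first word is treated as the executable. The sketch below illustrates this with Python's `shlex.split` as a stand-in for that splitting (an assumption; Compose's exact parsing may differ). It then runs the list form suggested in the last comment, with `sys.executable` standing in for the container's `python` and a shortened one-liner in place of the TensorFlow code.

```python
import shlex
import subprocess
import sys

# Shortened version of the string given to command: in the compose file.
string_form = 'import tensorflow as tf; print("SUCCESS")'

# Shell-style word splitting: the first word becomes the program to execute.
argv = shlex.split(string_form)
print(argv[0])  # import

# List form: the interpreter is named explicitly and the code is passed to -c
# as one argument. sys.executable is a stand-in for "python" in the container.
exec_form = [sys.executable, "-c", 'print("SUCCESS")']
result = subprocess.run(exec_form, capture_output=True, text=True, check=True)
print(result.stdout.strip())  # SUCCESS
```

In the compose file, the corresponding list form would keep the whole Python snippet as a single item after `"python"` and `"-c"`, rather than spreading it over several items.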