How to run tensorflow with gpu support in docker-compose?

I want to create a neural network in tensorflow 2.x that trains on a GPU, and I want to set up all the necessary infrastructure inside a docker-compose network (assuming that this is actually possible for now). As far as I know, in order to train a tensorflow model on a GPU, I need the CUDA toolkit and the NVIDIA driver. Installing these dependencies natively on my computer (OS: Ubuntu 18.04) is always quite a pain, as there are many version dependencies between tensorflow, CUDA and the NVIDIA driver. So I tried to create a docker-compose file that contains a service for tensorflow, CUDA and the NVIDIA driver, but I am getting the following error:

# Start the services
sudo docker-compose -f docker-compose-test.yml up --build

Starting vw_image_cls_nvidia-driver_1 ... done
Starting vw_image_cls_nvidia-cuda_1   ... done
Recreating vw_image_cls_tensorflow_1  ... error

ERROR: for vw_image_cls_tensorflow_1  Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown

ERROR: for tensorflow  Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: Encountered errors while bringing up the project.

My docker-compose file looks as follows:

# version 2.3 is required for NVIDIA runtime
version: '2.3'

services:
  nvidia-driver:
    # NVIDIA GPU driver used by the CUDA Toolkit
    image: nvidia/driver:440.33.01-ubuntu18.04
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
    # Do we need this volume to make the driver accessible by other containers in the network?
      - nvidia_driver:/usr/local/nvidai/:ro  # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
    networks:
      - net

  nvidia-cuda:
    depends_on:
      - nvidia-driver
    image: nvidia/cuda:10.1-base-ubuntu18.04
    volumes:
    # Do we need the driver volume here?
     - nvidia_driver:/usr/local/nvidai/:ro  # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
     # Do we need to create an additional volume for this service to be accessible by the tensorflow service?
    devices:
      # Do we need to list the devices here, or only in the tensorflow service. Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
    networks:
      - net

  tensorflow:
    image: tensorflow/tensorflow:2.0.1-gpu  # Does this ship with cuda10.0 installed or do I need a separate container for it?
    runtime: nvidia
    restart: always
    privileged: true
    depends_on:
      - nvidia-cuda
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Volumes related to source code and config files
      - ./src:/src
      - ./configs:/configs
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro  # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need an additional volume from the nvidia-cuda service?
    command: import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print("SUCCESS")
    devices:
      # Devices listed here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
      - /dev/nvidia-uvm-tools
    networks:
      - net

volumes:
  nvidia_driver:

networks:
  net:
    driver: bridge

And my /etc/docker/daemon.json file looks as follows:

{"default-runtime":"nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

So, it seems like the error is somehow related to configuring the nvidia runtime, but more importantly, I am almost certain that I didn't set up my docker-compose file correctly. So, my questions are:

  1. Is it actually possible to do what I am trying to do?
  2. If yes, did I set up my docker-compose file correctly (see comments in docker-compose.yml)?
  3. How do I fix the error message I received above?

Thank you very much for your help, I highly appreciate it.

Baggy answered 26/2, 2020 at 16:16 Comment(8)
I have not done this, but... You need to use the -gpu flag on the docker image, see hub.docker.com/r/tensorflow/tensorflow and the NVIDIA Container Toolkit (github.com/NVIDIA/nvidia-docker/blob/master/README.md) – Elfreda
Hi DazWilkin, thanks for your comment. As far as I understood, you can use the --gpu flag when executing docker run ..., but how would you do this when running docker-compose up? According to the documentation of docker-compose up, there is no --gpu... – Entreaty
Docker-Compose is effectively doing the docker run ... for you. You may provide arguments to a container in Compose using command: at the same level as image:, environment:, etc. You would have command: and then below it - --gpu. NB that's a single hyphen to indicate an array item for command and then the double hyphen preceding gpu. Alternatively (but messy) you can mix JSON with the YAML and write: command: ["--gpu"] – Elfreda
Hang on: github.com/docker/compose/issues/6691 – Elfreda
Hi DazWin, thanks for your comment. Unfortunately, your suggestion appears to work for docker-compose versions 3.x (at least it did for 3.7), but not for version 2.3, which I think I am supposed to be using. So I adjusted the command for the tensorflow service as follows: command: ["/bin/sh -c", "--gpus all python", "import tensorflow as tf", "print(tf.reduce_sum(tf.random.normal([1000, 1000])))"]. Is this what you mean? Unfortunately, I cannot test this right now... – Entreaty
Just to reiterate, while I'm familiar with Docker (Compose), I've not used it with Tensorflow, so your mileage may vary :-) If you can bump to Compose 3.x, that would of course be good (generally). I misread the DockerHub documentation; it appears "-gpu" is part of the image tag (not a flag, my apologies). So I think what you had before for the image is correct, and then, omitting that from command, it should work: command: ["/bin/sh -c", "python", "import tensorflow as tf", "print(tf.reduce_sum(tf.random.normal([1000, 1000])))"]. You may be able to use command: ["python","-c","import ...."] – Elfreda
Thank you very much, I will try that tomorrow! – Entreaty
For docker-compose version 2.3 I think you can use the runtime option, so runtime: nvidia, along with the environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. This was removed in later docker-compose file formats, so in v3+ there seems to be a debate about how to support nvidia gpus. – Hierarchy
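
To summarize the comment thread: the command in the question's compose file is executed directly as an executable (hence the "import": executable file not found error), so the Python one-liner has to be handed to the python interpreter instead. A rough, untested sketch based on the comments above, for the tensorflow service:

command: ["python", "-c", "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print('SUCCESS')"]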

I agree that installing all tensorflow-gpu dependencies is rather painful. Fortunately, it's rather easy with Docker, as you only need the NVIDIA driver and the NVIDIA Container Toolkit (a sort of a plugin). The rest (CUDA, cuDNN) ships inside the Tensorflow images, so you don't need them on the Docker host.

The driver can be deployed as a container too, but I do not recommend that for a workstation. It is meant to be used on servers where there is no GUI (X-server, etc.). The subject of a containerized driver is covered at the end of this post; for now, let's see how to start tensorflow-gpu with docker-compose. The process is the same regardless of whether you have the driver in a container or not.
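
For reference, installing the NVIDIA Container Toolkit on an Ubuntu 18.04 Docker host looked roughly like this at the time of writing (commands adapted from the nvidia-docker README; check the current NVIDIA documentation before running them):

# add the nvidia-docker repository for this distribution
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# install the toolkit and restart Docker so it picks up the new runtime
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker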

How to launch Tensorflow-GPU with docker-compose

Prerequisites:

To enable GPU support for a container you need to create the container with NVIDIA Container Toolkit. There are two ways you can do that:

  1. You can configure Docker to always use the nvidia container runtime. This is fine to do, as it behaves just like the default runtime unless some NVIDIA-specific environment variables are present (more on that later). It is done by placing "default-runtime": "nvidia" into Docker's daemon.json and restarting the Docker daemon afterwards:

/etc/docker/daemon.json:

{
  "runtimes": {
      "nvidia": {
          "path": "/usr/bin/nvidia-container-runtime",
          "runtimeArgs": []
      }
  },
  "default-runtime": "nvidia"
}
  2. You can select the runtime during container creation. With docker-compose it is only possible with format version 2.3.

Here is a sample docker-compose.yml to launch Tensorflow with GPU:

version: "2.3"  # the only version where 'runtime' option is supported

services:
  test:
    image: tensorflow/tensorflow:2.3.0-gpu
    # Make Docker create the container with NVIDIA Container Toolkit
    # You don't need it if you set 'nvidia' as the default runtime in
    # daemon.json.
    runtime: nvidia
    # the lines below are here just to test that TF can see GPUs
    entrypoint:
      - /usr/local/bin/python
      - -c
    command:
      - "import tensorflow as tf; tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)"

By running this with docker-compose up you should see a line with the GPU specs in it. It appears at the end and looks like this:

test_1 | 2021-01-23 11:02:46.500189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 1624 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)

And that is all you need to launch an official Tensorflow image with GPU.
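
As a quick sanity check that the driver and the nvidia runtime are wired up correctly (independent of Tensorflow), you can also run nvidia-smi in a plain CUDA container; if everything works it prints the usual GPU table:

docker run --rm --runtime=nvidia nvidia/cuda:10.1-base-ubuntu18.04 nvidia-smi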

NVIDIA Environment Variables and custom images

As I mentioned, the nvidia runtime behaves just like the default runtime unless some NVIDIA-specific environment variables are present. These are listed and explained here. You only need to care about them if you build a custom image and want to enable GPU support in it. Official Tensorflow GPU images inherit them from the CUDA images they use as a base, so you only need to start the image with the right runtime, as in the example above.
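
For illustration, a custom image that does not have those variables baked in could be given GPU access from a compose file by setting them explicitly. A minimal sketch (my-custom-image is a placeholder):

services:
  my-app:
    image: my-custom-image   # placeholder for your own image
    runtime: nvidia
    environment:
      # make all GPUs visible and request basic compute capabilities
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility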

If you are interested in customising a Tensorflow image, I wrote another post on that.

Host Configuration for NVIDIA driver in container

As mentioned in the beginning, this is not something you want on a workstation. The process requires you to start the driver container while no other display driver is loaded (that is, via SSH, for example). Furthermore, at the time of writing only Ubuntu 16.04, Ubuntu 18.04 and CentOS 7 were supported.

There is an official guide and below are extractions from it for Ubuntu 18.04.

  1. Edit 'root' option in NVIDIA Container Toolkit settings:
sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
  2. Disable the Nouveau driver modules:
sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
  && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
  && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"

If you are using an AWS kernel, ensure that the i2c_core kernel module is enabled:

sudo tee /etc/modules-load.d/ipmi.conf <<< "i2c_core"
  3. Update the initramfs:
sudo update-initramfs -u

Now it's time to reboot for the changes to take effect. After the reboot, check that no nouveau or nvidia modules are loaded. The commands below should return nothing:

lsmod | grep nouveau
lsmod | grep nvidia

Starting driver in container

The guide offers a plain docker command to run the driver, but I prefer docker-compose. Save the following as driver.yml:

version: "3.0"
services:
  driver:
    image: nvidia/driver:450.80.02-ubuntu18.04
    privileged: true
    restart: unless-stopped
    volumes:
    - /run/nvidia:/run/nvidia:shared
    - /var/log:/var/log
    pid: "host"
    container_name: nvidia-driver

Use docker-compose -f driver.yml up -d to start the driver container. It will take a couple of minutes to compile the modules for your kernel. You can use docker logs nvidia-driver -f to follow the process and wait for the 'Done, now waiting for signal' line to appear, or use lsmod | grep nvidia to see whether the driver modules are loaded. When it's ready you should see something like this:

nvidia_modeset       1183744  0
nvidia_uvm            970752  0
nvidia              19722240  17 nvidia_uvm,nvidia_modeset
Slacken answered 23/1, 2021 at 11:17 Comment(0)

Docker Compose v1.27.0+

As of 2022, with the version 3.x file format, GPUs can be requested through device reservations:

version: "3.6"
services:

  jupyter-8888:
    image: "tensorflow/tensorflow:latest-gpu-jupyter"
    env_file: "env-file"
    deploy:
      resources:
        reservations:
          devices:
          - driver: "nvidia"
            device_ids: ["0"]
            capabilities: [gpu]
    ports:
      - 8880:8888
    volumes:
      - workspace:/workspace
      - data:/data

If you want to specify particular GPU ids, e.g. 0 and 3:

device_ids: ['0', '3']
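
Alternatively, to reserve every GPU on the host, the same device reservation accepts a count instead of explicit ids (a small variation on the example above):

deploy:
  resources:
    reservations:
      devices:
      - driver: "nvidia"
        count: all
        capabilities: [gpu]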
Featurelength answered 26/3, 2022 at 20:2 Comment(0)

Managed to get it working by installing WSL2 on my Windows machine in order to use VS Code along with the Remote-Containers extension. Here is a collection of articles that helped a lot with the installation of WSL2 and using VS Code from within it:

With the Remote-Containers extension of VS Code, you can then set up your devcontainer based on a docker-compose file (or just a Dockerfile, as I did), which is probably better explained in the third link above. One thing for myself to remember is that when defining the .devcontainer.json file you need to make sure to set

// Optional arguments passed to ``docker run ... ``
    "runArgs": [
        "--gpus", "all"
    ]
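
For context, a minimal .devcontainer.json built around an official Tensorflow GPU image could look roughly like this (the name is arbitrary and the image is just an example; adjust both to your setup):

{
    // hypothetical minimal devcontainer; see the Remote-Containers docs for all options
    "name": "tf-gpu",
    "image": "tensorflow/tensorflow:latest-gpu",
    // Optional arguments passed to ``docker run ...``
    "runArgs": ["--gpus", "all"]
}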

Before VS Code, I had used PyCharm for a long time, so switching to VS Code was quite a pain at first, but VS Code along with WSL2, the Remote-Containers extension, and the Pylance extension has made it quite easy to develop in a container with GPU support. As far as I know, PyCharm doesn't support debugging inside a container in WSL at the moment.

Baggy answered 22/1, 2021 at 19:5 Comment(0)
