Docker xserver for NVIDIA opengl application (without X in host)

I am trying to create a Docker image that runs an X server using an NVIDIA GPU for headless OpenGL applications (e.g. generating textures, running Unity3D without a screen, etc.). The host does not run an X server; I want to do everything inside the container.

I am using this Dockerfile for the image:

FROM ubuntu:18.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt update && \
    apt install -y \
        libglvnd0 \
        libgl1 \
        libglx0 \
        libegl1 \
        libgles2 \
        xserver-xorg-video-nvidia-440

COPY xorg.conf.nvidia-headless /etc/X11/xorg.conf

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES graphics
ENV DISPLAY :1

ENTRYPOINT ["/bin/bash"]

For xorg.conf.nvidia-headless I created this with nvidia-xconfig:

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
EndSection

Section "Files"
EndSection

Section "Module"
    Load           "dbe"
    Load           "extmod"
    Load           "type1"
    Load           "freetype"
    Load           "glx"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "UseDisplayDevice" "None"
    SubSection     "Display"
        Virtual     1920 1080
        Depth       24
    EndSubSection
EndSection

I run the container with --privileged and --gpus all using nvidia-docker, sharing the device with --device=/dev/dri/card0. Inside the container I can run nvidia-smi perfectly. Once the container is running, I start an X server with:

Xorg -noreset +extension GLX +extension RANDR +extension RENDER -logfile ./xserver.log vt1 :1
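
(For reference, the full docker run invocation with the flags described above looks roughly like the following; the image name headless-gl is only a placeholder.)

docker run --privileged --gpus all \
    --device=/dev/dri/card0 \
    --rm -it headless-gl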

But Xorg fails with this error:

(EE) 
Fatal server error:
(EE) no screens found(EE) 
(EE) 

This is the complete log:

X.Org X Server 1.19.6
Release Date: 2017-12-20
[  1296.109] X Protocol Version 11, Revision 0
[  1296.109] Build Operating System: Linux 4.4.0-168-generic x86_64 Ubuntu
[  1296.109] Current Operating System: Linux ubuntu 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64
[  1296.109] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-112-generic root=UUID=8f2dc01d-1666-4abd-9bd1-cfe0a20afdf1 ro splash quiet vt.handoff=1
[  1296.109] Build Date: 14 November 2019  06:20:00PM
[  1296.109] xorg-server 2:1.19.6-1ubuntu4.4 (For technical support please see http://www.ubuntu.com/support) 
[  1296.109] Current version of pixman: 0.34.0
[  1296.109]    Before reporting problems, check http://wiki.x.org
    to make sure that you have the latest version.
[  1296.109] Markers: (--) probed, (**) from config file, (==) default setting,
    (++) from command line, (!!) notice, (II) informational,
    (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[  1296.110] (++) Log file: "./xserver.log", Time: Wed Aug 19 08:38:46 2020
[  1296.110] (==) Using config file: "/etc/X11/xorg.conf"
[  1296.110] (==) Using system config directory "/usr/share/X11/xorg.conf.d"
[  1296.111] (==) ServerLayout "Layout0"
[  1296.111] (**) |-->Screen "Screen0" (0)
[  1296.111] (**) |   |-->Monitor "Monitor0"
[  1296.112] (**) |   |-->Device "Device0"
[  1296.112] (**) |-->Input Device "Keyboard0"
[  1296.112] (**) |-->Input Device "Mouse0"
[  1296.112] (==) Automatically adding devices
[  1296.112] (==) Automatically enabling devices
[  1296.112] (==) Automatically adding GPU devices
[  1296.112] (==) Automatically binding GPU devices
[  1296.112] (==) Max clients allowed: 256, resource mask: 0x1fffff
[  1296.114] (WW) The directory "/usr/share/fonts/X11/cyrillic" does not exist.
[  1296.114]    Entry deleted from font path.
[  1296.114] (WW) The directory "/usr/share/fonts/X11/100dpi/" does not exist.
[  1296.114]    Entry deleted from font path.
[  1296.114] (WW) The directory "/usr/share/fonts/X11/75dpi/" does not exist.
[  1296.114]    Entry deleted from font path.
[  1296.114] (WW) The directory "/usr/share/fonts/X11/Type1" does not exist.
[  1296.114]    Entry deleted from font path.
[  1296.114] (WW) The directory "/usr/share/fonts/X11/100dpi" does not exist.
[  1296.114]    Entry deleted from font path.
[  1296.114] (WW) The directory "/usr/share/fonts/X11/75dpi" does not exist.
[  1296.114]    Entry deleted from font path.
[  1296.114] (==) FontPath set to:
    /usr/share/fonts/X11/misc,
    built-ins
[  1296.114] (==) ModulePath set to "/usr/lib/xorg/modules"
[  1296.114] (WW) Hotplugging is on, devices using drivers 'kbd', 'mouse' or 'vmmouse' will be disabled.
[  1296.114] (WW) Disabling Keyboard0
[  1296.114] (WW) Disabling Mouse0
[  1296.115] (II) Loader magic: 0x55dca9edc020
[  1296.115] (II) Module ABI versions:
[  1296.115]    X.Org ANSI C Emulation: 0.4
[  1296.115]    X.Org Video Driver: 23.0
[  1296.115]    X.Org XInput driver : 24.1
[  1296.115]    X.Org Server Extension : 10.0
[  1296.116] (EE) dbus-core: error connecting to system bus: org.freedesktop.DBus.Error.FileNotFound (Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory)
[  1296.116] (++) using VT number 1

[  1296.116] (II) systemd-logind: logind integration requires -keeptty and -keeptty was not provided, disabling logind integration
[  1296.116] (II) xfree86: Adding drm device (/dev/dri/card0)
[  1296.119] (**) OutputClass "nvidia" ModulePath extended to "/usr/lib/x86_64-linux-gnu/nvidia/xorg,/usr/lib/xorg/modules"
[  1296.122] (--) PCI:*(0:1:0:0) 10de:100c:1043:84b7 rev 161, Mem @ 0xf9000000/16777216, 0xd0000000/134217728, 0xd8000000/33554432, I/O @ 0x0000e000/128, BIOS @ 0x????????/131072
[  1296.122] (II) LoadModule: "glx"
[  1296.123] (II) Loading /usr/lib/xorg/modules/extensions/libglx.so
[  1296.131] (EE) Failed to load /usr/lib/xorg/modules/extensions/libglx.so: /usr/lib/xorg/modules/extensions/libglx.so: undefined symbol: glxServer
[  1296.131] (II) UnloadModule: "glx"
[  1296.131] (II) Unloading glx
[  1296.131] (EE) Failed to load module "glx" (loader failed, 7)
[  1296.131] (II) LoadModule: "nvidia"
[  1296.131] (II) Loading /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
[  1296.138] (II) Module nvidia: vendor="NVIDIA Corporation"
[  1296.139]    compiled for 1.6.99.901, module version = 1.0.0
[  1296.139]    Module class: X.Org Video Driver
[  1296.140] (II) NVIDIA dlloader X Driver  440.100  Fri May 29 08:21:27 UTC 2020
[  1296.140] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[  1296.141] (II) Loading sub module "fb"
[  1296.141] (II) LoadModule: "fb"
[  1296.141] (II) Loading /usr/lib/xorg/modules/libfb.so
[  1296.143] (II) Module fb: vendor="X.Org Foundation"
[  1296.143]    compiled for 1.19.6, module version = 1.0.0
[  1296.143]    ABI class: X.Org ANSI C Emulation, version 0.4
[  1296.143] (II) Loading sub module "wfb"
[  1296.143] (II) LoadModule: "wfb"
[  1296.143] (II) Loading /usr/lib/xorg/modules/libwfb.so
[  1296.144] (II) Module wfb: vendor="X.Org Foundation"
[  1296.144]    compiled for 1.19.6, module version = 1.0.0
[  1296.144]    ABI class: X.Org ANSI C Emulation, version 0.4
[  1296.144] (II) Loading sub module "ramdac"
[  1296.144] (II) LoadModule: "ramdac"
[  1296.144] (II) Module "ramdac" already built-in
[  1296.145] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[  1296.145] (EE) NVIDIA:     system's kernel log for additional error messages and
[  1296.145] (EE) NVIDIA:     consult the NVIDIA README for details.
[  1296.145] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[  1296.145] (EE) NVIDIA:     system's kernel log for additional error messages and
[  1296.145] (EE) NVIDIA:     consult the NVIDIA README for details.
[  1296.145] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[  1296.145] (EE) NVIDIA:     system's kernel log for additional error messages and
[  1296.145] (EE) NVIDIA:     consult the NVIDIA README for details.
[  1296.145] (EE) No devices detected.
[  1296.145] (II) Applying OutputClass "nvidia" to /dev/dri/card0
[  1296.145]    loading driver: nvidia
[  1296.145] (==) Matched nvidia as autoconfigured driver 0
[  1296.145] (==) Matched nouveau as autoconfigured driver 1
[  1296.145] (==) Matched nouveau as autoconfigured driver 2
[  1296.145] (==) Matched modesetting as autoconfigured driver 3
[  1296.145] (==) Matched fbdev as autoconfigured driver 4
[  1296.145] (==) Matched vesa as autoconfigured driver 5
[  1296.145] (==) Assigned the driver to the xf86ConfigLayout
[  1296.145] (II) LoadModule: "nvidia"
[  1296.145] (II) Loading /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
[  1296.145] (II) Module nvidia: vendor="NVIDIA Corporation"
[  1296.145]    compiled for 1.6.99.901, module version = 1.0.0
[  1296.145]    Module class: X.Org Video Driver
[  1296.145] (II) UnloadModule: "nvidia"
[  1296.145] (II) Unloading nvidia
[  1296.145] (II) Failed to load module "nvidia" (already loaded, 21980)
[  1296.145] (II) LoadModule: "nouveau"
[  1296.146] (WW) Warning, couldn't open module nouveau
[  1296.146] (II) UnloadModule: "nouveau"
[  1296.146] (II) Unloading nouveau
[  1296.146] (EE) Failed to load module "nouveau" (module does not exist, 0)
[  1296.146] (II) LoadModule: "modesetting"
[  1296.146] (II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
[  1296.147] (II) Module modesetting: vendor="X.Org Foundation"
[  1296.147]    compiled for 1.19.6, module version = 1.19.6
[  1296.147]    Module class: X.Org Video Driver
[  1296.147]    ABI class: X.Org Video Driver, version 23.0
[  1296.147] (II) LoadModule: "fbdev"
[  1296.147] (WW) Warning, couldn't open module fbdev
[  1296.147] (II) UnloadModule: "fbdev"
[  1296.147] (II) Unloading fbdev
[  1296.147] (EE) Failed to load module "fbdev" (module does not exist, 0)
[  1296.147] (II) LoadModule: "vesa"
[  1296.147] (WW) Warning, couldn't open module vesa
[  1296.147] (II) UnloadModule: "vesa"
[  1296.147] (II) Unloading vesa
[  1296.147] (EE) Failed to load module "vesa" (module does not exist, 0)
[  1296.147] (II) NVIDIA dlloader X Driver  440.100  Fri May 29 08:21:27 UTC 2020
[  1296.147] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[  1296.147] (II) modesetting: Driver for Modesetting Kernel Drivers: kms
[  1296.147] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
[  1296.147] (WW) xf86OpenConsole: setsid failed: Operation not permitted
[  1296.147] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[  1296.147] (EE) NVIDIA:     system's kernel log for additional error messages and
[  1296.147] (EE) NVIDIA:     consult the NVIDIA README for details.
[  1296.147] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[  1296.147] (EE) NVIDIA:     system's kernel log for additional error messages and
[  1296.147] (EE) NVIDIA:     consult the NVIDIA README for details.
[  1296.147] (WW) Falling back to old probe method for modesetting
[  1296.147] (EE) Screen 0 deleted because of no matching config section.
[  1296.147] (II) UnloadModule: "modesetting"
[  1296.147] (EE) Device(s) detected, but none match those in the config file.
[  1296.147] (EE) 
Fatal server error:
[  1296.147] (EE) no screens found(EE) 
[  1296.147] (EE) 
Please consult the The X.Org Foundation support 
     at http://wiki.x.org
 for help. 
[  1296.147] (EE) Please also check the log file at "./xserver.log" for additional information.
[  1296.147] (EE) 
[  1296.149] (EE) Server terminated with error (1). Closing log file.

Could anyone help me with this? It will run on a headless machine with an NVIDIA GPU.

Candlelight answered 19/8, 2020 at 8:43 Comment(2)
This may be relevant, specifically the EDID bit: serverfault.com/a/300550 - Noctambulism
Did you ever get this working? I am currently working on this. Thanks for posting the question. - Thunderclap

Update June 20th 2023:

I discovered that, specifically for deploying on Google Cloud GKE clusters with GPUs, a very minimal xorg.conf that points Xorg to where the NVIDIA driver modules are located lets the NVIDIA driver probe the hardware configuration of your displays itself, and it works fine even with no displays at all. For GKE clusters I made the following script:

#!/bin/bash
# Query the PCI BusID of the GPU (nvidia-xconfig comes from the host-mounted driver path)
GPU_PCI=$(nvidia-xconfig --query-gpu-info | grep BusID | awk '{print $4}')

# Write a minimal xorg.conf: module paths plus the GPU device, no Screen or Monitor sections
cat <<EOT > /etc/X11/xorg.conf
Section "Files"
    ModulePath      "/usr/lib/xorg/modules"
    ModulePath      "/usr/local/nvidia"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla T4"
    BusID          "$GPU_PCI"
EndSection
EOT

# Start the headless X server in the background on $DISPLAY
Xorg -noreset +extension GLX +extension RANDR +extension RENDER -logfile ./xserver.log vt1 $DISPLAY &

I then call vxorg in my container init script; it generates the xorg.conf, starts my fake X server, and allows OpenGL to fully use the GPU completely headless. We were only using Tesla T4 GPUs so the BoardName is hardcoded, but you can make that dynamic with the nvidia-xconfig command as well (this command should be installed on the host only and its path mounted into the container).
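
For context, a minimal container init script along these lines might look like the sketch below; the path /usr/local/bin/vxorg and the final exec of the application command are my assumptions, not part of the original setup.

#!/bin/bash
# Hypothetical entrypoint: generate xorg.conf and start the headless X server
set -e

export DISPLAY=:1

# vxorg is the script above: it writes /etc/X11/xorg.conf and launches Xorg in the background
/usr/local/bin/vxorg

# Give Xorg a moment to come up before the application connects to $DISPLAY
sleep 2

# Hand control to whatever command the container was asked to run
exec "$@"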

The important bits are: adding /usr/local/nvidia as a ModulePath, since that is where the NVIDIA drivers live (this should be a path mounted from the host; the NVIDIA drivers SHOULD NOT be installed in the container); specifying the GPU BusID and BoardName (I'm not sure why this is required, but without it I get the mentioned "no screens found" error); and leaving all of the screen and monitor info totally unspecified (this lets the NVIDIA driver decide what's best with the available hardware).

This should work for containers running locally on Linux machines as well, as long as you mount your NVIDIA drivers and specify their path in the xorg.conf. I was unable to get this configuration working under WSL, which is a shame because it makes much better use of the GPU: I was getting much higher FPS values from glxgears on GKE than I did with the configuration below, which does work on WSL.

Script for configuring a screen per GPU:

#!/bin/bash
set -eu
GPU_PCI=$( \
    PATH=${PATH-}:/usr/local/nvidia/bin \
    LD_LIBRARY_PATH=${LD_LIBRARY_PATH-}:/usr/local/nvidia/lib64 \
    nvidia-xconfig --enable-all-gpus --query-gpu-info | grep BusID | awk '{print $4}' \
)

if [ -z "$GPU_PCI" ]; then
    echo "Failed to get the BusID of an NVIDIA GPU, can't generate X.Org conf file xorg.conf" >&2
    exit 1
fi

cat <<EOT > /etc/X11/xorg.conf
# https://mcmap.net/q/1977269/-docker-xserver-for-nvidia-opengl-application-without-x-in-host#75115356

Section "Files"
    ModulePath      "/usr/lib/xorg/modules"
    ModulePath      "/usr/local/nvidia"
EndSection

Section "ServerLayout"
    Identifier     "Layout0"
EOT

counter=0
for pci in $GPU_PCI; do
cat <<EOT >> /etc/X11/xorg.conf
    Screen      $counter  "Screen$counter"
EOT
counter=$((counter + 1))
done

cat <<EOT >> /etc/X11/xorg.conf
EndSection
EOT

counter=0
for pci in $GPU_PCI; do
cat <<EOT >> /etc/X11/xorg.conf

Section "Device"
    Identifier      "Device$counter"
    Driver          "nvidia"
    BusID           "$pci"
EndSection

Section "Screen"
    Identifier      "Screen$counter"
    Device          "Device$counter"
EndSection
EOT
counter=$((counter + 1))
done
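
The flow for using it is roughly the following sketch; the script name generate-xorg-conf.sh is just what I call it here, not something from the original setup.

# Generate /etc/X11/xorg.conf with one Device/Screen pair per detected GPU
./generate-xorg-conf.sh

# Start the headless X server against the generated config
Xorg -noreset +extension GLX +extension RANDR +extension RENDER \
    -logfile ./xserver.log vt1 :1 &

# Point OpenGL applications at the new display
export DISPLAY=:1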

Original post (likely still relevant for WSL)

You're very close! My understanding is that you don't want the NVIDIA drivers inside your container. You just want the container to use the drivers that the host system already has installed (you need to install the drivers on the host system if it doesn't have them!).

So instead of installing xserver-xorg-video-nvidia-440, install xserver-xorg-video-dummy in your Dockerfile. Then change the Device section of xorg.conf to:

Section "Device"
    Identifier     "Device0"
    Driver         "dummy"
EndSection

And because the dummy driver doesn't support virtual displays, remove the Virtual line from your Display subsection and optionally set Modes instead:

    SubSection     "Display"
        Depth       24
        Modes       "1920x1080"
    EndSubSection

Then the real magic occurs with the docker run command. You have to mount your host system's CUDA libraries and set the linker path:

docker run -v /usr/lib/wsl:/usr/lib/wsl -e LD_LIBRARY_PATH=/usr/lib/wsl/lib --device=/dev/dri/card0 --gpus all --rm -it <your-image-name>
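
Once inside the container, a quick sanity check that rendering actually hits the GPU looks roughly like this; it assumes mesa-utils is installed for glxinfo, which is not part of the Dockerfile above.

# Start the X server against the dummy xorg.conf
Xorg -noreset +extension GLX +extension RANDR +extension RENDER \
    -logfile ./xserver.log vt1 :1 &
sleep 2

# The renderer string should mention the NVIDIA GPU (via the D3D12 driver under WSL), not llvmpipe
DISPLAY=:1 glxinfo | grep "OpenGL renderer"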

I am using the NVIDIA GPU on my Windows 11 machine under WSL2 and see glorious GPU rendering with Open3D's OffscreenRenderer, which as of yet doesn't support truly headless rendering with EGL.

# nvidia-smi
Fri Jan 13 18:05:59 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 527.92.01    Driver Version: 528.02       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   55C    P8     4W /  40W |    145MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        21      G   /Xwayland                       N/A      |
|    0   N/A  N/A        21      G   /Xwayland                       N/A      |
|    0   N/A  N/A        23      G   /Xwayland                       N/A      |
|    0   N/A  N/A        29      G   /Xorg                           N/A      |
|    0   N/A  N/A        51    C+G   /python3.7                      N/A      |
+-----------------------------------------------------------------------------+

(The Xorg process is from the command you provided, python3.7 is my Open3D script using some OpenGL rendering enabled by said Xorg process, and the Xwayland processes are WSL components that allow running Linux GUI applications.)

I haven't tested this on Ubuntu, but I believe the relevant path is /usr/local/nvidia/lib64 (at least that's what it is on the GKE machines I use; see https://cloud.google.com/kubernetes-engine/docs/how-to/gpus). I don't have an Ubuntu box with a GPU handy, otherwise I would try to be more helpful.

Note: if you're using more recent versions of Mesa for your OpenGL under WSLg on Windows, you'll need to instruct Mesa to choose the GPU you want. If you only have one NVIDIA GPU you can just add

ENV MESA_D3D12_DEFAULT_ADAPTER_NAME NVIDIA

to your Dockerfile. See https://github.com/microsoft/wslg/wiki/GPU-selection-in-WSLg for more details.

Lap answered 14/1, 2023 at 2:23 Comment(1)
This was wildly useful. Thanks! I expanded the non-WSL2 example with some virtual resolution setting: github.com/nelsonjchen/docker-3d-accel-experiment/blob/main/… - Schweinfurt

First things first: if you want headless OpenGL, do not use an X server!

It's been years since an X server was required to talk to the GPU. You can do headless rendering just fine without one. NVIDIA has a nice article on how to do it: https://developer.nvidia.com/blog/egl-eye-opengl-visualization-without-x-server/

The gist is that you use EGL to set up a context and make it current without a surface by calling eglMakeCurrent(eglDpy, EGL_NO_SURFACE, EGL_NO_SURFACE, eglCtx).

You will still need the NVIDIA driver built for Xorg, since it also carries all the offscreen stuff, but there's an important caveat: the NVIDIA userland driver must match the host system's NVIDIA kernel module version. If you wrap the driver up in a Docker image you're essentially tying that image to the particular kernel module version on the host system. Not a desirable situation. Instead you should configure your Docker image to bind the driver and OpenGL implementation libraries from the host system. Unfortunately there's no universal placement of those libraries and drivers, which means it takes a little more effort to pull them all in reliably. But despair not, NVIDIA already did the work for you:

https://gitlab.com/nvidia/container-images/opengl
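
In practice that means basing your image on (or copying the approach of) NVIDIA's glvnd runtime images and letting the NVIDIA container runtime inject the driver libraries at run time. A minimal smoke test, assuming the nvidia/opengl:1.2-glvnd-runtime-ubuntu18.04 tag is available, looks roughly like this:

# The container runtime mounts the host's driver libraries into the container,
# so the image only ships the glvnd dispatch layer, not the driver itself.
docker run --gpus all \
    -e NVIDIA_DRIVER_CAPABILITIES=graphics,utility \
    --rm -it nvidia/opengl:1.2-glvnd-runtime-ubuntu18.04 bash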

Also, for setting up the off-screen context reliably it helps to unset the DISPLAY variable: since NVIDIA built all their Vulkan and EGL stuff on top of the Xorg driver, there are some code paths that evaluate that variable, and unsetting it helps nudge them in the right direction. So inside your program, before setting up the OpenGL context, do an unsetenv("DISPLAY").
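
If you can't modify the program itself, you can instead clear the variable from the outside when launching it; a minimal sketch, where the binary name is just a placeholder:

# Launch the EGL/OpenGL application with DISPLAY removed from its environment
env -u DISPLAY ./my_headless_app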

Circularize answered 19/8, 2020 at 8:58 Comment(2)
In this case, you are assuming that I am coding the OpenGL stuff in my own program, which is not the situation. I want to use some released apps that use OpenGL internally; there is nothing I can change. Regarding the NVIDIA OpenGL image, it contains essentially the same as my Dockerfile: it uses glvnd to redirect the OpenGL calls to the NVIDIA drivers. - Candlelight
@DTSED: Well, StackOverflow is about programming, hence the assumption. However, if your question is not about programming but about existing software, then it is off-topic for StackOverflow and should go to a different section of the StackExchange network. I'm close-marking your question for migration :-) - Circularize
