I have a server (Ubuntu 16.04) with 4 GPUs. My team shares it, and our current approach is to containerize all of our work with Docker and to restrict containers to specific GPUs with something like `$ NV_GPU=0 nvidia-docker run -ti nvidia/cuda nvidia-smi`. This works well when we're all very clear about who's using which GPU, but our team has grown and I'd like a more robust way of monitoring GPU use and prohibiting access to a GPU while it's in use. `nvidia-smi` is one channel of information via its "GPU-Util" column, but a GPU can show 0% GPU-Util at a given moment while it is actually reserved by someone working inside a container.
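For what it's worth, I can get a partial picture today by inspecting the devices that each running container was started with. The sketch below assumes nvidia-docker v1 behaviour (i.e. that `NV_GPU=N` shows up as a `/dev/nvidiaN` entry in the container's `HostConfig.Devices`); it tells me which container currently holds which GPU, but not which user launched it or whether an idle-looking GPU is still spoken for:

```python
#!/usr/bin/env python
# Rough sketch: map GPUs to the running containers that hold them, by looking
# at the /dev/nvidiaN devices that nvidia-docker (v1) passes through.
import json
import re
import subprocess

def gpu_map():
    """Return e.g. {'gpu0': 'container_name'} for every GPU held by a container."""
    mapping = {}
    container_ids = subprocess.check_output(['docker', 'ps', '-q']).decode().split()
    for cid in container_ids:
        info = json.loads(subprocess.check_output(['docker', 'inspect', cid]).decode())[0]
        name = info['Name'].lstrip('/')
        for dev in (info['HostConfig'].get('Devices') or []):
            match = re.match(r'/dev/nvidia(\d+)$', dev.get('PathOnHost', ''))
            if match:
                mapping['gpu' + match.group(1)] = name
    return mapping

if __name__ == '__main__':
    print(gpu_map())
```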
Do you have any recommendations for:
- Tracking when a user runs `$ NV_GPU='gpu_id' nvidia-docker run`
- Raising an error when another user runs `$ NV_GPU='same_gpu_id' nvidia-docker run`
- Keeping an updated log along the lines of `{'gpu0': 'user_name or free', ..., 'gpu3': 'user_name or free'}`, where each GPU is tagged with the user who started the Docker container currently holding it, or marked as 'free'. Ideally it would record both the user and the container tied to each GPU (a rough sketch of what I'm imagining follows this list).
- Updating the log when the user closes the container that is using that GPU
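To make the last three points concrete, here is a rough sketch of the kind of wrapper around `nvidia-docker run` that I'm imagining. The log path, the JSON layout, and the wrapper name are all placeholders of my own, not anything nvidia-docker provides:

```python
#!/usr/bin/env python
# Sketch of a wrapper around "NV_GPU=<id> nvidia-docker run ..." that keeps a
# shared reservation log and refuses to start on a GPU that is already taken.
# Hypothetical usage: gpu_run.py 0 -ti nvidia/cuda nvidia-smi
import fcntl
import getpass
import json
import os
import subprocess
import sys

LOG = '/var/lib/gpu-reservations.json'   # placeholder path for the shared log

def _load(f):
    f.seek(0)
    return json.loads(f.read() or '{}')

def _save(f, state):
    f.seek(0)
    f.truncate()
    f.write(json.dumps(state))

def run(gpu_id, docker_args):
    key = 'gpu%s' % gpu_id
    # Reserve the GPU (flock serializes concurrent wrapper invocations).
    with os.fdopen(os.open(LOG, os.O_RDWR | os.O_CREAT), 'r+') as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        state = _load(f)
        if state.get(key, 'free') != 'free':
            sys.exit('ERROR: %s is reserved by %s' % (key, state[key]))
        state[key] = getpass.getuser()
        _save(f, state)
    env = dict(os.environ, NV_GPU=str(gpu_id))
    try:
        # Run in the foreground so we notice when the container exits.
        subprocess.call(['nvidia-docker', 'run'] + list(docker_args), env=env)
    finally:
        # Mark the GPU free again once the container has exited.
        with os.fdopen(os.open(LOG, os.O_RDWR | os.O_CREAT), 'r+') as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            state = _load(f)
            state[key] = 'free'
            _save(f, state)

if __name__ == '__main__':
    run(sys.argv[1], sys.argv[2:])
```

Recording the container alongside the user would presumably just mean passing `--name` through the wrapper and storing it next to the username; the bigger gap is that this only covers people who actually launch through the wrapper rather than calling `nvidia-docker` directly.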
I may be thinking about this the wrong way too, so open to other ideas. Thanks!