Managing multiple GPUs with multiple users

I have a server (Ubuntu 16.04) with 4 GPUs. My team shares it, and our current approach is to containerize all of our work with Docker and to restrict containers to GPUs using something like $ NV_GPU=0 nvidia-docker run -ti nvidia/cuda nvidia-smi. This works well when we're all very clear about who's using which GPU, but our team has grown and I'd like a more robust way of monitoring GPU use and prohibiting access to GPUs that are already in use. nvidia-smi is one channel of information via its "GPU-Util" column, but a GPU can show 0% utilization at a given moment while still being reserved by someone working in a container.
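
For reference, this is roughly how we inspect what's actually running on a given GPU today (assuming an nvidia-smi recent enough to support these query flags). The catch is that an empty process list is not the same as "free": an idle container can still be holding the GPU.

    import subprocess

    def processes_on_gpu(gpu_id):
        """List the compute processes nvidia-smi reports for one GPU."""
        out = subprocess.check_output(
            ["nvidia-smi", "-i", gpu_id,
             "--query-compute-apps=pid,process_name,used_memory",
             "--format=csv,noheader"]).decode()
        return [line for line in out.splitlines() if line.strip()]

    # An empty list does not mean the GPU is free; a reserved but idle
    # container shows no compute processes here.
    print(processes_on_gpu("0"))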

Do you have any recommendations for:

  1. Tracking when a user runs $ NV_GPU='gpu_id' nvidia-docker run
  2. Throwing an error when another user runs $ NV_GPU='same_gpu_id' nvidia-docker run
  3. Keeping an updated log along the lines of {'gpu0': 'user_name or free', ..., 'gpu3': 'user_name or free'}, where each GPU maps either to the user who launched the Docker container currently using it, or to 'free'. Ideally it would record both the user and the container tied to each GPU.
  4. Updating the log when the user closes the container that is using the GPU (a rough sketch of what I'm imagining follows this list)
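
Here's the sketch I have in mind for items 1-4: a hypothetical run_on_gpu.py wrapper that everyone would launch containers through instead of calling nvidia-docker directly. It keeps a JSON ledger at a shared path (assumed here to be /var/lib/gpu-ledger.json), refuses to start if the requested GPU is already taken, and marks the GPU free again when the container exits. Completely untested; it only works if everybody goes through the wrapper.

    #!/usr/bin/env python3
    """Hypothetical wrapper: run_on_gpu.py GPU_ID IMAGE [CMD...]"""
    import fcntl
    import getpass
    import json
    import os
    import subprocess
    import sys

    LEDGER = "/var/lib/gpu-ledger.json"   # assumed shared, team-writable path

    def open_ledger():
        """Open the ledger read/write (creating it if needed) and take an exclusive lock."""
        fd = os.open(LEDGER, os.O_RDWR | os.O_CREAT, 0o666)
        f = os.fdopen(fd, "r+")
        fcntl.flock(f, fcntl.LOCK_EX)   # lock is released when the file is closed
        return f

    def read_ledger(f):
        f.seek(0)
        data = f.read()
        return json.loads(data) if data.strip() else {}

    def write_ledger(f, ledger):
        f.seek(0)
        f.truncate()
        json.dump(ledger, f, indent=2)

    def main():
        gpu, image, *cmd = sys.argv[1:]

        # Items 1 & 2: record the reservation, or fail loudly if the GPU is taken.
        with open_ledger() as f:
            ledger = read_ledger(f)
            if ledger.get(gpu, "free") != "free":
                sys.exit("GPU {} is in use: {}".format(gpu, ledger[gpu]))
            container_id = subprocess.check_output(
                ["nvidia-docker", "run", "-d", image] + cmd,
                env=dict(os.environ, NV_GPU=gpu)).decode().strip()
            # Item 3: the entry names both the user and the container.
            ledger[gpu] = {"user": getpass.getuser(), "container": container_id[:12]}
            write_ledger(f, ledger)

        try:
            # Block until the container exits.
            subprocess.run(["docker", "wait", container_id])
        finally:
            # Item 4: mark the GPU free again once the container is gone.
            with open_ledger() as f:
                ledger = read_ledger(f)
                ledger[gpu] = "free"
                write_ledger(f, ledger)

    if __name__ == "__main__":
        main()

The flock on the ledger file is there so that two people reserving at the same moment can't both grab the same GPU, and reading the JSON file directly gives the {'gpu0': ..., 'gpu3': ...} view described above.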

I may be thinking about this the wrong way too, so I'm open to other ideas. Thanks!

Hercules answered 14/6, 2017 at 14:40

Sounds like a great place to apply CI/CD practices. What you need is a job queue: each user requests the shared resources (the GPUs) by triggering a pipeline in some way, e.g. by pushing a commit to a specific branch. An automated system then allocates the GPUs in order, and everybody eventually gets their experiments run.
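
To make that concrete, here is a minimal sketch of the dispatcher side (GPU ids, user names and images are placeholders): one worker per GPU pulls jobs off a queue and runs each as a container pinned to that GPU with NV_GPU, exactly as in the question. In a real setup the jobs would be enqueued by the CI trigger rather than hard-coded.

    import os
    import queue
    import subprocess
    import threading

    GPUS = ["0", "1", "2", "3"]
    jobs = queue.Queue()          # each job: (user, image, command)

    def worker(gpu_id):
        """One worker per GPU: pull the next job and run it pinned to this GPU."""
        while True:
            user, image, cmd = jobs.get()
            print("GPU {}: running {} for {}".format(gpu_id, image, user))
            # NV_GPU restricts the container to this GPU, as in the question.
            subprocess.run(["nvidia-docker", "run", "--rm", image] + cmd,
                           env=dict(os.environ, NV_GPU=gpu_id))
            jobs.task_done()

    for gpu in GPUS:
        threading.Thread(target=worker, args=(gpu,), daemon=True).start()

    # Placeholder submissions; experiments queue up and run as GPUs free up.
    jobs.put(("alice", "nvidia/cuda", ["nvidia-smi"]))
    jobs.put(("bob", "nvidia/cuda", ["nvidia-smi"]))
    jobs.join()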

This is probably the most scalable way to do it, much more so than reservation calendars or ad hoc usage. The only option that scales further is buying compute in the cloud, but that is outside the scope of the OP's question.

Darelldarelle answered 21/3, 2023 at 9:43
