How does CUDA assign device IDs to GPUs?

When a computer has multiple CUDA-capable GPUs, each GPU is assigned a device ID. By default, CUDA kernels execute on device ID 0. You can use cudaSetDevice(int device) to select a different device.

Let's say I have two GPUs in my machine: a GTX 480 and a GTX 670. How does CUDA decide which GPU is device ID 0 and which GPU is device ID 1?


Ideas for how CUDA might assign device IDs (just brainstorming):

  • descending order of compute capability
  • PCI slot number
  • date/time when the device was added to the system (the most recently added device gets the highest ID number)

Motivation: I'm working on some HPC algorithms, and I'm benchmarking and autotuning them for several GPUs. My processor has enough PCIe lanes to drive cudaMemcpys to 3 GPUs at full bandwidth. So, instead of constantly swapping GPUs in and out of my machine, I'm planning to just keep 3 GPUs in my computer. I'd like to be able to predict what will happen when I add or replace some GPUs in the computer.

Biddle answered 8/12, 2012 at 20:42 Comment(0)

CUDA picks the fastest device as device 0. So when you swap GPUs in and out, the ordering might change completely. It might be better to pick GPUs by their PCI bus ID using:

cudaError_t cudaDeviceGetByPCIBusId ( int* device, char* pciBusId )
   Returns a handle to a compute device.

cudaError_t cudaDeviceGetPCIBusId ( char* pciBusId, int  len, int  device )
   Returns a PCI Bus Id string for the device.

or the CUDA Driver API equivalents, cuDeviceGetByPCIBusId and cuDeviceGetPCIBusId.

But IMO the most reliable way to know which device is which would be to use NVML or nvidia-smi to get each device's unique identifier (UUID) via nvmlDeviceGetUUID, and then match it to the CUDA device by PCI bus ID using nvmlDeviceGetPciInfo.
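As a rough illustration of the nvidia-smi route, here is a minimal Python sketch (assuming nvidia-smi is on the PATH; the query fields `index`, `uuid`, and `pci.bus_id` are standard nvidia-smi query-gpu fields) that builds a UUID-to-PCI-bus-ID table, which you could then match against the strings returned by cudaDeviceGetPCIBusId:

```python
import csv
import io
import subprocess

def parse_gpu_table(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into
    a list of dicts keyed by the header fields (e.g. 'uuid', 'pci.bus_id')."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (c.strip() for c in row)))
            for row in rows[1:] if row]

def query_gpus():
    """Ask nvidia-smi for each GPU's stable identifiers.
    Requires an NVIDIA driver to be installed."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,uuid,pci.bus_id", "--format=csv"],
        text=True)
    return parse_gpu_table(out)

if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu["uuid"], "->", gpu["pci.bus_id"])
```

Because the UUID is burned into the board, this mapping stays stable no matter how the runtime happens to enumerate the devices.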

Beggarweed answered 9/12, 2012 at 8:21 Comment(6)
By "fastest" do you mean in terms of clock speed?Biddle
Some heuristics are used to estimate the theoretical speed of the GPU. They take into account e.g. chip architecture, clock speed, and driver model (on Windows, TCC is preferred).Beggarweed
At the moment, I have 3 CUDA-capable GPUs in my machine: a GTX680, a GTX9800 (an ancient, slow GPU that I just use for graphics), and a C2050. Oddly, the GTX9800 gets a lower number than the C2050... strange.Biddle
Only the GPU with index 0 is guaranteed to be the fastest; the remaining indices are not sorted by speed. Does the GTX 9800 have index 0? If not, then everything is working as expected.Beggarweed
Nope, the GTX9800 doesn't have index 0. It makes more sense now.Biddle
In CUDA 8, there is an environment variable which allows you to modify the enumeration order of the CUDA runtime API.Floe

Set the environment variable CUDA_DEVICE_ORDER as:

export CUDA_DEVICE_ORDER=PCI_BUS_ID

Then the GPU IDs will be ordered by PCI bus ID.
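When you can't rely on the shell environment (e.g. when launching jobs from a script), one option is to inject the variable into the child process explicitly, so it is guaranteed to be set before the CUDA runtime in that process initializes. A small sketch (the helper name `run_with_pci_order` is made up for illustration):

```python
import os
import subprocess
import sys

def run_with_pci_order(cmd):
    """Launch `cmd` with CUDA_DEVICE_ORDER=PCI_BUS_ID in its environment,
    so the CUDA runtime in the child enumerates devices by PCI bus ID."""
    env = dict(os.environ, CUDA_DEVICE_ORDER="PCI_BUS_ID")
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

if __name__ == "__main__":
    # Demonstrate with a child Python process that just echoes the variable.
    result = run_with_pci_order(
        [sys.executable, "-c",
         "import os; print(os.environ['CUDA_DEVICE_ORDER'])"])
    print(result.stdout.strip())  # PCI_BUS_ID
```

The key point is that the variable is read when the CUDA runtime initializes in the child, so setting it afterwards has no effect.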

Mena answered 31/3, 2017 at 2:47 Comment(3)
With this set, the CUDA device IDs are consistent with nvidia-smi's output! IMO this is a must-set for machine learning on a multi-GPU machine.Landwehr
This did not work in my setup, I am on a user without admin rights on a Jupyter Notebook server, and I run !export ... at the beginning of the cell. I might have to run that with admin rights when I set up the server instead. But it works if I set it with %set_env CUDA_DEVICE_ORDER=PCI_BUS_ID or with import os and then os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID", see the other answer. Strangely, for CUDA_VISIBLE_DEVICES, I must run !export CUDA_VISIBLE_DEVICES='1,4' and not %set_env or os.environ().Spar

The best solution I have found (tested in tensorflow==2.3.0) is to add the following before anything that may import tensorflow:

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,3"  # specify which GPU(s) to be used

This way, the order in which TensorFlow enumerates the GPUs will match that reported by tools such as nvidia-smi or nvtop.

Counterman answered 22/9, 2020 at 11:18 Comment(3)
How does this in any way explain what order CUDA enumerates devices in, which is the question?Wideranging
Because the OP asked for "I'd like to be able to predict what will happen when I add or replace some GPUs in the computer" and my answer accomplishes just that.Counterman
I am on a user without admin rights on a Jupyter Notebook server, and export ... at the beginning of the cell does not work. But it works if I set it with %set_env CUDA_DEVICE_ORDER=PCI_BUS_ID or with import os and then os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID". Strangely, for CUDA_VISIBLE_DEVICES, I must run !export CUDA_VISIBLE_DEVICES='1,4' and not %set_env or os.environ().Spar

The CUDA Support/Choosing a GPU page suggests that

when running a CUDA program on a machine with multiple GPUs, by default CUDA kernels will execute on whichever GPU is installed in the primary graphics card slot.

Also, the discussion at No GPU selected, code working properly, how's this possible? suggests that CUDA does not map the "best" card to device 0 in general.

EDIT

Today I installed a PC with a Tesla C2050 card for computation and an 8084 GS card for visualization, switching their positions between the first two PCI-E slots. I used deviceQuery and noticed that GPU 0 is always the one in the first PCI slot and GPU 1 always the one in the second PCI slot. I do not know whether this holds in general, but it shows that on my system GPUs are numbered not according to their "power" but according to their positions.

Breana answered 9/9, 2013 at 10:36 Comment(1)
I agree. I've had cases where a machine has a modern GTX6xx Kepler and an ancient G80, and device 0 is the G80. The opposite has happened to me too. The "order of PCIe slots" explanation sounds reasonable. I haven't paid much attention to the PCIe slot order that I used, other than trying to reserve PCIe_3 slots for PCIe_3-compatible GPUs.Biddle

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' sets which physical device appears as cuda:0, cuda:1, ... cuda:n (it must be set before the first CUDA call initializes the runtime). For example:

import os
import torch

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
print(f'CUDA device count: {torch.cuda.device_count()}')
print(f'CUDA device name: {torch.cuda.get_device_name("cuda:0")}')
print(f'CUDA device name: {torch.cuda.get_device_name("cuda:1")}')
  • CUDA device count: 2
  • CUDA device name: NVIDIA GeForce RTX 4080
  • CUDA device name: NVIDIA GeForce RTX 3090

OR

import os
import torch

os.environ['CUDA_VISIBLE_DEVICES'] = '1,0'
print(f'CUDA device count: {torch.cuda.device_count()}')
print(f'CUDA device name: {torch.cuda.get_device_name("cuda:0")}')
print(f'CUDA device name: {torch.cuda.get_device_name("cuda:1")}')
  • CUDA device count: 2
  • CUDA device name: NVIDIA GeForce RTX 3090
  • CUDA device name: NVIDIA GeForce RTX 4080
Unteach answered 21/2 at 20:46 Comment(0)
