The CUDA documentation does not specify how many CUDA processes can share one GPU. For example, if I launch more than one CUDA program as the same user, with only one GPU card installed in the system, what is the effect? Is correctness of execution guaranteed? How does the GPU schedule tasks in this case?
CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take place in separate CUDA contexts, on the same device.
CUDA activity in separate contexts will be serialized. The GPU will execute the activity from one process, and when that activity is idle, it can and will context-switch to another context to complete the CUDA activity launched from the other process. The detailed inter-context scheduling behavior is not specified and may vary depending on machine setup. (Running multiple contexts on a single GPU also cannot normally violate basic GPU limits, such as memory availability for device allocations.) Casual observation or micro-benchmarking may suggest that kernels from separate processes on newer devices can run concurrently (outside of MPS), but this is not correct. Newer machine setups may have time-sliced rather than round-robin behavior, but this does not change the fact that at any given instant in time, code from only one context can run.
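If you want to observe this for yourself, one possible experiment (just a sketch, not part of the documented behavior above; it assumes a Linux host, device 0 in the Default compute mode, and a profiler such as Nsight Systems to view the timeline) is to fork two processes that each launch a long-running kernel:

#include <cstdio>
#include <unistd.h>       // fork(), getpid()
#include <sys/wait.h>     // wait()
#include <cuda_runtime.h>

// A deliberately long-running spin kernel, so any overlap (or lack of it)
// between the two processes is visible on a profiler timeline.
__global__ void busy_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    // Fork BEFORE any CUDA call, so that the parent and the child each
    // create their own CUDA context on device 0 after the fork.
    pid_t pid = fork();

    busy_kernel<<<1, 1>>>(1000000000LL);  // roughly a second; keep it short
                                          // enough to avoid a display watchdog
    cudaDeviceSynchronize();
    printf("process %d finished: %s\n", (int)getpid(),
           cudaGetErrorString(cudaGetLastError()));

    if (pid != 0) wait(nullptr);          // parent waits for the child
    return 0;
}

Profiling both processes should show the two kernels in separate contexts being time-shared or serialized, rather than running as concurrent kernels within a single context.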
The "exception" to this case (serialization of GPU activity from independent host processes) would be the CUDA Multi-Process Server. In a nutshell, the MPS acts as a "funnel" to collect CUDA activity emanating from several host processes, and run that activity as if it emanated from a single host process. The principal benefit is to avoid the serialization of kernels which might otherwise be able to run concurrently. The canonical use-case would be for launching multiple MPI ranks that all intend to use a single GPU resource.
Note that the above description applies to GPUs which are in the "Default" compute mode. GPUs in "Exclusive Process" or "Exclusive Thread" compute modes will reject any attempts to create more than one process/context on a single device. In one of these modes, attempts by other processes to use a device already in use will result in a CUDA API reported failure. The compute mode is modifiable in some cases using the nvidia-smi utility.
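If you are not sure which compute mode a device is in, it can also be queried programmatically from the runtime API (a minimal sketch; it assumes device 0 and omits error checking):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

    const char* mode = "other/unknown";
    switch (prop.computeMode) {
        case cudaComputeModeDefault:          mode = "Default";           break;
        case cudaComputeModeProhibited:       mode = "Prohibited";        break;
        case cudaComputeModeExclusiveProcess: mode = "Exclusive Process"; break;
        default:                              break;  // e.g. the deprecated Exclusive Thread mode
    }
    printf("GPU 0 (%s) compute mode: %s\n", prop.name, mode);
    return 0;
}

The nvidia-smi utility mentioned above reports the same information (e.g. nvidia-smi -q -d COMPUTE) and, with sufficient privileges, can change the mode via nvidia-smi -c.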
I am new to this topic, but I have found that it is possible to simulate multiple GPUs on a single physical GPU: "Developing for multiple GPUs will allow a model to scale with the additional resources. If developing on a system with a single GPU, we can simulate multiple GPUs with virtual devices. This enables easy testing of multi-GPU setups without requiring additional resources."
Source: https://www.tensorflow.org/guide/gpu#allowing_gpu_memory_growth
Perhaps, using this technique, each model could be run on one of these virtual GPUs (at least for inference).
You did not specify your use case, but it is trivial to just write different routines and select at runtime what to do:
#include <cstdio>    // device-side printf
#include <cassert>   // device-side assert

__device__ void RunProgram1() {
    printf("T:%i B:%i program1\n", threadIdx.x, blockIdx.x);
}

__device__ void RunProgram2() {
    printf("T:%i B:%i program2\n", threadIdx.x, blockIdx.x);
}

__global__ void StartDifferentTasks() {
    int TaskID = blockIdx.x % 2;  // assign blocks 'equally' to each task;
                                  // you can partition this as needed
    switch (TaskID) {
        case 0: RunProgram1(); break;
        case 1: RunProgram2(); break;
        default: assert(false);
    }
}

int main() {
    StartDifferentTasks<<<4, 32>>>();  // 4 blocks of 32 threads, 2 programs
    cudaDeviceSynchronize();           // todo: add error checking
    return 0;
}