Multiple CUDA contexts for one device - any sense?

I thought I had a grasp of this, but apparently I do not. :) I need to perform parallel H.264 stream encoding with NVENC from frames that are not in any of the formats accepted by the encoder, so I have the following pipeline (sketched in code after the list):

  • A callback informing that a new frame has arrived is called
  • I copy the frame to CUDA memory and perform the needed color space conversions (only the first cuMemcpy is synchronous, so I can return from the callback; all remaining operations are enqueued on a dedicated stream)
  • I record an event on the stream and have another thread waiting for it; as soon as the event fires, I take the CUDA memory pointer with the frame in the correct color space and feed it to the encoder
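
For concreteness, here is roughly what that pipeline looks like in code. This is only a sketch under assumptions: convertToNV12 is a stand-in conversion kernel, feedEncoder stands in for submitting the frame to the NVENC session (neither is a real API call), and error checking is omitted.

    // One transcoder thread's pipeline; placeholders as noted above.
    #include <cstddef>
    #include <cstdint>
    #include <cuda_runtime.h>

    __global__ void convertToNV12(const uint8_t* src, uint8_t* dst,
                                  int w, int h) {
        // Placeholder body: a real kernel would convert the source
        // pixel format to NV12 here.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < w && y < h) dst[y * w + x] = src[y * w + x];
    }

    void feedEncoder(const uint8_t* devNV12) { /* hand pointer to NVENC */ }

    cudaStream_t stream;      // dedicated stream for this thread
    cudaEvent_t  frameReady;  // "conversion finished" signal

    void setup() {
        cudaStreamCreate(&stream);
        cudaEventCreate(&frameReady);
    }

    // Called from the frame-arrival callback.
    void onFrame(const uint8_t* hostFrame, size_t bytes,
                 uint8_t* devRaw, uint8_t* devNV12, int w, int h) {
        // Synchronous copy, so the caller may reuse its buffer on return.
        cudaMemcpy(devRaw, hostFrame, bytes, cudaMemcpyHostToDevice);
        // The conversion is merely enqueued; the callback returns at once.
        dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
        convertToNV12<<<grid, block, 0, stream>>>(devRaw, devNV12, w, h);
        // The encoder thread waits on this event.
        cudaEventRecord(frameReady, stream);
    }

    // Runs in the separate encoder thread.
    void encoderLoop(const uint8_t* devNV12) {
        for (;;) {
            cudaEventSynchronize(frameReady); // block until conversion done
            feedEncoder(devNV12);             // frame is now in NV12
        }
    }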

For some reason I had assumed that I need a dedicated context for each thread if I run this pipeline in parallel threads. The code was slow, and after some reading I understood that context switching is actually expensive. I then came to the conclusion that multiple contexts make no sense anyway, since a context owns the whole GPU and I would lock out any parallel processing from the other transcoder threads.

Question 1: In this scenario am I good with using a single context and an explicit stream created on this context for each thread that performs the mentioned pipeline?

Question 2: Can someone enlighten me on what is the sole purpose of the CUDA device context? I assume it makes sense in a multiple GPU scenario, but are there any cases where I would want to create multiple contexts for one GPU?

Eichelberger asked 30/4, 2015 at 9:48. Comments (2):
What is NVCENC? I have heard of NVENC and NVCUVENC. - Displode
@RobertCrovella, my bad, I misspelled NVENC. - Eichelberger

Question 1: In this scenario am I good with using a single context and an explicit stream created on this context for each thread that performs the mentioned pipeline?

You should be fine with a single context.

Question 2: Can someone enlighten me on what is the sole purpose of the CUDA device context? I assume it makes sense in a multiple GPU scenario, but are there any cases where I would want to create multiple contexts for one GPU?

The CUDA device context is discussed in the programming guide. It represents all of the state (memory map, allocations, kernel definitions, and other state-related information) associated with a particular process (i.e. associated with that particular process' use of a GPU). Separate processes will normally have separate contexts (as will separate devices), as these processes have independent GPU usage and independent memory maps.

If you have multi-process usage of a GPU, you will normally create multiple contexts on that GPU. As you've discovered, it's possible to create multiple contexts from a single process, but not usually necessary.

And yes, when you have multiple contexts, kernels launched in those contexts will require context switching to go from one kernel in one context to another kernel in another context. Those kernels cannot run concurrently.

The CUDA runtime API manages contexts for you; you normally don't interact with a context explicitly when using the runtime API. With the driver API, however, the context is explicitly created and managed.
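
To make the contrast concrete, here is a minimal driver-API sketch (my illustration, not code from the programming guide) of the single-context, stream-per-thread arrangement from Question 1; error checking omitted:

    #include <cuda.h>

    CUcontext ctx;  // one context, shared by all transcoder threads

    void initOnce() {
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);  // explicit creation: driver API only
    }

    void transcoderThread() {
        cuCtxSetCurrent(ctx);       // bind the shared context to this thread
        CUstream stream;
        cuStreamCreate(&stream, CU_STREAM_DEFAULT);
        // ... enqueue copies, conversion kernels, and events on 'stream' ...
        cuStreamSynchronize(stream);
        cuStreamDestroy(stream);
    }

With the runtime API, the equivalent of initOnce happens implicitly on first use, which is why most runtime-API code never mentions a context at all.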

Displode answered 30/4, 2015 at 12:19. Comments (3):
When you say that multiple contexts cannot run concurrently, is this limited to kernel launches only, or does it refer to memory transfers as well? I have been considering a multi-process design, all on the same GPU, that uses the IPC API to transfer buffers from process to process. Does this mean that, effectively, only one process at a time has exclusive access to the entire GPU (not just particular SMs)? That's not a killer for my design, but it is disappointing. And how does that interplay with asynchronously queued kernels/copies on streams in each process, as far as scheduling goes? - Regulus
Regarding your first question, I thought the second-to-last paragraph in my answer made it pretty clear that I was talking about concurrent kernels, but I've made a slight edit to remove doubt. Regarding the remainder, I suggest you pose a new question; it's not practical to delve into these topics in the space of comments. - Displode
@RobertCrovella Given that there's a single CUDA context that ultimately needs to be synchronized across threads (cuvidCtxLock), does this mean we can never have truly concurrent execution with the NVENC APIs? Would it be faster if I looped across the frames of multiple streams for encoding in a single thread instead? - Emmetropia

Obviously a few years have passed, but NVENC/NVDEC now appear to have CUstream support as of version 9.1 (circa September 2019) of the Video Codec SDK: https://developer.nvidia.com/nvidia-video-codec-sdk/download

NEW to 9.1- Encode: CUStream support in NVENC for enhanced parallelism between CUDA pre-processing and NVENC encoding

I'm super new to CUDA, but my basic understanding is that CUcontexts allow multiple processes to use the GPU (by doing context swaps that interrupt each other's work), while CUstreams allow for coordinated sharing of the GPU's resources from within a single process.
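
For anyone reading this now, usage looks roughly like the sketch below. This is my own approximation built around the nvEncSetIOCudaStreams entry point the 9.1 SDK exposes; the encoder session and function list are assumed to be created elsewhere, and the casts follow the SDK samples as I understand them:

    #include <cuda_runtime.h>
    #include "nvEncodeAPI.h"

    // 'fns' is an already-populated NV_ENCODE_API_FUNCTION_LIST and 'enc'
    // an open encode session; both are assumed to exist elsewhere.
    void attachStreams(NV_ENCODE_API_FUNCTION_LIST& fns, void* enc,
                       cudaStream_t& inStream, cudaStream_t& outStream) {
        cudaStreamCreate(&inStream);
        cudaStreamCreate(&outStream);
        // NVENC waits on inStream before reading an input surface and
        // signals outStream when output is ready, so CUDA pre-processing
        // queued on inStream can overlap with encoding.
        fns.nvEncSetIOCudaStreams(enc,
                                  (NV_ENC_CUSTREAM_PTR)&inStream,
                                  (NV_ENC_CUSTREAM_PTR)&outStream);
    }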

Host answered 4/6, 2020 at 23:49.
