What is the relationship between NVIDIA MPS (Multi-Process Server) and CUDA Streams?
From a glance at the official NVIDIA Multi-Process Server (MPS) docs, it is unclear to me how MPS interacts with CUDA streams.

Here's an example:

App 0: issues kernels to logical stream 0;

App 1: issues kernels to (its own) logical stream 0.

In this case,

1) Does MPS "hijack" these CUDA calls, and if so, how? Does it have full knowledge, for each application, of which streams are used and which kernels are in which streams?

2) Does MPS create its own 2 streams, and place the respective kernels into the right streams? Or does MPS potentially enable kernel concurrency via mechanisms other than streams?

If it helps, I'm interested in how MPS works on Volta, but information about older architectures is appreciated as well.
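For concreteness, here is a minimal sketch of what each app in the example above might look like (the kernel, its name, and the sizes are hypothetical; only the stream usage matters):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for each app's work.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // Each app creates its own "logical stream 0" -- a stream handle
    // that is private to this process.
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Issue kernels to the process-local stream.
    work<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    work<<<(n + 255) / 256, 256, 0, s>>>(d, n);  // serialized after the first

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d);
    printf("done\n");
    return 0;
}
```

The question is then what happens when two instances of a program like this run simultaneously under MPS.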

Horrific answered 7/3, 2018 at 23:35 Comment(4)
From a programmer's perspective, there is no relationship. They are orthogonal. You're unlikely to get a precise description of how MPS works under the hood, as that information is not published anywhere, and is subject to change. These don't really strike me as programming questions, anyway. – Inhalant

Thanks for your response. Is it fair to say that NVIDIA has not published information about either (1) or (2)? – Horrific

Yes, that is my view. I can provide an answer if you want, but it will contain a lot of "this isn't published or specified". – Inhalant

Anything will be helpful, Robert. As we speak, I found that slides 21 and onward in this NVIDIA deck do hint at the use of multiple streams. – Horrific

A way to think about MPS is that it acts as a funnel for CUDA activity, emanating from multiple processes, to take place on the GPU as if it emanated from a single process. One of the specific benefits of MPS is that kernel concurrency is theoretically possible even when the kernels emanate from separate processes. The "ordinary" CUDA multi-process execution model would serialize such kernel executions.
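To make that multi-process setup concrete, a rough sketch of running two clients through the MPS daemon might look like this (paths and the client binary name are placeholders; check the MPS docs for your driver version):

```sh
# Start the MPS control daemon (one per GPU node, typically as the GPU owner).
nvidia-cuda-mps-control -d

# Launch two client processes; both attach to the daemon transparently
# and their kernels are funneled onto the GPU together.
./my_cuda_app &   # App 0, issues to its own stream 0
./my_cuda_app &   # App 1, issues to its own stream 0
wait

# Shut the daemon down when done.
echo quit | nvidia-cuda-mps-control
```

Without the daemon running, the same two launches would time-slice the GPU rather than potentially overlap.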

Since kernel concurrency in a single process implies that the kernels in question are issued to separate streams, it stands to reason that conceptually, MPS treats the streams from the various client processes as completely separate. Naturally, then, if you profile such an MPS setup, the streams will show up as being separate from each other, whether they are separate streams associated with a single client process, or streams across several client processes.

In the pre-Volta case, MPS did not guarantee process isolation between kernel activity from separate processes. In this respect, it was very much like a funnel, taking activity from several processes and issuing it to the GPU as if it were issued from a single process.

In the Volta case, activity from separate processes behaves from an execution standpoint (e.g. concurrency, etc.) as if it were from a single process, but activity from separate processes still carries process isolation (e.g. independent address spaces).

1) Does MPS "hijack" these CUDA calls, and if so, how? Does it have full knowledge, for each application, of which streams are used and which kernels are in which streams?

Yes, CUDA MPS understands separate streams from a given process, as well as the activity issued to each, and maintains such stream semantics when issuing work to the GPU. The exact details of how CUDA calls are handled by MPS are unpublished, to my knowledge.

2) Does MPS create its own 2 streams, and place the respective kernels into the right streams? Or does MPS potentially enable kernel concurrency via mechanisms other than streams?

MPS maintains all stream activity, as well as CUDA stream semantics, across all clients. Activity issued into a particular CUDA stream will be serialized. Activity issued to independent streams may possibly run concurrently. This is true regardless of the origin of the streams in question, be they from one process or several.
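Those stream semantics are the same ones you see within one process. A minimal single-process illustration (the spin kernel and cycle count are hypothetical, chosen just to make kernels long enough to overlap):

```cuda
#include <cuda_runtime.h>

// Busy-wait kernel so launches are long enough to observe ordering
// in a profiler.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Same stream: the second launch waits for the first (serialized).
    spin<<<1, 1, 0, s1>>>(1000000);
    spin<<<1, 1, 0, s1>>>(1000000);

    // Independent streams: these launches are eligible to overlap --
    // under MPS, kernels from two separate client processes behave
    // the same way as these two streams do.
    spin<<<1, 1, 0, s1>>>(1000000);
    spin<<<1, 1, 0, s2>>>(1000000);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```

Note that concurrency across independent streams is only an eligibility, not a guarantee; resource availability on the GPU decides whether overlap actually occurs.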

Inhalant answered 8/3, 2018 at 3:54 Comment(1)
Thanks for your answer! – Horrific
