This question is a follow-up to Jason R's comment on Robert Crovella's answer to this original question ("Multiple CUDA contexts for one device - any sense?"):
When you say that multiple contexts cannot run concurrently, is this limited to kernel launches only, or does it refer to memory transfers as well? I have been considering a multiprocess design all on the same GPU that uses the IPC API to transfer buffers from process to process. Does this mean that effectively, only one process at a time has exclusive access to the entire GPU (not just particular SMs)? [...] How does that interplay with asynchronously-queued kernels/copies on streams in each process as far as scheduling goes?
Robert Crovella suggested asking this in a new question, but that never happened, so let me do it here.
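To make the design concrete, here is a minimal sketch of the kind of setup I have in mind, using the CUDA runtime IPC API (`cudaIpcGetMemHandle` / `cudaIpcOpenMemHandle`). The kernel, the function names, and the buffer size are placeholders of my own; the mechanism for shipping the handle between processes (pipe, socket, shared memory) is omitted because it doesn't affect the scheduling question:

```
// Sketch only: in reality producer() and consumer() run in two separate
// processes, and the cudaIpcMemHandle_t is sent between them via some
// ordinary IPC channel (omitted here).
#include <cuda_runtime.h>

// Placeholder kernel standing in for real per-process work.
__global__ void scale(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

// --- Process A: allocates a device buffer and exports an IPC handle ---
cudaIpcMemHandle_t producer(float **d_buf, int n) {
    cudaMalloc(d_buf, n * sizeof(float));
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, *d_buf);  // handle gets sent to process B
    return handle;
}

// --- Process B: maps the buffer and queues async work on its own stream ---
void consumer(cudaIpcMemHandle_t handle, int n) {
    float *d_shared;
    cudaIpcOpenMemHandle((void **)&d_shared, handle,
                         cudaIpcMemLazyEnablePeerAccess);

    cudaStream_t s;
    cudaStreamCreate(&s);
    // This is the part my question is about: kernels and copies queued
    // asynchronously here, while process A queues work in its own context.
    scale<<<(n + 255) / 256, 256, 0, s>>>(d_shared, n);
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaIpcCloseMemHandle(d_shared);
}
```

Given this sketch, the question is whether process A's and process B's queued kernels and copies serialize against each other at whole-GPU granularity, or whether the hardware can overlap them (e.g. a copy from one context with a kernel from the other).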