CUDA streams and context
I am currently working with an application that spawns a number of pthreads (Linux), each of which creates its own CUDA context. (Using CUDA 3.2 right now.)

The problem I am having is that each thread's context appears to cost a lot of device memory, roughly 200 MB per thread, so this is really limiting me.

Can I instead create the streams in the host thread and pass a stream handle to each worker thread, which would then pass that stream to my CUDA library, so that all of them work out of the same context?

Does a worker thread automatically share the CUDA context of its parent thread?

Thanks

Juba answered 25/7, 2011 at 18:8 Comment(0)

Each CUDA context does cost quite a bit of device memory, and their resources are strictly partitioned from one another. For example, device memory allocated in context A cannot be accessed by context B. Streams also are valid only in the context in which they were created.

The best practice would be to create one CUDA context per device. By default, that CUDA context can be accessed only from the CPU thread that created it. If you want to access the CUDA context from other threads, call cuCtxPopCurrent() to pop it from the thread that created it. The context then can be pushed onto any other CPU thread's current context stack, and subsequent CUDA calls would reference that context.

Context push/pop are lightweight operations, and as of CUDA 3.2 they can be performed in CUDA runtime apps. So my suggestion would be to initialize the CUDA context, then call cuCtxPopCurrent() to make the context "floating" until some thread wants to operate on it. Consider the "floating" state the natural one: whenever a thread wants to manipulate the context, it brackets its usage with cuCtxPushCurrent()/cuCtxPopCurrent().
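A minimal sketch of this pattern using the driver API (the worker function and thread count are illustrative, and the mutex reflects the original one-thread-at-a-time push/pop semantics; this obviously needs a CUDA-capable GPU to run):

```c
#include <cuda.h>
#include <pthread.h>

static CUcontext g_ctx;  /* the single shared context, "floating" when unused */
static pthread_mutex_t g_ctx_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    CUcontext unused;

    /* With push/pop, only one thread may have the context current at a
     * time, so serialize access around the bracketed region. */
    pthread_mutex_lock(&g_ctx_lock);
    cuCtxPushCurrent(g_ctx);         /* make the shared context current */

    CUstream stream;
    cuStreamCreate(&stream, 0);      /* streams belong to this context */
    /* ... launch kernels / memcpys on `stream` here ... */
    cuStreamSynchronize(stream);
    cuStreamDestroy(stream);

    cuCtxPopCurrent(&unused);        /* let the context "float" again */
    pthread_mutex_unlock(&g_ctx_lock);
    return NULL;
}

int main(void)
{
    CUdevice dev;
    CUcontext unused;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&g_ctx, 0, dev);     /* context is current to this thread */
    cuCtxPopCurrent(&unused);        /* detach it: now no thread owns it */

    pthread_t t[4];
    for (int i = 0; i < 4; ++i) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; ++i) pthread_join(t[i], NULL);

    cuCtxDestroy(g_ctx);
    return 0;
}
```

Because the workers share one context, device memory allocated by one thread is visible to all of them, which is exactly what per-thread contexts prevented.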

Seaworthy answered 26/7, 2011 at 13:11 Comment(4)
Does cuCtxPopCurrent() actually remove the context from the stack that holds it, so that it is not accessible to the other threads? Can I pop the current context from the host thread, pass that context to the worker threads, and have them "push" it onto their own context stacks? It sounds like the context would have to live in a concurrent queue guarded by a mutex, right?Juba
When a context is created, it is pushed onto a current-context stack. Popping the context causes it to become unavailable to any CPU thread until it has been pushed onto another current-context stack with cuCtxPushCurrent(). So the workflow you describe is exactly what the API was designed to enable. The contexts are thread-safe, so the only additional thread synchronization you need to implement would be to enforce an ordering or other semantics as needed by your application.Seaworthy
I'm using cuCtxSetCurrent(CUcontext ctx) in each thread. Does this need a corresponding release, similar to the pop after the push, when the thread no longer needs the context?Lillalillard
When push/pop were first implemented, only one CPU thread at a time could have the context current. cuCtxSetCurrent() was added later, after NVIDIA fixed CUDA so a context can be current to more than one thread at a time. So, I expect your code will work fine.Seaworthy
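For comparison, a sketch of the simpler cuCtxSetCurrent() pattern discussed in the last two comments (worker body is illustrative): since the same context can be current to multiple threads at once, no lock or push/pop bracketing is needed just to bind it.

```c
#include <cuda.h>
#include <pthread.h>

static CUcontext g_ctx;  /* created once on the main thread */

static void *worker(void *arg)
{
    cuCtxSetCurrent(g_ctx);   /* bind the shared context to this thread */
    /* ... create this thread's own stream and do CUDA work here ... */
    return NULL;              /* no explicit "release" needed; the
                                 binding simply ends with the thread */
}

int main(void)
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&g_ctx, 0, dev);

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);

    cuCtxDestroy(g_ctx);
    return 0;
}
```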
