I am currently working with an application that spawns a number of pthreads (Linux), and each of those creates its own CUDA context (using CUDA 3.2 right now).
The problem I am having is that each thread's context appears to cost a lot of GPU memory, something like 200 MB per thread, and that is really limiting me.
Can I simply create the streams in the host thread, pass stream references to the worker threads, and have each worker hand its stream to my CUDA library, so that all of the work happens in a single context?
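Here is a rough sketch of the setup I have in mind. The worker count, the dummy kernel, and my_library_process() are placeholders for my actual library call, and I don't know whether the workers would really end up in the main thread's context under CUDA 3.2:

```cpp
#include <pthread.h>
#include <cuda_runtime.h>

#define NUM_WORKERS 4

typedef struct {
    cudaStream_t stream;   // stream created by the main thread
    int id;                // worker index
} worker_arg_t;

__global__ void dummy_kernel(int id) { /* placeholder for real work */ }

// Stand-in for my CUDA library call; it just queues a kernel on the given stream.
static void my_library_process(cudaStream_t stream, int id)
{
    dummy_kernel<<<1, 1, 0, stream>>>(id);
}

static void *worker(void *p)
{
    worker_arg_t *arg = (worker_arg_t *)p;
    // The hope is that this runs in the *same* context as the main thread,
    // so no extra ~200 MB context is created per worker.
    my_library_process(arg->stream, arg->id);
    cudaStreamSynchronize(arg->stream);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_WORKERS];
    worker_arg_t args[NUM_WORKERS];

    cudaSetDevice(0);  // establish the context in the main thread

    for (int i = 0; i < NUM_WORKERS; ++i) {
        cudaStreamCreate(&args[i].stream);  // streams created in the main thread's context
        args[i].id = i;
        pthread_create(&threads[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < NUM_WORKERS; ++i) {
        pthread_join(threads[i], NULL);
        cudaStreamDestroy(args[i].stream);
    }
    return 0;
}
```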
Does a worker thread automatically use the same CUDA context as its parent thread?
Thanks