I want to understand how a CUDA context is created and associated with a kernel in CUDA Runtime API applications.
I know this is done under the hood by the Driver API, but I would like to understand the timeline of the creation.
For a start, I know that __cudaRegisterFatBinary is the first CUDA API call made, and that it registers a fatbin file with the runtime. It is followed by a handful of CUDA function-registration APIs which call cuModuleLoad in the driver layer. But then, if my CUDA Runtime API application invokes cudaMalloc, how is the pointer returned by this call associated with the context, which I believe should have been created beforehand? How does one get a handle to this already-created context and associate future Runtime API calls with it? Please demystify the internal workings.
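For reference, here is my mental model of the explicit Driver API sequence that the Runtime API presumably hides. This is a sketch of the equivalent steps, not a claim about what the runtime actually calls internally; the fatbin image and kernel name are placeholders:

```cuda
// Sketch: the explicit Driver API steps the Runtime API performs implicitly.
// fatbin_image and "myKernel" are hypothetical placeholders.
// Error checking omitted for brevity.
#include <cuda.h>

extern const char fatbin_image[];  // hypothetical embedded fatbin

int main(void)
{
    CUdevice    dev;
    CUcontext   ctx;
    CUmodule    mod;
    CUfunction  fun;
    CUdeviceptr dptr;

    cuInit(0);                                   // initialize the driver API
    cuDeviceGet(&dev, 0);                        // pick device 0
    cuCtxCreate(&ctx, 0, dev);                   // create a context, bound to this thread
    cuModuleLoadData(&mod, fatbin_image);        // load the module into the current context
    cuModuleGetFunction(&fun, mod, "myKernel");  // look up a kernel by name
    cuMemAlloc(&dptr, 1024);                     // allocation belongs to the current context
    // ... cuLaunchKernel(fun, ...), cuMemFree(dptr), cuCtxDestroy(ctx) ...
    return 0;
}
```

The point I take from this sequence is that the module and the allocation are tied to whichever CUcontext is current on the calling thread at the time of the call, which is what I assume makes the implicit association work.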
To quote NVIDIA's documentation on this:
CUDA Runtime API calls operate on the CUDA Driver API CUcontext which is bound to the current host thread.
If there exists no CUDA Driver API CUcontext bound to the current thread at the time of a CUDA Runtime API call which requires a CUcontext then the CUDA Runtime will implicitly create a new CUcontext before executing the call.
If the CUDA Runtime creates a CUcontext then the CUcontext will be created using the parameters specified by the CUDA Runtime API functions cudaSetDevice, cudaSetValidDevices, cudaSetDeviceFlags, cudaGLSetGLDevice, cudaD3D9SetDirect3DDevice, cudaD3D10SetDirect3DDevice, and cudaD3D11SetDirect3DDevice. Note that these functions will fail with cudaErrorSetOnActiveProcess if they are called when a CUcontext is bound to the current host thread.
The lifetime of a CUcontext is managed by a reference counting mechanism. The reference count of a CUcontext is initially set to 0, and is incremented by cuCtxAttach and decremented by cuCtxDetach.
If a CUcontext is created by the CUDA Runtime, then the CUDA runtime will decrement the reference count of that CUcontext in the function cudaThreadExit. If a CUcontext is created by the CUDA Driver API (or is created by a separate instance of the CUDA Runtime API library), then the CUDA Runtime will not increment or decrement the reference count of that CUcontext.
All CUDA Runtime API state (e.g, global variables' addresses and values) travels with its underlying CUcontext. In particular, if a CUcontext is moved from one thread to another (using cuCtxPopCurrent and cuCtxPushCurrent) then all CUDA Runtime API state will move to that thread as well.
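To make sure I read that last paragraph correctly, this is my sketch of moving a context between host threads with cuCtxPopCurrent/cuCtxPushCurrent, as the documentation describes (error checking omitted):

```cuda
// Sketch (my reading of the quoted docs): a context created on the main
// thread is popped, then pushed onto a worker thread, carrying all
// Runtime API state with it. Error checking omitted.
#include <cuda.h>
#include <pthread.h>

static CUcontext g_ctx;  // context handed off between threads

static void *worker(void *arg)
{
    (void)arg;
    cuCtxPushCurrent(g_ctx);   // bind the floating context to this thread;
                               // Runtime API state should now be visible here
    // ... Runtime or Driver API work against the same context ...
    cuCtxPopCurrent(&g_ctx);   // unbind again before exiting
    return NULL;
}

int main(void)
{
    CUdevice  dev;
    pthread_t t;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&g_ctx, 0, dev);   // created and bound to the main thread
    cuCtxPopCurrent(&g_ctx);       // unbind: the context is now "floating"

    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);

    cuCtxPushCurrent(g_ctx);       // take it back on the main thread
    cuCtxDestroy(g_ctx);
    return 0;
}
```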
But what I don't understand is how the CUDA runtime creates the context. What API calls are used for this? Does the nvcc compiler insert these calls at compile time, or is this done entirely at runtime? If the former, which calls does nvcc insert for context management? If the latter, how exactly is it done?
If a context is associated with a host thread, how do we get access to that context? Is it automatically associated with all the variables and pointer references dealt with by the thread?
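One thing I did find is that the Driver API exposes cuCtxGetCurrent, which appears to return whatever context is bound to the calling thread, including one created implicitly by the runtime. This is the kind of probe I have been experimenting with (my assumption about the expected behavior is in the comments; error checking omitted):

```cuda
// Probe: after the first Runtime API call, the implicitly created context
// should be retrievable with cuCtxGetCurrent (my assumption).
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    CUcontext before = NULL, after = NULL;

    cuInit(0);
    cuCtxGetCurrent(&before);  // expecting NULL: no context bound yet
    cudaFree(0);               // any Runtime API call should force context creation
    cuCtxGetCurrent(&after);   // expecting a non-NULL handle now

    printf("before=%p after=%p\n", (void *)before, (void *)after);
    return 0;
}
```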
And how, ultimately, is module loading done within the context?