I want to understand how a CUDA context is created and associated with a kernel in CUDA Runtime API applications.
I know this is done under the hood by the Driver API, but I would like to understand the timeline of the creation.
For a start, I know that __cudaRegisterFatBinary is the first CUDA API call made, and that it registers a fatbin file with the runtime. It is followed by a handful of CUDA function-registration APIs which call cuModuleLoad in the driver layer. But then, if my CUDA Runtime API application invokes cudaMalloc, how is the pointer returned by this call associated with the context, which I believe should have been created beforehand? How does one get a handle to this already-created context and associate future Runtime API calls with it? Please demystify the internal workings.
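For reference, here is my mental model of the explicit Driver API sequence that the Runtime API presumably hides. This is a sketch of the equivalent steps, not a claim about what the runtime actually calls internally; the fatbin image and kernel name are placeholders:

```cuda
// Sketch: the explicit Driver API steps the Runtime API performs implicitly.
// fatbin_image and "myKernel" are hypothetical placeholders.
// Error checking omitted for brevity.
#include <cuda.h>

extern const char fatbin_image[];  // hypothetical embedded fatbin

int main(void)
{
    CUdevice    dev;
    CUcontext   ctx;
    CUmodule    mod;
    CUfunction  fun;
    CUdeviceptr dptr;

    cuInit(0);                                   // initialize the driver API
    cuDeviceGet(&dev, 0);                        // pick device 0
    cuCtxCreate(&ctx, 0, dev);                   // create a context, bound to this thread
    cuModuleLoadData(&mod, fatbin_image);        // load the module into the current context
    cuModuleGetFunction(&fun, mod, "myKernel");  // look up a kernel by name
    cuMemAlloc(&dptr, 1024);                     // allocation belongs to the current context
    // ... cuLaunchKernel(fun, ...), cuMemFree(dptr), cuCtxDestroy(ctx) ...
    return 0;
}
```

The point I take from this sequence is that the module and the allocation are tied to whichever CUcontext is current on the calling thread at the time of the call, which is what I assume makes the implicit association work.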
To quote NVIDIA's documentation on this:
CUDA Runtime API calls operate on the CUDA Driver API CUcontext which is bound to the current host thread.
If there exists no CUDA Driver API CUcontext bound to the current thread at the time of a CUDA Runtime API call which requires a CUcontext then the CUDA Runtime will implicitly create a new CUcontext before executing the call.
If the CUDA Runtime creates a CUcontext then the CUcontext will be created using the parameters specified by the CUDA Runtime API functions cudaSetDevice, cudaSetValidDevices, cudaSetDeviceFlags, cudaGLSetGLDevice, cudaD3D9SetDirect3DDevice, cudaD3D10SetDirect3DDevice, and cudaD3D11SetDirect3DDevice. Note that these functions will fail with cudaErrorSetOnActiveProcess if they are called when a CUcontext is bound to the current host thread.
The lifetime of a CUcontext is managed by a reference counting mechanism. The reference count of a CUcontext is initially set to 0, and is incremented by cuCtxAttach and decremented by cuCtxDetach.
If a CUcontext is created by the CUDA Runtime, then the CUDA runtime will decrement the reference count of that CUcontext in the function cudaThreadExit. If a CUcontext is created by the CUDA Driver API (or is created by a separate instance of the CUDA Runtime API library), then the CUDA Runtime will not increment or decrement the reference count of that CUcontext.
All CUDA Runtime API state (e.g, global variables' addresses and values) travels with its underlying CUcontext. In particular, if a CUcontext is moved from one thread to another (using cuCtxPopCurrent and cuCtxPushCurrent) then all CUDA Runtime API state will move to that thread as well.
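To make sure I read that last paragraph correctly, this is my sketch of moving a context between host threads with cuCtxPopCurrent/cuCtxPushCurrent, as the documentation describes (error checking omitted):

```cuda
// Sketch (my reading of the quoted docs): a context created on the main
// thread is popped, then pushed onto a worker thread, carrying all
// Runtime API state with it. Error checking omitted.
#include <cuda.h>
#include <pthread.h>

static CUcontext g_ctx;  // context handed off between threads

static void *worker(void *arg)
{
    (void)arg;
    cuCtxPushCurrent(g_ctx);   // bind the floating context to this thread;
                               // Runtime API state should now be visible here
    // ... Runtime or Driver API work against the same context ...
    cuCtxPopCurrent(&g_ctx);   // unbind again before exiting
    return NULL;
}

int main(void)
{
    CUdevice  dev;
    pthread_t t;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&g_ctx, 0, dev);   // created and bound to the main thread
    cuCtxPopCurrent(&g_ctx);       // unbind: the context is now "floating"

    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);

    cuCtxPushCurrent(g_ctx);       // take it back on the main thread
    cuCtxDestroy(g_ctx);
    return 0;
}
```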
But what I don't understand is how the CUDA runtime creates the context. What API calls are used for this? Does the nvcc compiler insert these calls at compile time, or is this done entirely at runtime? If the former, which calls does nvcc insert for context management? If the latter, how exactly is it done?
If a context is associated with a host thread, how do we get access to that context? Is it automatically associated with all the variables and pointer references dealt with by the thread?
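One thing I did find is that the Driver API exposes cuCtxGetCurrent, which appears to return whatever context is bound to the calling thread, including one created implicitly by the runtime. This is the kind of probe I have been experimenting with (my assumption about the expected behavior is in the comments; error checking omitted):

```cuda
// Probe: after the first Runtime API call, the implicitly created context
// should be retrievable with cuCtxGetCurrent (my assumption).
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    CUcontext before = NULL, after = NULL;

    cuInit(0);
    cuCtxGetCurrent(&before);  // expecting NULL: no context bound yet
    cudaFree(0);               // any Runtime API call should force context creation
    cuCtxGetCurrent(&after);   // expecting a non-NULL handle now

    printf("before=%p after=%p\n", (void *)before, (void *)after);
    return 0;
}
```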
And how, ultimately, is module loading done within the context?