How to implement handles for a CUDA driver API library?

Asked 8/2, 2018 at 1:19 Answered 9/10, 2020 at 8:52

^{Note: The question has been updated to address the questions that have been raised in the comments, and to emphasize that the core of the question is about the interdependencies between the Runtime- and Driver API}

The CUDA runtime libraries (like CUBLAS or CUFFT) are generally using the concept of a "handle" that summarizes the state and context of such a library. The usage pattern is quite simple:

// Create a handle
cublasHandle_t handle;
cublasCreate(&handle);

// Call some functions, always passing in the handle as the first argument
cublasSscal(handle, ...);

// When done, destroy the handle
cublasDestroy(handle);

However, there are many subtle details about how these handles interoperate with Driver- and Runtime contexts and multiple threads and devices. The documentation lists several, scattered details about context handling:

The general description of contexts in the CUDA Programming Guide at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#context
The handling of multiple contexts, as described in the CUDA Best Practices Guide at http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#multiple-contexts
The context management differences between runtime and driver API, explained at http://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html
The general description of CUBLAS contexts/handles at http://docs.nvidia.com/cuda/cublas/index.html#cublas-context and their thread safety at http://docs.nvidia.com/cuda/cublas/index.html#thread-safety2

However, some of information seems to be not entirely up to date (for example, I think one should use cuCtxSetCurrent instead of cuCtxPushCurrent and cuCtxPopCurrent?), some of it seems to be from a time before the "Primary Context" handling was exposed via the driver API, and some parts are oversimplified in that they only show the most simple usage patterns, make only vague or incomplete statements about multithreading, or cannot be applied to the concept of "handles" that is used in the runtime libraries.

My goal is to implement a runtime library that offers its own "handle" type, and that allows usage patterns that are equivalent to the other runtime libraries in terms of context handling and thread safety.

For the case that the library can internally be implemented solely using the Runtime API, things may be clear: The context management is solely in the responsibility of the user. If he creates an own driver context, the rules that are stated in the documentation about the Runtime- and Driver context management will apply. Otherwise, the Runtime API functions will take care of the handling of primary contexts.

However, there may be the case that a library will internally have to use the Driver API. For example, in order to load PTX files as CUmodule objects, and obtain the CUfunction objects from them. And when the library should - for the user - behave like a Runtime library, but internally has to use the Driver API, some questions arise about how the context handling has to be implemented "under the hood".

What I have figured out so far is sketched here.

_{(It is "pseudocode" in that it omits the error checks and other details, and ... all this is supposed to be implemented in Java, but that should not be relevant here)}

1. The "Handle" is basically a class/struct containing the following information:

class Handle 
{
    CUcontext context;
    boolean usingPrimaryContext;
    CUdevice device;
}

2. When it is created, two cases have to be covered: It can be created when a driver context is current for the calling thread. In this case, it should use this context. Otherwise, it should use the primary context of the current (runtime) device:

Handle createHandle()
{
    cuInit(0);

    // Obtain the current context
    CUcontext context;
    cuCtxGetCurrent(&context);

    CUdevice device;

    // If there is no context, use the primary context
    boolean usingPrimaryContext = false;
    if (context == nullptr)
    {
        usingPrimaryContext = true;

        // Obtain the device that is currently selected via the runtime API
        int deviceIndex;
        cudaGetDevice(&deviceIndex);

        // Obtain the device and its primary context
        cuDeviceGet(&device, deviceIndex);
        cuDevicePrimaryCtxRetain(&context, device));
        cuCtxSetCurrent(context);
    }
    else
    {
        cuCtxGetDevice(device);
    }

    // Create the actual handle. This might internally allocate
    // memory or do other things that are specific for the context
    // for which the handle is created
    Handle handle = new Handle(device, context, usingPrimaryContext);
    return handle;
}

3. When invoking a kernel of the library, the context of the associated handle is made current for the calling thread:

void someLibraryFunction(Handle handle)
{
    cuCtxSetCurrent(handle.context);
    callMyKernel(...);
}

Here, one could argue that the caller is responsible for making sure that the required context is current. But if the handle was created for a primary context, then this context will be made current automatically.

4. When the handle is destroyed, this means that cuDevicePrimaryCtxRelease has to be called, but only when the context is a primary context:

void destroyHandle(Handle handle)
{
    if (handle.usingPrimaryContext)
    {
        cuDevicePrimaryCtxRelease(handle.device);
    }
}

From my experiments so far, this seems to expose the same behavior as a CUBLAS handle, for example. But my possibilities for thoroughly testing this are limited, because I only have a single device, and thus cannot test the crucial cases, e.g. of having two contexts, one for each of two devices.

So my questions are:

Are there any established patterns for implementing such a "Handle"?
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?

(I also had a look at the context handling in tensorflow, but I'm not sure whether one can derive recommendations about how to implement handles for a runtime library from that...)

^{(An "Update" has been removed here, because it was added in response to the comments, and should no longer be relevant)}

Repro answered 8/2, 2018 at 1:19 Comment(18)

I believe the entire premise behind this question is incorrect. AFAIK, cublas (the example library in question) is a completely plain runtime API library which relies entirely on standard runtime API lazy context management behaviour. It never does any explicit context management itself, makes no attempt to do anything related to interoperability with the driver API, and the handle contains no context information whatsoever. See pastebin.com/8zHi1WgV – Mariellemariellen 14/2, 2018 at 11:45

@Mariellemariellen This may be true (and if it is true, the implementation might be far simpler than I thought). But I tried to cover different usage scenarios, and am pretty sure that there is a dependency between the (possibly existing) driver context and the CUBLAS handle. I have added an Update about this. Of course, one could observe this behavior, and justify it (e.g. by saying that creating a handle allocates memory, and it must be context specific). But this is, in my opinion, not stated clearly and unambiguously in the documentation... – Repro 14/2, 2018 at 13:5

@Mariellemariellen From the observations and your statement, it seems like one could roughly say that the handles behave similarly to plain memory: They are specific for the context that they are created on, and each operation that involves a handle and memory must make sure that the handle and the memory belong to the same context. (The fact that creating a handle also creates a context (if none existed) is then rather a matter of convenience for the Runtime API user, and not relevant for the further handling of the ... handle). – Repro 14/2, 2018 at 13:11

I'm not convinced your interpretation is correct. Don't confuse what the runtime API does by default with something being built into cublas. Any time a thread with an active CUDA context initializes the runtime API, the runtime API will bind onto that context. And what I linked to shows that all cublas will do is allocate memory in that context on the init call. If two threads start driver API contexts and then init cublas then the resulting handles will be different because they exist in different contexts – Mariellemariellen 15/2, 2018 at 7:31

So your second comment is closer to the mark. All of this is just default runtime API behaviour, no some elaborate logic built into the library – Mariellemariellen 15/2, 2018 at 7:33

@Mariellemariellen So then the part that is a bit vague is at least "...functions can be called from multiple host threads, even with the same handle." (from the last link in the list) : This only applies to the case where these host threads do not have their own Driver Context. Or more generally: Using a CUBLAS handle within a context that is not the one that it was created for will cause an error. Beyond that: Do you think that the sketched implementation (e.g. the handling of the primary context and the cuDevicePrimaryCtxRelease) are the right pattern for such a handle? – Repro 15/2, 2018 at 20:21

It is the correct deign pattern? Absolutely not. The correct design for any runtime API library is to perform no explicit context management itself. Ever. That is how all the "reference" NVIDIA libraries work and it is how yours should work too. The default runtime API behaviour (and its built in thread safety) covers 95% of typical use cases, the other 5% being where the programmer elects to manage contexts themselves via the driver API. In that case it is the programmer, not the library that must manage things correctly – Mariellemariellen 16/2, 2018 at 9:20

@Mariellemariellen Even though the context management should usually not be "visible" to the user of the runtime library, it seems like the library itself at least has to be aware of it. E.g. docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html says: "if a context has been made current to the calling thread through the driver API, the runtime will use that, but if there is no such context, it uses a "primary context."". And it seems like e.g. a CUBLAS handle is specific for the (driver!) context that was current while it was created. (Am I overestimating the pitfalls here?) – Repro 16/2, 2018 at 10:0

No the library doesn't need to be aware of contexts at all. It is very simple -- if there is a context, the runtime API will automatically bind to it. If there isn't it will create a primary context, and other threads can bind to that context via the same API (this happens automatically in the runtime API). Anything created in a context it is, by default, not portable to other contexts. If you explicitly create two contexts via the driver API, then anything in those two contexts is not portable between them. This includes anything that happens via the runtime API within those contexts – Mariellemariellen 16/2, 2018 at 10:15

@Mariellemariellen (I start to feel a bit dull-witted, and will investigate this further when I'm back at my main dev PC but ...) Doesn't all this still mean that the runtime library has to increment/decrement the reference count of the (potentially used) primary context during the construction/destruction of the handle? E.g. is it wrong that at some point cuDevicePrimaryCtxRetain has to be called? – Repro 16/2, 2018 at 20:0

Nearly anything you create with a given context in CUDA would fail miserably if you tried to use it in another context. So creating an example of such does little (IMO) to advance your argument that special handling is required and there are pitfalls to be avoided. The relevant programming guide section is devoid of these concerns. So you can either take it at face value, or you can imagine dragons and declare that "the documentation is everything but clear about this". – Oribelle 18/2, 2018 at 3:2

@RobertCrovella The example that I added should only emphasize exactly that - namely, that a handle is specific for a (driver) context (which may not be surprising). A "pseudocode sketch" that shows what has to be done when a Handle is created/used/destroyed could be an acceptable answer, though. If it does not have to care about contexts, even better (but I hardly can imagine that). – Repro 18/2, 2018 at 3:44

@Marco13: Am I right in (belatedly) understanding that your question is actually something like "how should I design a handle for a drvier API based library which should work like CUBLAS or any other runtime API libary from a user point of view?" – Mariellemariellen 20/2, 2018 at 15:59

@Mariellemariellen Yes, that's basically it. I can only try to apologize for not making this clear enough in the first place, and in hindsight, I understand where much of the confusion came from: If somebody only uses the Runtime API in his library, then he may not have to care about all this. But I (erroneously and unjustifiedly) assumed that libraries would "usually, in some way" use the Driver API internally. I will (a bit later today) try to edit the question at some points to make this clearer. Sorry for the hassle. – Repro 20/2, 2018 at 17:17

@RobertCrovella and talonmies : I have updated the question in order to make clear that it is, at its core, about implementing Handles for a library that behaves like a Runtime library for the user, but internally has to use the Driver API. Sorry again for not making this clear right from the beginning. – Repro 20/2, 2018 at 22:25

@Mariellemariellen Thanks for the edit (and BTW, the comments, which helped to make the question clearer). I hope that the title does not distort the answers to being a pointer to docs.nvidia.com/cuda/cuda-c-programming-guide/… . But the fact that the library should "look and feel like" a Runtime library hopefully still becomes clear in the question. – Repro 3/3, 2018 at 14:8

@Marco13: We are now past the two year anniversary of this question. Did you even come to any conclusion or solution which could be added as an answer? This is the most highly upvoted unanswered question on the CUDA tag, so there clearly is demand for an answer.... – Mariellemariellen 22/7, 2020 at 9:43

@Mariellemariellen Yes, whenever I see an upvote, I think "I should tackle this", but I'm juggling with priorities - after all, this is only for a spare time project (the first approach was in github.com/jcuda/jcuda-vec/compare/… ). I just ordered a new PC for CUDA 11 and will do the update for JCuda 11 first, but will then increase the priority for this question and try to write a (not-too-Java/JCuda-specific) self-answer (maybe as a communityWiki - I'd appreciate your input there) – Repro 22/7, 2020 at 12:41

_{I'm sorry I hadn't noticed this question sooner - as we might have collaborated on this somewhat. Also, it's not quite clear to me whether this question belongs here, on codereview.SX or on programmers.SX, but let's ignore all that.}

I have now done what you were aiming to do, and possibly more generally. So, I can offer both an example of what to do with "handles", and moreover, suggest the prospect of not having to implement this at all.

The library is an expanding of cuda-api-wrappers to also cover the Driver API and NVRTC; it is not yet release-grade, but it is in the testing phase, on this branch.

Now, to answer your concrete question:

Pattern for writing a class surrounding a raw "handle"

Are there any established patterns for implementing such a "Handle"?

Yes. If you read:

What is the difference between: Handle, Pointer and Reference

you'll notice a handle is defined as an "opaque reference to an object". It has some similarity to a pointer. A relevant pattern, therefore, is a variation on the PIMPL idiom: In regular PIMPL, you write an implementation class, and the outwards-facing class only holds a pointer to the implementation class and forwards method calls to it. When you have an opaque handle to an opaque object in some third-party library or driver - you use the handle to forward method calls to that implementation.

That means, that your outwards-facing class is not a handle, it represents the object to which you have a handle.

Generality and flexibility

Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?

I'm not sure what exactly CUBLAS does under the hood (and I have almost never used CUBLAS to be honest), but if it were well-designed and implemented, it would create its own context, and try to not to impinge on the rest of your code, i.e. it would alwas do:

Push our CUBLAS context onto the top of the stack
Do actual work
Pop the top of the context stack.

Your class doesn't do this.

More generally: Are there any recommendations of how to improve the current "Handle" implementation?

Yes:

Use RAII whenever it is possible and relevant. If your creation code allocates a resource (e.g. via the CUDA driver) - the destructor for the object you return should safely release those resources.
Allow for both reference-type and value-type use of Handles, i.e. it may be the handle I created, but it may also be a handle I got from somewhere else and isn't my responsibility. This is trivial if you leave it up to the user to release resources, but a bit tricky if you take that responsibility
You assume that if there's any current context, that's the one your handle needs to use. Says who? At the very least, let the user pass a context in if they want to.
Avoid writing the low-level parts of this on your own unless you really must. You are quite likely to miss some things (the push-and-pop is not the only thing you might be missing), and you're repeating a lot of work that is actually generic and not specific to your application or library. I may be biased here, but you can now use nice, RAII-ish, wrappers for CUDA contexts, streams, modules, devices etc. without even known about raw handles for anything.

Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?

To the best of my knowledge, NVIDIA hasn't released it.

Misdate answered 9/10, 2020 at 8:52 Comment(3)

Thanks for the answer. I might need some time to take a closer look (also at the actual repo), and refresh my memories about all the experiments that I did for this back when I asked it. It's not unlikely that I'll have some critical questions, mainly related to the combination of handles+threads+contexts (roughly things like: Can your handles be "transparently" used from different threads? Which contexts will they use? (Note: This does not seem to be possible with CUBLAS handles either, but e.g. the fact that cuCtxSetCurrent is not used in your repo at all confuses me a bit...)) – Repro 9/10, 2020 at 14:2

"Can your handles be "transparently" used from different threads?" Yes. "Which contexts will they use?" - whichever contexts you have them use. Context-specific wrappers get a context parameter on creation, and typically use push-if-necessary and pop-if-we-pushed. But if you use strictly-runtime-API entities without specifying or setting some special context, then runtime API behavior is imitated, i.e. it's usually some device's primary context that gets used. – Misdate 9/10, 2020 at 14:47

I wrote you a mail (and hope that's OK) - the comments may not be the best place to sort out the details, and as I mentioned, I might need some time to catch up here. – Repro 9/10, 2020 at 14:57

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Pattern for writing a class surrounding a raw "handle"

Generality and flexibility

Recommended topics

Hot tags