static openCL class not properly released in python module using boost.python

EDIT: Ok, all the edits made the layout of the question a bit confusing so I will try to rewrite the question (not changing the content, but improving its structure).

The issue in short

I have an OpenCL program that works fine if I compile it as an executable. Now I am trying to make it callable from Python using Boost.Python. However, as soon as I exit Python (after importing my module), Python crashes.

The cause seems to be related to statically storing only GPU CommandQueues and to how they are released when the program terminates.

MWE and setup

Setup

  • IDE used: Visual Studio 2015

  • OS used: Windows 7 64bit

  • Python version: 3.5

  • AMD OpenCL APP 3.0 headers

  • cl2.hpp directly from Khronos, as suggested in the question "empty openCL program throws deprecation warning"

  • Also I have an Intel CPU with integrated graphics hardware and no other dedicated graphics card

  • I use version 1.60 of the boost library compiled as 64-bit versions

  • The boost dll I use is called: boost_python-vc140-mt-1_60.dll

  • The openCL program without python works fine

  • The python module without openCL works fine

MWE

#include <algorithm> // std::remove_if
#include <vector>

#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#define CL_HPP_MINIMUM_OPENCL_VERSION 200 // I have the same issue for 100 and 110
#include "cl2.hpp"
#include <boost/python.hpp>

using namespace std;

class TestClass
{
private:
    std::vector<cl::CommandQueue> queues;
    TestClass();

public:
    static const TestClass& getInstance()
    {
        static TestClass instance;
        return instance;
    }
};

TestClass::TestClass()
{
    std::vector<cl::Device> devices;
    vector<cl::Platform> platforms;

    cl::Platform::get(&platforms);

    //remove non 2.0 platforms (as suggested by doqtor)
    platforms.erase(
        std::remove_if(platforms.begin(), platforms.end(),
            [](const cl::Platform& platform)
            {
                int v = cl::detail::getPlatformVersion(platform());
                short version_major = v >> 16;
                return !(version_major >= 2);
            }),
        platforms.end());

    //Get all available GPUs
    for (const cl::Platform& pl : platforms)
    {
        vector<cl::Device> plDevices;
        try {
            pl.getDevices(CL_DEVICE_TYPE_GPU, &plDevices);
        }
        catch (cl::Error&)
        {

            // Doesn't matter. No GPU is available on the current machine for 
            // this platform. Just check afterwards, that you have at least one
            // device
            continue;
        }       
        devices.insert(end(devices), begin(plDevices), end(plDevices));
    }

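    // The check announced in the catch block above: without a GPU device,
    // devices[0] would be undefined behavior.
    if (devices.empty())
        return;

    cl::Context context(devices[0]);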
    cl::CommandQueue queue(context, devices[0]);

    queues.push_back(queue);
}

int main()
{
    TestClass::getInstance();

    return 0;
}

BOOST_PYTHON_MODULE(FrameWork)
{
    TestClass::getInstance();
}

Calling program

After compiling the program as a DLL, I start Python and run the following program:

import FrameWork
exit()

While the import works without issues, python crashes on exit(). So I click on debug and Visual Studio tells me there was an exception in the following code section (in cl2.hpp):

template <>
struct ReferenceHandler<cl_command_queue>
{
    static cl_int retain(cl_command_queue queue)
    { return ::clRetainCommandQueue(queue); }
    static cl_int release(cl_command_queue queue)  //  --  HERE  --
    { return ::clReleaseCommandQueue(queue); }
};

If you compile the above code instead as a simple executable, it works without issues. Also the code works if one of the following is true:

  • CL_DEVICE_TYPE_GPU is replaced by CL_DEVICE_TYPE_ALL

  • the line queues.push_back(queue) is removed

Question

So what could be the reason for this, and what are possible solutions? I suspect it has something to do with the fact that my TestClass is static, but since it works with the executable, I am at a loss as to what is causing it.

Christopher answered 22/1, 2016 at 15:29 Comment(3)
Please study the posting guidelines, there is a bunch of info missing. - Elkin
@UlrichEckhardt added a MWE. Does this help you? - Christopher
I don't need help, you do. Your example probably isn't minimal, three files is a bunch. There's other info missing. - Elkin
D
4

I came across a similar problem in the past.

The device-level clRetain*/clRelease* functions (clRetainDevice, clReleaseDevice) are only supported from OpenCL 1.2 onward. When you get the devices of the first GPU platform (platforms[0].getDevices(...) with CL_DEVICE_TYPE_GPU), that platform is presumably a pre-OpenCL 1.2 one in your case, hence the crash. When you ask for devices of any type (GPU/CPU/...), your first platform happens to be an OpenCL 1.2+ one and everything is fine.

To fix the problem set:

#define CL_HPP_MINIMUM_OPENCL_VERSION 110

This ensures that calls to clRetain* aren't made for unsupported platforms (pre-OpenCL 1.2).


Update: I think there is a bug in cl2.hpp: despite the minimum OpenCL version being set to 1.1, it still tries to use clRetain* on pre-OpenCL 1.2 devices when creating a command queue. Setting the minimum OpenCL version to 110 and filtering the platforms by version works fine for me.

Complete working example:

#include "stdafx.h"
#include <vector>

#define CL_HPP_ENABLE_EXCEPTIONS
#define CL_HPP_TARGET_OPENCL_VERSION 200
#define CL_HPP_MINIMUM_OPENCL_VERSION 110
#include <CL/cl2.hpp>

using namespace std;

class TestClass
{
private:
    std::vector<cl::CommandQueue> queues;
    TestClass();

public:
    static const TestClass& getInstance()
    {
        static TestClass instance;
        return instance;
    }
};

TestClass::TestClass()
{
    std::vector<cl::Device> devices;
    vector<cl::Platform> platforms;

    cl::Platform::get(&platforms);

    size_t x = 0;
    for (; x < platforms.size(); ++x)
    {
        cl::Platform &p = platforms[x];
        int v = cl::detail::getPlatformVersion(p());
        short version_major = v >> 16;
        if (version_major >= 2) // OpenCL 2.x
            break;
    }
    if (x == platforms.size())
        return; // no OpenCL 2.0 platform available

    platforms[x].getDevices(CL_DEVICE_TYPE_GPU, &devices); 
    cl::Context context(devices);
    cl::CommandQueue queue(context, devices[0]);

    queues.push_back(queue); 
}

int main()
{
    TestClass::getInstance();
    return 0;
}
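The version check in the example above relies on cl::detail::getPlatformVersion returning the major version in the upper 16 bits of the result. A minimal sketch of that decoding, runnable without any OpenCL headers, could look like this. It assumes the minor version sits in the lower 16 bits, and packVersion, majorOf, minorOf, and isAtLeast are hypothetical helper names that cl2.hpp does not provide.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not part of cl2.hpp.
// Assumed layout: major version in the upper 16 bits, minor in the lower 16.
constexpr std::uint32_t packVersion(std::uint32_t major, std::uint32_t minor)
{
    return (major << 16) | (minor & 0xFFFFu);
}

constexpr std::uint32_t majorOf(std::uint32_t packed) { return packed >> 16; }
constexpr std::uint32_t minorOf(std::uint32_t packed) { return packed & 0xFFFFu; }

// Because the major version occupies the high bits, comparing the packed
// values orders versions correctly (1.1 < 1.2 < 2.0).
constexpr bool isAtLeast(std::uint32_t packed, std::uint32_t major, std::uint32_t minor)
{
    return packed >= packVersion(major, minor);
}

// Compile-time checks of the decoding.
static_assert(majorOf(packVersion(2, 0)) == 2, "major is taken from the upper half");
static_assert(minorOf(packVersion(1, 2)) == 2, "minor is taken from the lower half");
static_assert(!isAtLeast(packVersion(1, 1), 1, 2), "1.1 is older than 1.2");
```

With helpers like these, the filter condition `version_major >= 2` corresponds to `isAtLeast(v, 2, 0)`.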

Update2:

"So what could be the reason for this and what are possible solutions? I suspect it has something to do with the fact that my testclass is static, but since it works with the executable I am at a loss what is causing it."

The static TestClass seems to be the reason. It looks like memory is released in the wrong order when the code runs from Python. To fix that, you may want to add a method that must be called explicitly to release the OpenCL objects before Python starts releasing memory.

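// Both functions are public members of TestClass.
static TestClass& getInstance() // <- const removed
{
    static TestClass instance;
    return instance;
}

void release()
{
    // Destroys the stored cl::CommandQueue objects now, which triggers
    // clReleaseCommandQueue at a well-defined time.
    queues.clear();
}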

BOOST_PYTHON_MODULE(FrameWork)
{
    TestClass::getInstance();
    TestClass::getInstance().release();
}
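The idea of an explicit release step can be tried out without OpenCL or Boost.Python. The following is only a sketch under that assumption: FakeQueue is a hypothetical stand-in for cl::CommandQueue that counts live handles, and Registry is a hypothetical stand-in for TestClass. It shows that after release() the static instance no longer owns any handle, so its destructor has nothing left to release at shutdown, and that calling release() twice is harmless.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cl::CommandQueue: counts how many handles are alive.
struct FakeQueue {
    static int& alive() { static int n = 0; return n; }
    FakeQueue() { ++alive(); }
    FakeQueue(const FakeQueue&) { ++alive(); }
    FakeQueue& operator=(const FakeQueue&) = default;
    ~FakeQueue() { --alive(); }
};

// Hypothetical stand-in for TestClass.
class Registry {
    std::vector<FakeQueue> queues;
    Registry() { queues.emplace_back(); }

public:
    Registry(const Registry&) = delete;
    Registry& operator=(const Registry&) = delete;

    static Registry& getInstance()
    {
        static Registry instance;
        return instance;
    }

    // Drops all handles at a time the caller chooses; safe to call repeatedly.
    void release() { queues.clear(); }

    std::size_t size() const { return queues.size(); }
};
```

In the real module, release() would have to run while the OpenCL driver is still loaded. One possible arrangement is to expose it to Python and call it before the interpreter exits, for example through Python's atexit module.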
Dihedron answered 26/1, 2016 at 19:22 Comment(8)
But all devices I have are OpenCL 2.0 and I even have the minimum version set to 200. Or do I misunderstand you? - Christopher
Change CL_HPP_MINIMUM_OPENCL_VERSION to 110 or 100 and check. - Dihedron
So I changed to CL_HPP_MINIMUM_OPENCL_VERSION 110 and 100, and both times Python crashes when I enter exit(). Also, in my real code I actually manually remove all platforms < OpenCL 2.0 and also all devices < OpenCL 2.0. - Christopher
Thanks for the update. Unfortunately, this does not help. What you do is compile a standalone executable. This worked for me from the beginning. I only get issues if I compile as a DLL and import it into Python. The standalone executable works without errors for me. I will update my question, trying to make everything a bit more clear. - Christopher
@MrZ Explicit release of CommandQueue objects should fix the Python crash - see my latest update. - Dihedron
Now I get a Microsoft Visual C++ Runtime Library error: "Runtime Error! Program: C:\Program Files\Python35\python.exe R6025 - pure virtual function call" when I enter exit() in Python. - Christopher
No, it is working. But I am still wondering why it does not work without this "hack". - Christopher
Probably something that belongs to CommandQueue is freed before the call to ::clReleaseCommandQueue. If there are more static objects inside the Python implementation, there can be an order-of-destruction issue. - Dihedron
A
2

"I would appreciate an answer that explains to me what the problem actually is and if there are ways to fix it."

First, let me say that doqtor already answered how to fix the issue -- by ensuring a well-defined destruction time of all used OpenCL resources. IMO, this is not a "hack", but the right thing to do. Trying to rely on static init/cleanup magic to do the right thing -- and watching it fail to do so -- is the real hack!

Second, some thoughts about the issue: the actual problem is even more complex than the common static initialization order fiasco stories. It involves DLL loading/unloading order, both in connection with python loading your custom dll at runtime and (more importantly) with OpenCL's installable client driver (ICD) model.

What DLLs are involved when running an application/dll that uses OpenCL? To the application, the only relevant DLL is the opencl.dll you link against. It is loaded into process memory during application startup time (or when your custom DLL which needs opencl is dynamically loaded in python). Then, at the time when you first call clGetPlatformInfo() or similar in your code, the ICD logic kicks in: opencl.dll will look for installed drivers (on Windows, those are mentioned somewhere in the registry) and dynamically load their respective dlls (using something like the LoadLibrary() API call). That may be e.g. nvopencl.dll for nvidia, or some other dll for the intel driver you have installed. Now, in contrast to the relatively simple opencl.dll, this ICD dll can and will have a multitude of dependencies of its own -- probably using Intel IPP, or TBB, or whatever. So by now, things have become real messy already.

Now, during shutdown, the windows loader must decide which dlls to unload in which order. When you compile your example in a single executable, the number and order of dlls being loaded/unloaded will certainly be different than in the "python loads your custom dll at runtime" scenario. And that could well be the reason why you experience the problem only in the latter case, and only if you still have an opencl-context+commandqueue alive during shutdown of your custom dll. The destruction of your queue (triggered via clRelease... during static destruction of your testclass instance) is delegated to the intel-icd-dll, so this dll must still be fully functional at that time. If, for some reason, that is not the case (perhaps because the loader chose to unload it or one of the dlls it needs), you crash.
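The unload-order argument above can be illustrated with a toy model that needs no OpenCL at all. In this sketch, MockDriver stands in for the vendor ICD DLL and QueueHandle for a cl::CommandQueue; both names and the loaded flag are hypothetical. In the real process a late release crashes, whereas here it merely reports failure so that the effect of the ordering can be observed.

```cpp
// Hypothetical stand-in for the vendor ICD DLL; not a real OpenCL type.
struct MockDriver {
    bool loaded = true;  // becomes false once the loader has "unloaded" the DLL
    int released = 0;    // number of queue releases the driver has serviced
};

// Hypothetical stand-in for cl::CommandQueue: releasing it must call into the driver.
class QueueHandle {
    MockDriver* driver;
    bool held = true;

public:
    explicit QueueHandle(MockDriver& d) : driver(&d) {}

    // Returns true if the release reached a live driver, or if nothing was
    // left to release. Returns false where the real program would crash.
    bool release()
    {
        if (!held) return true;
        held = false;
        if (!driver->loaded) return false;
        ++driver->released;
        return true;
    }
};
```

Releasing while the driver is still loaded succeeds; releasing after the driver is gone is the failing case, which matches the scenario where static destruction runs after the loader has already torn down the ICD DLL or one of its dependencies.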

That line of thought reminded me of this article:

https://blogs.msdn.microsoft.com/larryosterman/2004/06/10/dll_process_detach-is-the-last-thing-my-dlls-going-to-see-right/

There's a paragraph, talking about "COM objects", which might be equally applicable to "OpenCL resources":

"So consider the case where you have a DLL that instantiates a COM object at some point during its lifetime. If that DLL keeps a reference to the COM object in a global variable, and doesn’t release the COM object until the DLL_PROCESS_DETACH, then the DLL that implements the COM object will be kept in memory during the lifetime of the COM object. Effectively the DLL implementing the COM object has become dependant on the DLL that holds the reference to the COM object. But the loader has no way of knowing about this dependency. All it knows is that the DLL’s are loaded into memory."


Now, I wrote a lot of words without coming to a definitive proof of what's actually going wrong. The main lesson I learned from bugs like these is: don't enter that snake pit, and do your resource-cleanup in a well-defined place like doqtor suggested. Good night.

Ambrosio answered 31/1, 2016 at 21:32 Comment(1)
Hm, having new and delete in your code nowadays is frowned upon and RAII is preferred, so that clean-up happens automagically when variables go out of scope. But I guess at the level of DLLs there is still the need for manual clean-up when unloading, until at some point in the future we may have a similar concept for linked libraries going out of scope. - Christopher
