How to implement device side CUDA virtual functions?
I see that CUDA doesn't allow classes with virtual functions to be passed into kernel functions. Are there any workarounds to this limitation?

I would really like to be able to use polymorphism within a kernel function.

Thanks!

Leahy asked 8/11, 2014 at 2:2 Comment(3)
You can use polymorphism within a CUDA kernel function. The objects simply need to be created on the device. This generally shouldn't be that difficult to do, even if you need to initialize those objects with data originating from the host. I provided an answer here which demonstrates the concept with thrust, but of course it can be made to work in ordinary CUDA code as well.Walterwalters
@Robert Crovella It appears the source of this problem is that the virtual function table's address is specific to the device. What effect would this have in a multi-GPU application where objects move between devices? E.g., I instantiate a polymorphic class on device 0, then memcpy the instance to device 1 (where there are also instances of the same class). Would the copied object break, or would it use device 1's virtual function table seamlessly?Leahy
I wouldn't expect it to work. The virtual function table is basically a set of pointers. Those pointers (addresses) are only going to be accurate for the device they are intended to be called on. In fact, I would expect UVA to pretty much guarantee that it would not work. You might be able to get it to work in a non-UVA setting, but I wouldn't count on it. However I'm just speculating here. I haven't tried it myself.Walterwalters
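To make the per-device reconstruction idea from the comments concrete, here is a minimal, untested sketch (the Base/Derived hierarchy and kernel names are hypothetical, and it assumes at least two GPUs are present): instead of memcpying a polymorphic object between devices, ship only its plain data and construct a fresh instance on each device, so every instance carries a vtable pointer valid for the device it lives on.

#include <cstdio>

struct Base
{
    __device__ virtual ~Base() {}
    __device__ virtual int Value() const = 0;
};

struct Derived : public Base
{
    int m_N;
    __device__ Derived(int n) : m_N(n) {}
    __device__ int Value() const { return m_N; }
};

// Construct on whichever device is current, so the embedded
// vtable pointer refers to that device's copy of the code.
__global__ void construct(Base** obj, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        (*obj) = new Derived(n);
}

__global__ void use(Base** obj, int* out)
{
    // Virtual dispatch through this device's own vtable.
    *out = (*obj)->Value();
}

__global__ void destroy(Base** obj)
{
    delete *obj;
}

int main()
{
    // Assumes two GPUs; only the POD state (here: 42) crosses
    // the host/device boundary, never the polymorphic object.
    for (int dev = 0; dev < 2; ++dev)
    {
        cudaSetDevice(dev);
        Base** d_obj;
        int* d_out;
        int h_out = 0;
        cudaMalloc(&d_obj, sizeof(Base*));
        cudaMalloc(&d_out, sizeof(int));
        construct<<<1, 1>>>(d_obj, 42);
        use<<<1, 1>>>(d_obj, d_out);
        cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
        printf("device %d -> %d\n", dev, h_out);
        destroy<<<1, 1>>>(d_obj);
        cudaFree(d_out);
        cudaFree(d_obj);
    }
    return 0;
}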

The most important part of Robert Crovella's comment is:

The objects simply need to be created on the device.

So keeping that in mind, I was dealing with a situation where I had an abstract class Function and then some implementations of it, each encapsulating a different function and its evaluation. This is a simplified version of how I achieved polymorphism in my situation; I am not saying it cannot be done better, but it will hopefully help you get the idea:

class Function
{
public:
    __device__ Function() {}
    __device__ virtual ~Function() {}
    __device__ virtual void Evaluate(const real* __restrict__ positions, real* fitnesses, const SIZE_TYPE particlesCount) const = 0;
};

class FunctionRsj : public Function
{
private:
    SIZE_TYPE m_DimensionsCount;
    SIZE_TYPE m_PointsCount;
    real* m_Y;
    real* m_X;
public:
    __device__ FunctionRsj(const SIZE_TYPE dimensionsCount, const SIZE_TYPE pointsCount, real* configFileData)
        : m_DimensionsCount(dimensionsCount),
            m_PointsCount(pointsCount),
            m_Y(configFileData),
            m_X(configFileData + pointsCount) {}

    __device__ ~FunctionRsj()
    {
        // m_Y points to the beginning of the config
        // file data, use it for destruction as this 
        // object took ownership of configFileData.
        delete[] m_Y;
    }

    __device__ void Evaluate(const real* __restrict__ positions, real* fitnesses, const SIZE_TYPE particlesCount) const
    {
        // Implement evaluation of FunctionRsj here.
    }
};

__global__ void evaluate_fitnesses(
    const real* __restrict__ positions,
    real* fitnesses,
    Function const* const* __restrict__ function,
    const SIZE_TYPE particlesCount)
{
    // This whole kernel is just a proxy as kernels
    // cannot be member functions.
    (*function)->Evaluate(positions, fitnesses, particlesCount);
}

__global__ void create_function(
    Function** function,
    SIZE_TYPE dimensionsCount,
    SIZE_TYPE pointsCount,
    real* configFileData)
{
    // It is necessary to create object representing a function
    // directly in global memory of the GPU device for virtual
    // functions to work correctly, i.e. virtual function table
    // HAS to be on GPU as well.
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        (*function) = new FunctionRsj(dimensionsCount, pointsCount, configFileData);
    }
}

__global__ void delete_function(Function** function)
{
    delete *function;
}

int main()
{
    // Let's just assume d_FunctionConfigData, d_Positions,
    // and d_Fitnesses are arrays already allocated on the GPU ...

    // Create function.
    Function** d_Function;
    cudaMalloc(&d_Function, sizeof(Function*));
    create_function<<<1, 1>>>(d_Function, 10, 10, d_FunctionConfigData);

    // Evaluate using proxy kernel.
    evaluate_fitnesses<<<
        m_Configuration.GetEvaluationGridSize(),
        m_Configuration.GetEvaluationBlockSize(),
        m_Configuration.GetEvaluationSharedMemorySize()>>>(
        d_Positions,
        d_Fitnesses,
        d_Function,
        m_Configuration.GetParticlesCount());

    // Delete function object on GPU, then free the
    // device-side holder for the pointer itself.
    delete_function<<<1, 1>>>(d_Function);
    cudaDeviceSynchronize();
    cudaFree(d_Function);
}
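One practical addition to the sketch above (not part of the original answer): kernel launches are asynchronous, and a failed device-side new would otherwise go unnoticed, so in real code it is worth synchronizing and checking errors after each launch. A minimal sketch; the checkCuda wrapper is a hypothetical helper, while the runtime calls it wraps are standard CUDA:

#include <cstdio>
#include <cstdlib>

static void checkCuda(cudaError_t err, const char* what)
{
    if (err != cudaSuccess)
    {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Usage after each launch in main() above, e.g.:
//   create_function<<<1, 1>>>(d_Function, 10, 10, d_FunctionConfigData);
//   checkCuda(cudaGetLastError(), "create_function launch");
//   checkCuda(cudaDeviceSynchronize(), "create_function execution");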
Townspeople answered 8/11, 2014 at 10:19 Comment(3)
This is exactly what I was looking for. Thank you!! One follow up question: isn't calling new or malloc within a kernel or device function a huge performance hit?Leahy
@Leahy Well, in my case I haven't experienced any significant overhead, and all in all I couldn't imagine a better way to solve my problem, so even if malloc or new had hurt performance, I would not have cared much. I compared it with a simple approach without the abstract class (creating the function on the CPU and then using cudaMemcpy), and the performance was almost the same. My suggestion: implement the solution and optimize only if it is necessary.Townspeople
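If device-side new/malloc ever does become a problem, one related knob worth knowing (not mentioned in the answer) is the size of the device heap that in-kernel new/malloc allocates from: it defaults to 8 MB and can be raised with the standard runtime call cudaDeviceSetLimit before the first kernel that allocates is launched. A minimal sketch, with a hypothetical 64 MB budget:

#include <cstdio>

int main()
{
    // In-kernel new/malloc draw from a fixed-size device heap.
    // Raise the limit before launching any kernel that allocates.
    size_t heapBytes = 64u * 1024u * 1024u;  // hypothetical budget
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);

    size_t actual = 0;
    cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
    printf("device malloc heap: %zu bytes\n", actual);
    return 0;
}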
Do the constructor and destructor also have to be prefixed with the __device__ keyword? If so, why is it necessary?Masterwork
