CUDA shared memory wrapped in templated class, points to same memory

I am trying to allocate shared memory in a CUDA kernel within a templated class:

template<typename T, int Size>
struct SharedArray {

    __device__ T* operator()(){
        __shared__ T x[Size];
        return x;
    }
};

This works as long as no shared memory of the same type and size is requested twice. But when I request shared memory with the same type and size two times, the second array points to the same memory as the first:

__global__
void test() {

    // Shared array
    SharedArray<int, 5> sharedArray;
    int* x0 = sharedArray();
    int* y0 = sharedArray();
    x0[0] = 1;
    y0[0] = 0;
    printf("%i %i\n\n", x0[0], y0[0]);
    // Prints:
    // 0 0
}

One solution is to add an ID to each use of the shared memory class, like:

template<int ID, typename T, int Size>
struct StaticSharedArrayWithID {

    __device__ static T* shared(){
        __shared__ T x[Size];
        return x;
    }   
};

But then I have to provide some counter, which makes for a very ugly user interface:

__global__
void test() {

    int* x1 = StaticSharedArrayWithID<__COUNTER__, int, 5>::shared();
    int* y1 = StaticSharedArrayWithID<__COUNTER__, int, 5>::shared();
    x1[0] = 1;
    y1[0] = 0;
    printf("%i %i\n\n", x1[0], y1[0]);
    // Prints:
    // 1 0
}

Does anyone have an idea how to get rid of the __COUNTER__ macro in the user interface? It is fine if it is hidden.

Trondheim asked 26/10, 2015 at 12:13 Comment(7)
while not solving your above problem, you might be interested in having a look at bulk::malloc. It is described in this presentation.Justinajustine
you could simply use a wrapper macro which uses __COUNTER__ internally, is this something you are looking for?Justinajustine
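For reference, such a wrapper macro might look roughly like this (just a sketch; SHARED_ARRAY is a made-up name):

// Hypothetical wrapper macro: every expansion evaluates __COUNTER__ anew,
// so each use produces a distinct StaticSharedArrayWithID instantiation.
#define SHARED_ARRAY(T, Size) \
    StaticSharedArrayWithID<__COUNTER__, T, Size>::shared()

__global__
void test() {
    int* x1 = SHARED_ARRAY(int, 5); // first array
    int* y1 = SHARED_ARRAY(int, 5); // second, distinct array
    x1[0] = 1;
    y1[0] = 0;
    printf("%i %i\n\n", x1[0], y1[0]); // prints: 1 0
}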
@Justinajustine I am trying to not introduce a new macro, when it would be possible.Trondheim
May I ask you for your use case of this implementation? From the examples I cannot see a benefit over just writing __shared__ int x[5];.Harvell
The use case is the kernel abstraction library alpaka, which provides shared memory also for kernels that are executed on CPUs or other accelerators. Thus there is some level of abstraction necessary.Trondheim
I can't see the deleted answer, but as far as I recall it was about creating a __shared__ member variable, which is forbidden in CUDA? (see #12707978 )Scoot
I wonder if the code examples in the question are well-defined. I mean declaring a shared variable in a function and returning a pointer to the variable which goes out of scope. Is this documented somewhere? Is it worth to make this a separate question?Harvell

The reason for this is that __shared__ variables are static by default: every invocation of the same function refers to the same variable. The original reason for this behavior is that the compiler cannot deduce when the memory can be reclaimed; making the variable static keeps it alive for as long as the kernel runs.

A side effect is that if the same function is called in two places in the program, you get the same memory back. In fact, that is what you expect anyway when multiple CUDA threads call your function at the same spot, don't you?
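Spelled out as a minimal sketch (get_shared and demo are made-up names):

// The __shared__ array inside a __device__ function behaves like a
// function-local static: one allocation per kernel, shared by all call sites.
__device__ int* get_shared() {
    __shared__ int x[5];
    return x;
}

__global__ void demo() {
    int* a = get_shared();
    int* b = get_shared();
    // a and b point to the same array, which is why the example in the
    // question prints "0 0"
}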

There is no clean way to allocate shared memory dynamically. In my projects I did it through my own shared memory manager (ugly pointer arithmetic ahead, beware!):

typedef unsigned char byte;

/*
  Simple shared memory manager.
    With any luck if invoked with constant parameters this will not take up any register whatsoever
    Must be called uniformly by whole block which is going to use these
    sSize - amount of preallocated memory
*/
template <size_t sSize>
class SharedMemoryManager {
private:
    byte* shArray;
    byte* head;
public:

    __device__ SharedMemoryManager() {
        __shared__ byte arr[sSize];
        shArray=arr;
        head=arr;
    }

    __device__  void reset() {
        head=shArray;
    }

    __device__  byte* getHead() {return head;}
    __device__  void setHead(byte* newHead) {head=newHead;}

    template <typename T>
    __device__ T* alloc(size_t count) {
      size_t addr = (size_t)head;
      size_t alignment = __alignof(T); //assuming alignment is a power of 2
      addr = ((addr-1) | (alignment-1)) + 1; //round up to match the alignment requirement
      T* var = (T*)addr;                      //the allocation starts at the aligned address
      head = (byte*)(addr + sizeof(T)*count); //bump the head past the new allocation
      return var;
    }

    template <typename T>
    __device__ T& alloc() {
      return *alloc<T>(1);
    }

};

You can use getHead/setHead to reclaim shared memory when you know it can be reclaimed, but only in a stack manner.
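A sketch of that stack-style reclamation, reusing the manager above (the kernel name is made up):

__global__
void stackReclaimExample() {
    SharedMemoryManager<1024> shMem;

    int* persistent = shMem.alloc<int>(32);   // stays valid for the whole kernel

    byte* mark = shMem.getHead();             // remember the current top of the "stack"
    float* scratch = shMem.alloc<float>(64);  // temporary working space
    // ... use persistent and scratch ...
    __syncthreads();                          // make sure everyone is done with scratch
    shMem.setHead(mark);                      // pop scratch; persistent stays valid
}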

This approach should be easy to abstract over non-shared memory when CUDA is not your target.

Then you should be able to write:

__global__
void test() {
   SharedMemoryManager<1024> shMem;

   int& xValue = shMem.alloc<int>();
   int& yValue = shMem.alloc<int>();
   int* xArray = shMem.alloc<int>(5);
   int* yArray = shMem.alloc<int>(5);

   xArray[0] = 1;
   yArray[0] = 0;
   printf("%i %i\n\n", xArray[0], yArray[0]);
   __syncthreads();
   shMem.reset(); //memory reclaimed
   ... 
}
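On the note about abstracting this over non-shared memory: a host-side analogue could keep the same alloc() interface while backing it with an ordinary buffer (just a sketch; HostMemoryManager is a made-up name):

#include <cstddef> // for size_t

template <size_t sSize>
class HostMemoryManager {
    unsigned char arr[sSize];
    unsigned char* head = arr;
public:
    void reset() { head = arr; }

    template <typename T>
    T* alloc(size_t count) {
        size_t addr = (size_t)head;
        size_t alignment = alignof(T);         // assuming a power of 2
        addr = ((addr-1) | (alignment-1)) + 1; // round up to the alignment
        T* var = (T*)addr;
        head = (unsigned char*)(addr + sizeof(T)*count);
        return var;
    }

    template <typename T>
    T& alloc() { return *alloc<T>(1); }
};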
Filia answered 27/10, 2015 at 19:9 Comment(2)
I guess one disadvantage here is that the shared memory address could be treated as a generic memory address, because the origin of the pointer is not known to the caller (it cannot be easily propagated by the compiler).Branham
Since CUDA inlines (almost) everything (kernels really do not like holding a stack, because you need one for every thread), the compiler is able to figure out the origin of the pointer. However, nowadays the memory space (global vs. shared) is encoded in the pointer address itself too, so the compiler does not need to know this anymore. This feature was introduced with the Kepler architecture, I believe.Filia
