Dynamic Shared Memory in CUDA

There are similar questions to what I'm about to ask, but I feel like none of them get at the heart of what I'm really looking for. What I have now is a CUDA method that requires defining two arrays into shared memory. Now, the size of the arrays is given by a variable that is read into the program after the start of execution. Because of this, I cannot use that variable to define the size of the arrays, due to the fact that defining the size of shared arrays requires knowing the value at compile time. I do not want to do something like __shared__ double arr1[1000] because typing in the size by hand is useless to me as that will change depending on the input. In the same vein, I cannot use #define to create a constant for the size.

Now I can follow an example similar to what is in the manual such as

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array; 
    float* array1 = (float*)&array0[128];
    int*   array2 =   (int*)&array1[64];
}

But this still runs into an issue. From what I've read, defining a shared array always makes the memory address the first element. That means I need to make my second array shifted over by the size of the first array, as they appear to do in this example. But the size of the first array is dependent on user input.

Another question (Cuda Shared Memory array variable) has a similar issue, and they were told to create a single array that would act as the array for both arrays and simply adjust the indices to properly match the arrays. While this does seem to do what I want, it looks very messy. Is there any way around this so that I can still maintain two independent arrays, each with sizes that are defined as input by the user?

$ cat t501.cu #include <stdio.h> __global__ void my_kernel(unsigned arr1_sz, unsigned arr2_sz){ extern __shared__ char array[]; double *my_ddata = (double *)array; char *my_cdata = arr1_sz*sizeof(double) + array; for (int i = 0; i < arr1_sz; i++) my_ddata[i] = (double) i*1.1f; for (int i = 0; i < arr2_sz; i++) my_cdata[i] = (char) i; printf("at offset %d, arr1: %lf, arr2: %d\n", 10, my_ddata[10], (int)my_cdata[10]); } int main(){ unsigned double_array_size = 256; unsigned char_array_size = 128; unsigned shared_mem_size = (double_array_size*sizeof(double)) + (char_array_size*sizeof(char)); my_kernel<<<1,1, shared_mem_size>>>(256, 128); cudaDeviceSynchronize(); return 0; } $ nvcc -arch=sm_20 -o t501 t501.cu $ cuda-memcheck ./t501 ========= CUDA-MEMCHECK at offset 10, arr1: 11.000000, arr2: 10 ========= ERROR SUMMARY: 0 errors $

Recommended topics

Hot tags