Templated CUDA kernel with dynamic shared memory

I want to call different instantiations of a templated CUDA kernel with dynamically allocated shared memory in one program. My first naive approach was to write:

template<typename T>
__global__ void kernel(T* ptr)
{
  extern __shared__ T smem[];
  // calculations here ...
}

template<typename T>
void call_kernel( T* ptr, const int n )
{
  dim3 dimBlock(n), dimGrid;
  kernel<<<dimGrid, dimBlock, n*sizeof(T)>>>(ptr);
}

int main(int argc, char *argv[])
{
  const int n = 32;
  float *float_ptr;
  double *double_ptr;
  cudaMalloc( (void**)&float_ptr, n*sizeof(float) );
  cudaMalloc( (void**)&double_ptr, n*sizeof(double) );

  call_kernel( float_ptr, n );
  call_kernel( double_ptr, n ); // problem, 2nd instantiation

  cudaFree( (void*)float_ptr );
  cudaFree( (void*)double_ptr );
  return 0;
}

However, this code does not compile. nvcc gives me the following error message:

main.cu(4): error: declaration is incompatible with previous "smem"
(4): here
          detected during:
            instantiation of "void kernel(T *) [with T=double]"
(12): here
            instantiation of "void call_kernel(T *, int) [with T=double]"
(24): here

I understand that I am running into a name conflict because the shared memory is declared extern. Nevertheless, as far as I know, there is no way around that if I want to set its size at runtime.

So, my question is: is there an elegant way to obtain the desired behavior? By elegant I mean without code duplication, etc.

Chari asked 19/12, 2014 at 16:58. Comments (1):
Possibly an oversight in the CUDA compiler, as this is otherwise allowed in C++ (without the __shared__ qualifier). – Uralite

Dynamically allocated shared memory is really just a size (in bytes) and a pointer being set up for the kernel. So something like this should work:

replace this:

extern __shared__ T smem[];

with this:

extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
T *smem = reinterpret_cast<T *>(my_smem);

The programming guide shows other examples of re-casting dynamically allocated shared memory pointers, which can serve other needs.
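
For reference, a minimal end-to-end sketch of the question's example with this replacement applied (the kernel body, a round trip through shared memory, is illustrative and not part of the original):

template <typename T>
__global__ void kernel(T* ptr)
{
  // Every instantiation now declares my_smem with the same underlying
  // type (unsigned char), so the conflicting-declaration error is gone.
  extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
  T* smem = reinterpret_cast<T*>(my_smem);
  smem[threadIdx.x] = ptr[threadIdx.x];
  __syncthreads();
  ptr[threadIdx.x] = smem[threadIdx.x];
}

template <typename T>
void call_kernel(T* ptr, const int n)
{
  dim3 dimBlock(n), dimGrid;
  kernel<<<dimGrid, dimBlock, n * sizeof(T)>>>(ptr);
}

int main()
{
  const int n = 32;
  float* float_ptr;
  double* double_ptr;
  cudaMalloc((void**)&float_ptr, n * sizeof(float));
  cudaMalloc((void**)&double_ptr, n * sizeof(double));

  call_kernel(float_ptr, n);   // instantiates kernel<float>
  call_kernel(double_ptr, n);  // instantiates kernel<double>; now compiles

  cudaFree((void*)float_ptr);
  cudaFree((void*)double_ptr);
  return 0;
}

As the comment thread below discusses, instantiating the kernel for several types in one translation unit may trigger an alignment warning from the differing __align__(sizeof(T)) specifications; dropping the __align__ qualifier silences it.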

EDIT: updated my answer to reflect the comment by @njuffa.

Heard answered 19/12, 2014 at 17:12. Comments (10):
Conservatively, shouldn't one take possible pointer alignment issues into account? I usually declare my_smem[] to be of type double2 to ensure 16-byte alignment, then cast the pointer to type T. – Allodium
This raises the question of whether the pointer created by the dynamic shared memory allocation process that is passed to the threadblock will be 16-byte aligned (or have any alignment at all). It seems likely to me that it will, but since I don't know that it is specified, I agree that your way seems better. Certainly global memory allocations have a defined alignment that exceeds even the alignment of any vector type. Modified my answer. – Heard
Based on the CUDA documentation, I am not aware of any alignment guarantees, which is why I have always used the double2 declaration as a conservative approach. Use of the __align__ attribute should also work, of course, and is arguably cleaner. – Allodium
I don't think this will work, because the alignment specification is different for different instantiations of the template. Also, I think it's unnecessary, since (the start of) shared memory should be very well aligned. @njuffa, what do you think? – Milly
@Milly I have already said everything I wanted to say: my method is to declare the dynamic shared memory to be of type double2 to force 16-byte alignment, then cast the double2 pointer to a pointer of whatever T the program needs (see the sketch after this comment thread). It is not clear what "this" refers to when you say "I don't think this will work"; you would need to be more specific. – Allodium
@njuffa: "this" = RobertCrovella's code, which is not the same as your suggestion. Also, are we not certain that the shared memory will be very well aligned (i.e. 16 KiB-aligned or something)? Plus, I've added my own variation as an answer. – Milly
As I expressed previously (comment above), I am not certain that dynamic shared memory comes with any alignment guarantees other than that the alignment is suitable for the type used in the declaration. If you can point to a specific place in the CUDA documentation that talks about alignment guarantees for dynamic shared memory, that could improve the answer. – Allodium
@RobertCrovella: What do you feel about my suggestion, as opposed to the user casting the pointer themselves? – Milly
I don't think we need a sidebar discussion; you've already posted an answer, so readers have choices. You've stated that "this won't work". According to my testing, what I've posted here works but issues warnings (when multiple different templated sizes are used, e.g. with multiple instantiations of the templated kernel for different sizes). And the way to get rid of the warning is to drop the use of the align directive, which your answer doesn't include. – Heard
So if we agree to drop the align directive, my answer works without even warnings. Your most recent statement includes "as opposed to the user casting the pointer themselves." If you think that doing that in a templated library is an important distinction, then we have your answer. Not sure there is much to discuss here. And the format here doesn't encourage sidebar discussions. So I'm unlikely to respond to further inquiries of that type. – Heard
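
For completeness, a minimal sketch of the double2 approach @njuffa describes in the comments above (the kernel body, a copy through shared memory, is illustrative):

template <typename T>
__global__ void kernel(T* ptr)
{
  // Declaring the dynamic shared memory as double2 forces 16-byte
  // alignment; the pointer is then cast to whatever T is needed.
  // Every instantiation declares smem_raw with the same type, so the
  // original conflicting-declaration error also goes away.
  extern __shared__ double2 smem_raw[];
  T* smem = reinterpret_cast<T*>(smem_raw);
  smem[threadIdx.x] = ptr[threadIdx.x];
  __syncthreads();
  ptr[threadIdx.x] = smem[threadIdx.x];
}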

(A variation on @RobertCrovella's answer)

NVCC is not willing to accept two extern __shared__ arrays of the same name but different types - even if they're never in each other's scope. We'll need to satisfy NVCC by having our template instances all use the same type for the shared memory under the hood, while letting the kernel code using them see the type it likes.

So we replace this declaration:

extern __shared__ T smem[];

with this one:

auto smem = shared_memory_proxy<T>();

where:

template <typename T>
__device__ T* shared_memory_proxy()
{
    // do we need an __align__() here? I don't think so...
    extern __shared__ unsigned char memory[];
    return reinterpret_cast<T*>(memory);
}

is in some device-side code include file.
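
A minimal usage sketch, assuming the proxy above is visible via such an include (the kernel body and the launcher are illustrative):

template <typename T>
__global__ void kernel(T* ptr)
{
  auto smem = shared_memory_proxy<T>();  // typed view of the dynamic shared memory
  smem[threadIdx.x] = ptr[threadIdx.x];
  __syncthreads();
  ptr[threadIdx.x] = smem[threadIdx.x];
}

template <typename T>
void call_kernel(T* ptr, const int n)
{
  // The dynamic shared memory size is still specified at launch time.
  kernel<<<1, n, n * sizeof(T)>>>(ptr);
}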

Advantages:

  • One-liner at the site of use.
  • Simpler syntax to remember.
  • Separation of concerns: whoever reads the kernel doesn't have to think about why they're seeing extern, alignment specifiers, a reinterpret_cast, etc.

Notes:

  • This is implemented as part of my CUDA kernel author's tools header-only library: shared_memory.cuh (where it's named shared_memory::dynamic::proxy()).
  • I have not explored the question of alignment, when you use both dynamic and static shared memory.
Milly answered 11/3, 2018 at 20:05. Comments (4):
I know this is a really old answer, but I'm still curious. You comment `do we need an __align__() here? I don't think so...` in your last snippet. Assuming this shared memory is going to be used to store anything that is not unsigned char, isn't __align__ explicitly required to avoid undefined behaviour? Adding it triggers `warning #1886-D: specified alignment (xxx) is different from alignment (1)` when using this function in various template instantiations... – Trainee
@Erel: No, not necessarily, since this isn't some arbitrarily allocated slab of host or device memory: it's the block's shared memory. So far I've never had alignment issues with shared memory. But since I'm speculating rather than relying on guarantees, I thought I should include the comment. That warning message is a bit suspicious; consider asking a question about it and linking to my answer, and maybe even filing a bug report about it. – Milly
Note that your answer does not solve the alignment issue: in the answer itself, you just dismiss it as "I don't think we need to care about alignment", and in your library, you say "Do we need this alignment? [of 1024 bytes] Probably not". However, if you combine static and dynamic shared memory, you'll end up with much more shared memory required than actually needed with an alignment of 1024. – Sardonic
@MattKo: Have you read the two comments above yours? Anyway, I added a comment below the code snippet to clarify that I'm ignoring alignment. – Milly
