CUDA shared memory array variable

I am trying to declare a variable for matrix multiplication as follows:

__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

I am trying to make it so the user can input the size of the matrix to calculate; however, that would mean changing BLOCK_SIZE. I changed it, but now I am getting a compiler error:

error: constant value is not known

I've looked into it and it's similar to this thread. So I tried:

__shared__ int buf [];

But then I get:

error: incomplete type is not allowed

Thanks, Dan

Update with code (I pretty much followed this guide and the Starting Out with CUDA guide):
The block size comes from asking the user for the size of the matrix; they enter x and y. The block size is just x, and right now x and y have to be the same.

__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB, size_t block_size)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
    
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    
    // Index of the first sub-matrix of A processed 
    // by the block
    int aBegin = wA * block_size * by;
 
    // Index of the last sub-matrix of A processed 
    // by the block
    int aEnd   = aBegin + wA - 1;
 
    // Step size used to iterate through the 
    // sub-matrices of A
    int aStep  = block_size;
 
    // Index of the first sub-matrix of B processed 
    // by the block
    int bBegin = block_size * bx;
 
    // Step size used to iterate through the 
    // sub-matrices of B
    int bStep  = block_size * wB;
    float Csub=0;
    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) 
    {
        // Note: every extern __shared__ declaration refers to the same
        // dynamically sized allocation, so As, Bs and smem below all
        // alias the same address; only smem is actually used.
        extern __shared__ float As[];
        extern __shared__ float Bs[];
        extern __shared__ float smem[];

        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        smem[ty*block_size+tx] = A[a + wA * ty + tx];
        //cuPrintf("\n\nWhat are the memory locations?\n");
        //cuPrintf("The shared memory(A) is: %.2f\n",smem[ty*block_size+tx]);
        smem[block_size*block_size+ty*block_size+tx]  = B[b + wB * ty + tx];
        //cuPrintf("The shared memory(B) is: %.2f\n",smem[block_size*block_size+ty*block_size+tx]);
        // Synchronize to make sure the matrices 
        // are loaded
        __syncthreads();
 
        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < block_size; ++k)
        {
        
            Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx] ;
            //cuPrintf("Csub is currently: %.2f\n",Csub);
        }
        //cuPrintf("\n\n\n");
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        //cuPrintf("the results are csub: %.2f\n",Csub);
        __syncthreads();
    }
    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * block_size * by + block_size * bx;
    C[c + wB * ty + tx] = Csub;
    
    
}
Carrell asked 8/2, 2012 at 4:36
extern __shared__ int buf[];

When you launch the kernel, launch it this way:

kernel<<<blocks,threads,numbytes_for_shared>>>(...);

If you have multiple extern declarations of shared memory:

extern __shared__ float As[];
// ...
extern __shared__ float Bs[];

this will lead to As pointing to the same address as Bs.

You will need to keep As and Bs inside a single 1D array:

extern __shared__ float smem[];

When calling the kernel, you should launch it with 2*BLOCK_SIZE*BLOCK_SIZE*sizeof(float) bytes of dynamic shared memory.

When indexing into As, use smem[y*BLOCK_SIZE+x]; when indexing into Bs, use smem[BLOCK_SIZE*BLOCK_SIZE+y*BLOCK_SIZE+x].
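Putting those pieces together, the host-side launch for a kernel like the one in the question might look something like this (a sketch only: block_size, hA for the height of A, and the d_* device pointers are assumed names for illustration):

// Assumes the matrix dimensions are a multiple of block_size.
dim3 threads(block_size, block_size);          // one thread per tile element
dim3 grid(wB / block_size, hA / block_size);   // one block per output tile
// Third launch parameter: dynamic shared memory in bytes (As tile + Bs tile).
size_t shmem = 2 * block_size * block_size * sizeof(float);
matrixMul<<<grid, threads, shmem>>>(d_C, d_A, d_B, wA, wB, block_size);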

Scene answered 8/2, 2012 at 9:50
Thanks for the information. I'm just having a little bit of an issue with multiplying the matrix now, but will attempt that with the information given here. – Carrell
How do I handle the for loop where A and B are being accessed with k? This is what I have: for (int k = 0; k < block_size; ++k) Csub += smem[ty*block_size+tx] * smem[block_size*block_size+ty*block_size+tx]; But all it's doing is doubling my numbers. The original was Csub += As[ty][k] * Bs[k][tx]; – Carrell
Csub += smem[ty*block_size+k] * smem[block_size*block_size+k*block_size+tx]; – Scene
@Dan: there is a much simpler and more efficient way to do this. I will edit my answer to include an example illustrating how it can be done. – Goring
@Scene this works for up to a 4x4 matrix, but once I go past that the matrix C results are all messed up; the values are all 0s or mixed with 0s. – Carrell
@Scene I've updated my function code to reflect the changes I've made. – Carrell
@Dan: there is no problem with the kernel. I suspect that you are launching the kernel in the wrong way. The matrix dimensions need to be a multiple of block_size. Also, I had a mistake in the size of the shared memory: it should be the number of bytes (2*block_size*block_size*sizeof(float)). – Scene
@Scene thank you, that solved my issue. It created another one, but I am looking into it; I may post another question regarding that one. Thanks to you and talonmies for helping me. – Carrell

You have two choices for declaring shared memory inside a kernel - static or dynamic. I presume what you are doing at the moment looks something like this:

#define BLOCK_SIZE (16)

__global__ void sgemm0(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

}

and you would like to be able to easily change BLOCK_SIZE.

One possibility is to continue to use static shared memory allocation, but make the allocation size a template parameter, like this:

template<int blocksize=16>
__global__ void sgemm1(const float *A, const float *B, float *C)
{
    __shared__ float As[blocksize][blocksize];

}
template void sgemm1<16>(const float *, const float *, float *C);

Then you can instantiate as many different block size variants at compile time as you need.
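
Because the template parameter must be a compile-time constant, a block size chosen at run time has to be mapped onto one of the instantiations you compiled. One way to bridge the gap is a small dispatch helper; this is a sketch, and launch_sgemm1 and the set of supported sizes are illustrative assumptions, not part of the question's code:

void launch_sgemm1(dim3 grid, dim3 block,
                   const float *A, const float *B, float *C,
                   int block_size)
{
    // Only block sizes instantiated at compile time can be selected here.
    switch (block_size) {
        case 8:  sgemm1<8> <<<grid, block>>>(A, B, C); break;
        case 16: sgemm1<16><<<grid, block>>>(A, B, C); break;
        case 32: sgemm1<32><<<grid, block>>>(A, B, C); break;
        default: break; // unsupported block size: report an error to the caller
    }
}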

If you want to dynamically allocate the memory, define it like this:

__global__ void sgemm2(const float *A, const float *B, float *C)
{
    extern __shared__ float As[];

} 

and then add the size of the allocation as an argument to the kernel call:

size_t blocksize = BLOCK_SIZE * BLOCK_SIZE;
sgemm2<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);

If you have multiple statically declared arrays which you wish to replace with dynamically allocated shared memory, then be aware that there is only ever one dynamic shared memory allocation per kernel, so multiple items must exist within (share) that memory segment. So if you had something like:

#define BLOCK_SIZE (16)

__global__ void sgemm0(const float *A, const float *B, float *C)
{
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

}

you could replace it with:

#define BLOCK_SIZE (16)

__global__ void sgemm3(const float *A, const float *B, float *C)
{
    extern __shared__ float buffer[];

    float *As = &buffer[0];
    float *Bs = &buffer[BLOCK_SIZE*BLOCK_SIZE];

}

and launch the kernel like this:

size_t blocksize = 2 * BLOCK_SIZE * BLOCK_SIZE;
sgemm3<<< gridDim, blockDim, sizeof(float)*blocksize >>>(....);

All are equally valid, although I personally favour the template version because it can allow other compiler optimisations, like automatic loop unrolling, that the dynamic version cannot achieve without extra work.
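
To illustrate that last point, with a compile-time tile size the compiler knows the trip count of the inner product loop and can unroll it completely; with a run-time block_size it cannot. A sketch (tile_dot is a made-up helper, not code from the question):

template<int blocksize>
__device__ float tile_dot(const float *As, const float *Bs, int tx, int ty)
{
    float sum = 0.0f;
#pragma unroll   // trip count is a compile-time constant, so this unrolls fully
    for (int k = 0; k < blocksize; ++k)
        sum += As[ty * blocksize + k] * Bs[k * blocksize + tx];
    return sum;
}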

Goring answered 8/2, 2012 at 9:42
@talonmies, that __shared__ float *As; should be extern __shared__ float As[]; as in brano's answer. You both get an upvote. :) – Pled
Bah, missed that one. Thanks, Mark. – Goring
Ok, thank you for the answers. I did the extern shared. However, it converts it to a 1D array, and the program was originally using 2D arrays. I understand that 2D arrays are arrays of arrays, so 1D arrays should work. For example, As[ty][tx] = A[a + wA * ty + tx]; I converted to As[ty*MAX_THREADS+tx] = A[a + wA * ty + tx]; where MAX_THREADS is 1023, since my max threads are 1024. But I am getting -0,0,-0,0 for a 2x2 matrix. – Carrell
MAX_THREADS should be MAX_THREADS_IN_X_DIR. Look at your example above: in your case you had an array of BLOCK_SIZE*BLOCK_SIZE. To index into a one-dimensional array you should use A[y*BLOCK_SIZE+x]. – Scene
I tried that as well. I am actually getting results, however they are incorrect. I am updating my original post to reflect the current code I am using. – Carrell
@talonmies, I've tried this method before, however I keep getting this error when I compile: matrixMul_kernel.cu(58): error: initialization with "{...}" expected for aggregate object – Carrell
@Dan: I can't help with an anonymous compile error in code I can't see. Clearly it is not coming from the code you have posted in this question. My suggestion is to accept an answer here and write a new question which clearly explains what you are trying to do. This question now has almost nothing to do with what you originally asked about. – Goring
The default template parameter approach never occurred to me. I'd like to say "Awesome!" and "Thanks!" but there's no audio. – Shirr

Sounds correct.

Generally in this case you'll need to malloc something.

There are two things here: first, C doesn't know about 2D arrays (a 2D array is just an array of arrays); second, array sizes need to be compile-time constants (or something the compiler can calculate at compile time).

If you are using C99 you can declare the array size using a parameter of the function, but C99 support is... spotty at best.
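
For reference, the C99 form uses a variably modified parameter, as in the sketch below; note this applies only to ordinary host C functions, as CUDA __shared__ declarations do not accept it:

// C99: the size parameter must come before the array that uses it.
void scale(int n, float a[n][n], float s)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            a[i][j] *= s;   // normal 2D indexing; n is a run-time value
}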

Floury answered 8/2, 2012 at 8:26
I've tried mallocing, but I don't believe you are allowed to do so in device code. – Carrell
You can't call malloc in device code... All dynamic memory has to be allocated before you enter the kernel, and the dynamic buffer needs to be allocated and copied to the device using CUDA-specific versions of malloc and memcpy. – Prefer
@Jason: actually, on Fermi GPUs, both malloc and the C++ new operator are supported, but only for allocations which will reside in global memory. You are correct in asserting that dynamically allocated shared memory must be allocated by the calling host code (in this case as part of the kernel launch syntax or via a separate API call). – Goring
@Goring: So you no longer need to use a sequence of cudaMalloc(), cudaMemcpy() and cudaFree() when allocating and freeing memory between the device and host? If so, with what version of CUDA did that change? – Prefer
@Jason, well, often that is still the best way to work, but threads can allocate their own global memory from a runtime heap if desired. Kernel malloc support was introduced in CUDA 3.1, and the new operator was added in CUDA 4.0. It is only supported on compute capability 2.0 and 2.1 devices at the moment, and the performance isn't particularly great, but it is supported. – Goring
