Is there a way of setting default value for shared memory array?

Consider the following code:

__global__ void kernel(int *something) {
    extern __shared__ int shared_array[];     

    // Some operations on shared_array here.
}

Is it possible to initialize the whole shared_array to some value - e.g. 0 - without explicitly addressing each cell in some thread?

Conglomeration answered 25/6, 2011 at 13:42 Comment(0)

No. Shared memory is uninitialised. You have to initialise it yourself, one way or another...

From CUDA C Programming Guide 3.2, Section B.2.4.2, paragraph 2:

__shared__ variables cannot have an initialization as part of their declaration.

This also rules out nontrivial default constructors for shared variables.
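
For illustration, a minimal sketch of what this rules out and the manual alternative (the exact compiler diagnostic varies by nvcc version):

__global__ void kernel()
{
    // __shared__ int a = 0;  // would not compile: no initialiser allowed on __shared__
    __shared__ int b;         // OK: declared, but starts out uninitialised

    if (threadIdx.x == 0)     // initialise manually instead, e.g. from one thread
        b = 0;
    __syncthreads();          // make the store visible to the whole block
}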

Sparge answered 25/6, 2011 at 14:10 Comment(0)

You can efficiently initialize shared arrays in parallel like this:

// if SHARED_SIZE == blockDim.x, eliminate this loop
for (int i = threadIdx.x; i < SHARED_SIZE; i += blockDim.x) 
    shared_array[i] = INITIAL_VALUE;
__syncthreads();
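
In context, a minimal sketch of the full pattern (the kernel name, n, and the launch configuration are placeholders, assuming dynamically allocated shared memory):

__global__ void kernel(int n)
{
    extern __shared__ int shared_array[]; // size set by the third <<<>>> launch argument

    // Strided loop: thread i initialises elements i, i + blockDim.x, i + 2*blockDim.x, ...
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        shared_array[i] = 0;
    __syncthreads(); // every element is initialised before any thread reads one

    // ... operate on shared_array ...
}

// Launch: kernel<<<blocks, threads, n * sizeof(int)>>>(n);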
Quenchless answered 27/6, 2011 at 1:16 Comment(5)
That's only the case for a 1D block, of course; saying so just so any newbies don't fall into obvious traps. I also wonder how much of a boost float4, which is another trick, still gives on newer devices, and how much benefit it provides combined with this coalesced style of init. Side note: if you're loading inside a 2D or 3D kernel, it's important to know that threads are partitioned into warps as if laid out in memory as an array of [z][y][x], so let threads that differ in [x] write closest to each other and those that differ in [z] furthest apart.Celeriac
So I've tried it, and yes, using reinterpret_cast<float4*> to copy in 16-byte chunks as in devblogs.nvidia.com/parallelforall/… still gets slightly better results. It's also important to mark the data you want to copy with __restrict__: an easy +15% performance boost, about as much as the 16-byte aligned chunks give.Celeriac
Mind that memory coalescing can easily make a 16x speed difference.Celeriac
@ÍhorMé Not sure why you say this can't work for a 2D or 3D block. You need to do the addressing appropriately, but the general approach works regardless of block dimension.Quenchless
Yes, I'm talking about the exact code snippet. The basic idea will work, but in order to implement it with 2D/3D blocks one also has to know how blocks are partitioned into warps to make sure that memory will coalesce.Celeriac
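
A sketch of the float4 variant mentioned in these comments (my own illustration; it assumes the element count n is a multiple of 4 and relies on dynamically allocated shared memory being at least 16-byte aligned):

__global__ void kernel(int n)
{
    extern __shared__ float shared_array[];

    // Reinterpret as float4 so each thread stores 16 bytes per iteration.
    float4 *shared_vec = reinterpret_cast<float4 *>(shared_array);
    for (int i = threadIdx.x; i < n / 4; i += blockDim.x)
        shared_vec[i] = make_float4(0.f, 0.f, 0.f, 0.f);
    __syncthreads();

    // ... operate on shared_array ...
}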

Yes, you can. You can have the first thread in the block set it while the others don't, e.g.:

extern __shared__ unsigned int local_bin[]; // Size specified in kernel call

if (threadIdx.x == 0) // Wipe on first thread - add "&& threadIdx.y == 0" and "&& threadIdx.z == 0" if the thread block has 2 or 3 dimensions instead of 1.
{
    // Set every element to the initial value. bin_size must be passed in as a
    // kernel parameter - note you cannot use cudaMemset here, as it is a host
    // API and cannot be called from device code.
    for (unsigned int i = 0; i < bin_size; ++i)
        local_bin[i] = 0;
}

// Do stuff unrelated to local_bin here

__syncthreads(); // Make sure the wipe above has completed before other threads start writing values to local_bin.

// Do stuff to local_bin here

Ideally you should do as much work as possible before the __syncthreads() call, as this allows all the other threads to get on with their work before the wipe completes - obviously this only matters if the threads' workloads can have quite different completion times, for example if there is conditional branching. Note that for thread 0's initialisation loop, you need to pass the size of the local_bin array as a parameter to the kernel so you know how many elements to iterate over.
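
For instance, a minimal sketch of how the size gets passed (bin_size and the launch parameters are placeholder names, not from the original answer):

__global__ void kernel(unsigned int bin_size)
{
    extern __shared__ unsigned int local_bin[];

    if (threadIdx.x == 0)
        for (unsigned int i = 0; i < bin_size; ++i)
            local_bin[i] = 0;

    __syncthreads();

    // ... work with local_bin ...
}

// Host side: the third launch-configuration argument sets local_bin's size in
// bytes, while bin_size tells the kernel how many elements that is:
// kernel<<<blocks, threads, bin_size * sizeof(unsigned int)>>>(bin_size);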

Original concept source

Unstop answered 17/7, 2015 at 0:14 Comment(3)
Thank you. I used something similar to this in my final implementation.Mountie
This loses the benefits of parallelisation; always try to use threadIdx and blockIdx as much as possible.Kizzykjersti
I think you're right in general, but it depends on whether the initialisation value can be calculated from those, or whether it follows a more complex assignment pattern.Unstop
