GPU Shared Memory Bank Conflict

Asked 9/12, 2010 at 8:22 Answered 18/4, 2017 at 10:21

Solved c++cuda gpgpu gpu-shared-memory bank-conflict

I am trying to understand how bank conflicts take place.
I have an array of size 256 in global memory and I have 256 threads in a single block, and I want to copy the array to shared memory. Therefore every thread copies one element.

shared_a[threadIdx.x]=global_a[threadIdx.x]

Does this simple action result in a bank conflict?

Suppose now that the size of the array is larger than the number of threads, so I am now using this to copy the global memory to the shared memory:

tid = threadIdx.x;
for(int i=0;tid+i<N;i+=blockDim.x)
     shared_a[tid+i]=global_a[tid+i];

Does the above code result in a bank conflict?

Weisshorn answered 9/12, 2010 at 8:22 Comment(0)

The best way to check this would be to profile your code using the "Compute Visual Profiler"; this comes with the CUDA Toolkit. Also there's a great section in GPU Gems 3 on this - "39.2.3 Avoiding Bank Conflicts".

"When multiple threads in the same warp access the same bank, a bank conflict occurs unless all threads of the warp access the same address within the same 32-bit word" - First thing there are 16 memory banks each 4bytes wide. So essentially, if you have any thread in a half warp reading memory from the same 4bytes in a shared memory bank, you're going to have bank conflicts and serialization etc.

OK so your first example:

First lets assume your arrays are say for example of the type int (a 32-bit word). Your code saves these ints into shared memory, across any half warp the Kth thread is saving to the Kth memory bank. So for example thread 0 of the first half warp will save to shared_a[0] which is in the first memory bank, thread 1 will save to shared_a[1], each half warp has 16 threads these map to the 16 4byte banks. In the next half warp, the first thread will now save its value into shared_a[16] which is in the first memory bank again. So if you use a 4byte word such int, float etc then your first example will not result in a bank conflict. If you use a 1 byte word such as char, in the first half warp threads 0, 1, 2 and 3 will all save their values to the first bank of shared memory which will cause a bank conflict.

Second example:

Again this will all depend on the size of the word you are using, but for the example I'll use a 4byte word. So looking at the first half warp:

Number of threads = 32

N = 64

Thread 0: Will write to 0, 31, 63 Thread 1: Will write to 1, 32

All threads across the half warp execute concurrently so the writes to shared memory shouldn't cause bank conflicts. I'll have to double check this one though.

Hope this helps, sorry for the huge reply!

Locksmith answered 9/12, 2010 at 9:46 Comment(9)

actually for the second part, thread 0 will write to 0,32 and thread 1 will write to 1,33 and so on..... until the last thread 31 writes to 31,63. but thanks for the 1st part of ur post.It was very informative – Weisshorn 9/12, 2010 at 10:2

Edited to reflect your comment, does this answer your question? – Locksmith 9/12, 2010 at 10:31

Note that on sm_20 and later devices there are 32 banks, and accesses must be considered per-warp instead of per-half-warp. – Storyteller 9/12, 2010 at 16:22

Regarding the second part of the question, you're correct there are no bank conflicts since thread 0 writes to 0, 32, 64 and thread 1 to 1, 33, 65 and so on (slight fix from your answer). This is commonly written as for (int i = tid ; i < N ; i += blockDim.x) shared_a[i] = global_a[i]; – Storyteller 9/12, 2010 at 16:25

So what happens exactly when a warp of 32 threads (talking for sm_20 and above cards) tries to write a 32 integer array (2 bytes per int = half word) into shared memory? Is this gonna lead to bank conflicts (thus serialization) ? – Hearttoheart 9/7, 2014 at 14:18

Sorry for commenting on an old question. I happend to come across this question by a link and it confused me about the case with 1 byte element. I think this answer needs some update since for recent GPUs, 1 byte consecutive access to shared memory does not result in bank conflict. I posted an answer. Thanks for reading! – Anoxemia 1/6, 2017 at 16:17

@Madhatter It depends on whether each thread's access is distributed to each bank or not. In the case of consecutive access to shared memory, it will not cause a bank conflict. – Anoxemia 2/6, 2017 at 1:45

What does it means: In the next half warp, the first thread will now save its value into shared_a[16] which is in the first memory bank again ? Why the shared_a[16] is in the first memory bank again? This seems very confusing. – Sedative 2/8, 2022 at 8:1

The new link to the "avoiding bank conflicts" section: developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/… – Troostite 31/10, 2023 at 22:10

In both cases threads access shared memory with consecutive address. It depends on the element size of shared memory, but consecutive access to shared memory by a warp of threads does not result in a bank conflict for "small" element sizes.

Profiling this code with NVIDIA Visual Profiler shows that for element size smaller than 32 and a multiple of 4 (4, 8, 12, ... , 28), consecutive access to the shared memory does not result in a bank conflict. Element size of 32, however, results in bank conflict.

Answer by Ljdawson contains some outdated information:

... If you use a 1 byte word such as char, in the first half warp threads 0, 1, 2 and 3 will all save their values to the first bank of shared memory which will cause a bank conflict.

This may be true for old GPUs, but for recent GPUs with cc >= 2.x, they don't cause bank conflicts, effectively due to the broadcast mechanism(link). Following quote is from CUDA C PROGRAMMING GUIDE (v8.0.61) G3.3. Shared Memory.

A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads (multiple words can be broadcast in a single transaction) and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).

This means, in particular, that there are no bank conflicts if an array of char is accessed as follows, for example:
   extern __shared__ char shared[];
   char data = shared[BaseIndex + tid];

Anoxemia answered 18/4, 2017 at 10:21 Comment(0)

Recommended topics

Hot tags