gpu-shared-memory Questions
1
Solved
From the CUDA Programming Guide:
[Warp shuffle functions] exchange a variable between threads within a warp.
I understand that this is an alternative to shared memory, thus it's being used for th...
Conti asked 27/4, 2023 at 18:54
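A minimal sketch of such an exchange: a single-warp sum that uses __shfl_down_sync instead of staging partial results in shared memory (the kernel name and one-warp launch are illustrative):

__global__ void warpReduceSum(const int *in, int *out) {
    int v = in[threadIdx.x];
    // Each step pulls a value from the lane `offset` positions higher;
    // the mask 0xffffffff means all 32 lanes of the warp participate.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (threadIdx.x == 0)
        *out = v;   // lane 0 now holds the sum of all 32 inputs
}

Launched as warpReduceSum<<<1, 32>>>(d_in, d_out), this needs no shared memory and no __syncthreads.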
3
Solved
I am currently studying CUDA and learned that there are global memory and shared memory.
I have checked the CUDA document and found that GPUs can access shared memory and global memory using ld.sha...
Carnay asked 15/11, 2022 at 5:5
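A small sketch that exercises both kinds of access; the PTX mnemonics in the comments are what nvcc typically emits (inspect with nvcc -ptx to confirm):

__global__ void twoLoads(const float *g, float *out) {
    __shared__ float s[32];
    s[threadIdx.x] = g[threadIdx.x];   // ld.global.f32 then st.shared.f32
    __syncthreads();
    // Reading a neighbour's slot forces a genuine shared-memory load:
    out[threadIdx.x] = s[(threadIdx.x + 1) % 32];   // ld.shared.f32
}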
2
Solved
Under what circumstances should you use the volatile keyword with a CUDA kernel's shared memory? I understand that volatile tells the compiler never to cache any values, but my question is about th...
Marthena asked 11/3, 2013 at 4:2
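The canonical case is warp-synchronous code on pre-Volta GPUs, e.g. the last warp of a reduction; a sketch adapted from that well-known pattern, not from this question's code:

__device__ void warpReduce(volatile int *sdata, int tid) {
    // Without volatile the compiler may keep sdata[tid] in a register
    // across these lines, so other lanes would read stale values.
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

On Volta and later, independent thread scheduling means this pattern additionally needs __syncwarp() between steps; volatile alone is no longer sufficient.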
3
Solved
After Compute Capability 2.0 (Fermi) was released, I've wondered if there are any use cases left for shared memory. That is, when is it better to use shared memory than just let L1 perform its magi...
Plasmolysis asked 30/6, 2012 at 16:31
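One case where L1 cannot substitute for shared memory is a layout change such as matrix transpose: the tile is written row-wise and read column-wise, turning strided global accesses into coalesced ones. A sketch, assuming n is a multiple of 32 and 32x32 thread blocks:

#define TILE 32
__global__ void transpose(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column pads away bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;   // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}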
1
Solved
I am unable to use more than 48K of shared memory (on a V100, CUDA 10.2)
I call
cudaFuncSetAttribute(my_kernel,
                    cudaFuncAttributePreferredSharedMemoryCarveout,
                    cudaSharedmemCarveoutMaxShared);
bef...
Beria asked 5/9, 2020 at 18:29
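For the record, the preferred-carveout attribute is only a hint; opting a kernel into more than 48 KiB requires cudaFuncAttributeMaxDynamicSharedMemorySize, and the memory must be dynamic (extern). A minimal sketch:

#include <cuda_runtime.h>

__global__ void my_kernel() {
    extern __shared__ float buf[];   // dynamic shared memory only
    buf[threadIdx.x] = threadIdx.x;
}

int main() {
    int bytes = 64 * 1024;   // 64 KiB, above the default 48 KiB cap
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         bytes);
    my_kernel<<<1, 256, bytes>>>();
    return cudaDeviceSynchronize();
}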
1
Solved
I am following up on this original post: PyCuda code to invert a high number of 3x3 matrixes.
The code suggested as an answer is:
$ cat t14.py
import numpy as np
import pycuda.driver as cuda
from pyc...
Ovalle asked 26/11, 2019 at 17:17
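That post's answer is PyCuda; for orientation, here is a hedged CUDA C++ sketch of the simplest batch approach, one thread per matrix via the adjugate (names and layout are illustrative, not the post's code):

__global__ void inv3x3(const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const double *m = in + 9 * i;    // row-major 3x3 matrix i
    double *r = out + 9 * i;
    double det = m[0]*(m[4]*m[8]-m[5]*m[7])
               - m[1]*(m[3]*m[8]-m[5]*m[6])
               + m[2]*(m[3]*m[7]-m[4]*m[6]);
    double d = 1.0 / det;            // assumes the matrix is invertible
    r[0] =  (m[4]*m[8]-m[5]*m[7])*d;  r[1] = -(m[1]*m[8]-m[2]*m[7])*d;
    r[2] =  (m[1]*m[5]-m[2]*m[4])*d;  r[3] = -(m[3]*m[8]-m[5]*m[6])*d;
    r[4] =  (m[0]*m[8]-m[2]*m[6])*d;  r[5] = -(m[0]*m[5]-m[2]*m[3])*d;
    r[6] =  (m[3]*m[7]-m[4]*m[6])*d;  r[7] = -(m[0]*m[7]-m[1]*m[6])*d;
    r[8] =  (m[0]*m[4]-m[1]*m[3])*d;
}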
4
Solved
I'm currently working with CUDA programming and I'm trying to learn from the slides of a workshop I found online, available here. The problem I am having is on slide 48. The following code...
Neurologist asked 29/10, 2014 at 7:42
2
Solved
I want to call different instantiations of a templated CUDA kernel with dynamically allocated shared memory in one program. My first naive approach was to write:
template<typename T>
__global...
Chari asked 19/12, 2014 at 16:58
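The naive extern __shared__ T arr[]; inside a template fails because every instantiation redeclares the same symbol with a different type. The usual workaround routes all types through one untyped buffer; a sketch:

template <typename T>
__device__ T *shared_mem_proxy() {
    extern __shared__ unsigned char smem[];   // one symbol, one type
    return reinterpret_cast<T *>(smem);
}

template <typename T>
__global__ void kern(int n) {
    T *buf = shared_mem_proxy<T>();
    if (threadIdx.x < n) buf[threadIdx.x] = T();
}

Each instantiation is then launched as usual, e.g. kern<float><<<g, b, n * sizeof(float)>>>(n).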
5
Solved
I am trying to allocate shared memory using a constant parameter but am getting an error. My kernel looks like this:
__global__ void Kernel(const int count)
{
    __shared__ int a[count];
}
and I a...
Otology asked 3/4, 2011 at 17:34
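A __shared__ array needs a compile-time constant size; a const parameter is still a runtime value. The standard fix is dynamic shared memory, sized at launch; a sketch:

__global__ void Kernel(const int count) {
    extern __shared__ int a[];   // size supplied by the launch
    if (threadIdx.x < count)
        a[threadIdx.x] = threadIdx.x;
}

// launch: Kernel<<<grid, block, count * sizeof(int)>>>(count);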
2
Solved
I am trying to understand how bank conflicts take place.
I have an array of size 256 in global memory and I have 256 threads in a single block, and I want to copy the array to shared memory. Theref...
Weisshorn asked 9/12, 2010 at 8:22
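For orientation, the one-word-per-thread copy is the conflict-free case; a sketch with the usual contrast in the comments (on modern GPUs word i lives in bank i % 32; older compute 1.x hardware used 16 banks per half-warp):

__global__ void copy256(const int *g, int *out) {
    __shared__ int s[256];
    // Consecutive threads of a warp hit 32 distinct banks: no conflict.
    s[threadIdx.x] = g[threadIdx.x];
    __syncthreads();
    // By contrast, an access like s[threadIdx.x * 2] would map pairs
    // of lanes to the same bank: a 2-way conflict.
    out[threadIdx.x] = s[threadIdx.x];
}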
2
Solved
The __shared__ memory in CUDA seems to require a known size at compile time. However, in my problem, the __shared__ memory size is only known at run time, i.e.
int size=get_size();
__shared__ mem[s...
Nude asked 30/3, 2012 at 2:51
1
Solved
I am trying to allocate shared memory in a CUDA kernel within a templated class:
template<typename T, int Size>
struct SharedArray {
    __device__ T* operator()(){
        __shared__ T x[Size];
        ret...
Trondheim asked 26/10, 2015 at 12:13
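The subtlety here is that a __shared__ variable inside a templated device function is a per-instantiation static, so two SharedArray<int, 64> objects hand back the same storage. One common workaround (a sketch, not the accepted answer's code) adds a discriminator parameter that forces distinct instantiations:

template <typename T, int Size, int Id>   // Id distinguishes allocations
struct SharedArray {
    __device__ T *operator()() {
        __shared__ T x[Size];   // one allocation per (T, Size, Id) triple
        return x;
    }
};

__global__ void demo() {
    int *a = SharedArray<int, 64, 0>()();
    int *b = SharedArray<int, 64, 1>()();   // distinct storage from a
    if (threadIdx.x == 0) { a[0] = 1; b[0] = 2; }
}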
1
Solved
I'm not an experienced CUDA programmer, and I've run into the following problem.
I'm trying to load a 32x32 tile of a large 10Kx10K matrix from global memory into shared memory and I'm timing it while it happens...
Mcmillan asked 13/8, 2015 at 10:38
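For reference, the standard way to load such a tile so that each warp issues fully coalesced reads; a sketch assuming a 32x32 thread block and a row pitch given in elements:

__global__ void loadTile(const float *m, float *sink, int pitch) {
    __shared__ float tile[32][32];
    // Consecutive threadIdx.x values read consecutive addresses, so
    // each tile row is fetched in a single coalesced transaction.
    tile[threadIdx.y][threadIdx.x] =
        m[(blockIdx.y * 32 + threadIdx.y) * pitch
          + blockIdx.x * 32 + threadIdx.x];
    __syncthreads();
    // Write something back so the compiler cannot optimize the load away.
    sink[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.y][threadIdx.x];
}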
3
Solved
Consider the following code:
__global__ void kernel(int *something) {
    extern __shared__ int shared_array[];
    // Some operations on shared_array here.
}
Is it possible to initialize the whole sh...
Conglomeration asked 25/6, 2011 at 13:42
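There is no bulk initializer for shared memory; the usual answer is to have each thread clear its own slice and then synchronize. A sketch, with n_shared standing in for the element count passed at launch:

__global__ void kernel(int *something, int n_shared) {
    extern __shared__ int shared_array[];
    for (int i = threadIdx.x; i < n_shared; i += blockDim.x)
        shared_array[i] = 0;     // each thread zeroes a strided slice
    __syncthreads();             // all zeroes visible block-wide
    // Some operations on shared_array here.
}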
1
Solved
In CUDA C++ it's straightforward to define a shared memory of size specified at runtime. How can I do this with Numba/NumbaPro CUDA?
What I've done so far has only resulted in errors with the messa...
Mathers asked 28/5, 2015 at 15:14
1
Solved
I'm trying to do a parallel reduction to sum an array in CUDA. Currently I pass an array in which to store the sum of the elements in each block. This is my code:
#include <cstdlib>
#include...
Elmerelmina asked 1/12, 2014 at 14:33
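For orientation, the usual shape of such a kernel: a tree reduction in shared memory that leaves one partial sum per block (a sketch, assuming blockDim.x is a power of two):

__global__ void blockSum(const float *in, float *blockResults, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockResults[blockIdx.x] = sdata[0];
}

Launched as blockSum<<<blocks, threads, threads * sizeof(float)>>>(...); the per-block results are then summed on the host or in a second pass.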
1
Solved
If I have 48 kB of shared memory per SM and I write a kernel that allocates 32 kB of shared memory, does that mean that only 1 block can be running on one SM at the same time?
Prosaic asked 24/9, 2014 at 21:18
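The arithmetic is floor(48 kB / 32 kB) = 1 resident block, provided shared memory is the binding limit; the occupancy API can confirm it. A sketch:

#include <cstdio>

__global__ void my_kernel() {
    __shared__ char buf[32 * 1024];   // 32 kB of static shared memory
    buf[threadIdx.x] = 0;
}

int main() {
    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, my_kernel, 256, 0);
    printf("resident blocks per SM: %d\n", numBlocks);   // 1 on a 48 kB SM
    return 0;
}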
1
I'm trying to implement a median filter with an x*y window, where x and y are odd and are parameters of the program.
My idea is to first see how many threads I can execute in a single block and how much shared me...
Banas asked 1/6, 2013 at 19:34
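Those per-device limits can be read at run time rather than guessed; a sketch of the query, with the tile-size arithmetic for an x*y window in the comment:

#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    // A blockDim.x x blockDim.y tile filtered with an x*y window needs
    // (blockDim.x + x - 1) * (blockDim.y + y - 1) * sizeof(pixel) bytes.
    return 0;
}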
1
Solved
There are similar questions to what I'm about to ask, but I feel like none of them get at the heart of what I'm really looking for. What I have now is a CUDA method that requires defining two array...
Bainbrudge asked 24/7, 2014 at 19:13
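The usual answer is a single dynamic allocation carved into pieces; a sketch with two arrays (put the most strictly aligned element type first if the types differ in size):

__global__ void kernel(int nFloats, int nInts) {
    extern __shared__ char smem[];
    float *a = reinterpret_cast<float *>(smem);
    int *b = reinterpret_cast<int *>(smem + nFloats * sizeof(float));
    if (threadIdx.x < nFloats) a[threadIdx.x] = 0.0f;
    if (threadIdx.x < nInts)   b[threadIdx.x] = 0;
}

// launch: kernel<<<grid, block,
//                  nFloats * sizeof(float) + nInts * sizeof(int)>>>(nFloats, nInts);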
2
Solved
I have a piece of CUDA code in which threads are performing atomic operations on shared memory. I was thinking that since the result of an atomic operation will be visible to other threads of the block ins...
Juanajuanita asked 13/4, 2014 at 16:55
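Atomics guarantee atomicity, not that all updates have completed by a given point in the program, so the barrier is still needed before reading the totals. A sketch in the shape of a per-block histogram:

__global__ void histo(const unsigned char *in, int n, unsigned int *out) {
    __shared__ unsigned int bins[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x) bins[i] = 0;
    __syncthreads();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&bins[in[i]], 1u);
    __syncthreads();   // without this, some bins may still be mid-update
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&out[i], bins[i]);
}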
3
The size of the shared memory ("local memory" in OpenCL terms) is only 16 KiB on most NVIDIA GPUs today.
I have an application in which I need to create an array of 10,000 integers. So the...
Pongee asked 13/2, 2011 at 11:4
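10,000 ints is about 39 KiB, well over a 16 KiB budget, so the array has to be streamed through shared memory in chunks. A sketch of the tiling idea, with TILE chosen so one chunk fits:

#define TILE 2048   // 2048 ints = 8 KiB, comfortably under 16 KiB

__global__ void processLarge(const int *g, int n) {
    __shared__ int tile[TILE];
    for (int base = 0; base < n; base += TILE) {
        for (int j = threadIdx.x; j < TILE && base + j < n; j += blockDim.x)
            tile[j] = g[base + j];
        __syncthreads();
        // ... operate on tile[] here ...
        __syncthreads();   // finish before the next chunk overwrites it
    }
}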
1
Solved
I'm currently trying to adapt the 2D convolution code from THIS question to 3D and having trouble trying to understand where my error is.
My 2D Code looks like this:
#include <iostream>
#d...
Morpheus asked 22/3, 2014 at 12:53
1
Solved
As was mentioned in this Shared Memory Array Default Value question, shared memory is uninitialized, i.e. it can contain any value.
#include <stdio.h>
#define BLOCK_SIZE 512
__global__ void ...
Implied asked 4/3, 2014 at 13:10
1
I just learned (from Why only one of the warps is executed by a SM in cuda?) that Kepler GPUs can actually execute instructions from several (apparently 4) warps at once.
Can a shared memory bank...
Sanhedrin asked 15/2, 2014 at 19:22
1
Solved
I am having some difficulty understanding the batch loading that the comments refer to. In order to compute the convolution at a pixel, the mask, whose size is 5, must be centered on this sp...
Archibald asked 27/1, 2014 at 12:10
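"Batch loading" here means each thread stages its own input element into the shared tile, with a few threads also fetching the halo cells the 5-wide mask needs at the tile edges. A 1D sketch, assuming blockDim.x == TILE and the mask in constant memory:

#define MASK_WIDTH 5
#define TILE 256
__constant__ float mask[MASK_WIDTH];

__global__ void conv1d(const float *in, float *out, int n) {
    __shared__ float tile[TILE + MASK_WIDTH - 1];
    const int halo = MASK_WIDTH / 2;
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x + halo] = (g < n) ? in[g] : 0.0f;   // interior element
    if (threadIdx.x < halo) {
        int left = g - halo;                             // left halo cell
        tile[threadIdx.x] = (left >= 0) ? in[left] : 0.0f;
        int right = g + blockDim.x;                      // right halo cell
        tile[threadIdx.x + halo + blockDim.x] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();
    if (g < n) {
        float acc = 0.0f;
        for (int k = 0; k < MASK_WIDTH; ++k)
            acc += tile[threadIdx.x + k] * mask[k];
        out[g] = acc;
    }
}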