efficient GPU random memory access with OpenGL
What is the best pattern to get a GPU to efficiently calculate 'anti-functional' routines, which usually depend on positioned memory writes instead of reads? E.g. calculating a histogram, sorting, dividing a number by percentages, merging data of differing sizes into lists, etc.

Cavern answered 25/2, 2012 at 2:42 Comment(4)
Are you asking about OpenGL, OpenGL ES, or WebGL? Because those are three different answers to three different questions. Though technically, the answer will always be some form of, "It depends on the hardware." Thug
What would be some features, e.g. that OpenGL 3 has over WebGL, that could make them distinct for this purpose? Cavern
You mean, besides being able to render to multiple buffers? Well, there's UBOs, which allow shaders to have fast access to much more data. There are buffer textures, which allow shaders slower access, but to much more memory. There's transform feedback, which can store the vertex shader output in buffer objects, making tight loops avoid rasterization completely. But really, if you're doing compute on desktops, you should be trying to use OpenCL.Thug
OpenCL currently is not as widely available as the OpenGL flavors are. That is why I haven't decided for/against WebGL, for example: if it is usable for my purposes, I would prefer it for easy deployment. Transform feedback sounds cool, by the way; combined with a geometry shader it can compress or expand the data length depending on calculation results. That is however advanced, and not a WebGL possibility, I think. Cavern

The established terms are gather reads and scatter writes.

gather reads

This means that your program writes to a fixed position (like the target fragment position in a fragment shader), but has fast access to arbitrary data sources (textures, uniforms, etc.).

scatter writes

This means that a program receives a stream of input data which it cannot arbitrarily address, but can do fast writes to arbitrary memory locations.

Clearly the shader architecture of OpenGL is a gather system. The latest OpenGL 4 also allows some scatter writes in the fragment shader, but they're slow.
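As an illustration (my own minimal GLSL sketch, not from the answer; `u_src` is an illustrative name, and edge clamping is omitted for brevity), a fragment shader is a pure gather program: the rasterizer fixes where each invocation writes, while reads can come from anywhere:

```glsl
#version 330 core

// Gather sketch: the write position (the current fragment) is fixed,
// but the shader may read arbitrary source texels.
uniform sampler2D u_src;   // hypothetical input texture
out vec4 o_color;

void main() {
    ivec2 p = ivec2(gl_FragCoord.xy);
    vec4 sum = vec4(0.0);
    // 3x3 box filter: nine arbitrary (here: neighboring) reads,
    // one fixed write to this fragment's position.
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += texelFetch(u_src, p + ivec2(dx, dy), 0);
    o_color = sum / 9.0;
}
```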

So what is the most efficient way, these days, to emulate "scattering" with OpenGL? So far it is using a vertex shader operating on pixel-sized points. You send in as many points as you have data points to process and scatter them in target memory by setting their positions accordingly. You can use geometry and tessellation shaders to generate the points processed in the vertex unit, and texture buffers or UBOs for data input, using the vertex/point index for addressing.
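The histogram from the question can be scattered this way (a hedged GLSL sketch of the point-scatter technique; `u_data` and `u_bin_count` are illustrative names, and input values are assumed normalized to [0, 1)): each input element becomes one point, the vertex shader computes its bin and positions the point over that bin's pixel.

```glsl
#version 330 core

// Vertex shader: one point per data element, scattered to its histogram bin.
uniform samplerBuffer u_data;   // input values in a texture buffer
uniform int u_bin_count;        // number of histogram bins

void main() {
    float value = texelFetch(u_data, gl_VertexID).r;   // read datum by point index
    int bin = clamp(int(value * float(u_bin_count)), 0, u_bin_count - 1);

    // Place the point over the bin's pixel in a u_bin_count x 1 framebuffer,
    // mapping pixel centers to clip space.
    float x = (float(bin) + 0.5) / float(u_bin_count) * 2.0 - 1.0;
    gl_Position = vec4(x, 0.0, 0.0, 1.0);
    gl_PointSize = 1.0;
}
```

The fragment shader then just emits 1.0; with additive blending enabled via `glBlendFunc(GL_ONE, GL_ONE)`, drawing N points with `glDrawArrays(GL_POINTS, 0, N)` accumulates the bin counts in a single pass.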

Incumbent answered 25/2, 2012 at 10:48 Comment(2)
It seems scatter is always hard to optimize: it violates cache coherency just as gather does, but additionally needs memory synchronisation. I am uncertain how a GPU sorts the primitives before drawing. All primitives touching a fragment need to be drawn in user-defined order as long as there is non-trivial alpha blending or no depth buffer. However, with opaque blending and a depth buffer, the order doesn't matter anymore. Also, non-overlapping primitives can be drawn in parallel, but would require memory synchronisation if they touch the same memory rows. Cavern
So maybe true scatter writes could be outperformed by a larger number of locally restricted operations, like gathering output values from a certain local region, e.g. a tile of a texture, and rendering it to an intermediate output. That could be repeated until every datum has reached its final memory location. This has some similarities with quicksort, for example, which may be implemented efficiently this way? Cavern

GPUs are built with multiple memory types. One type is the DDRx RAM that is accessible to both the host CPU and the GPU; in OpenCL and CUDA this is called 'global' memory. Data in global memory must be transferred between the GPU and the host. It is usually arranged in banks to allow for pipelined memory access, so random reads/writes to global memory are comparatively slow. The best way to access global memory is sequentially.
Its size ranges from 1 GB to 6 GB per device.

The next type of memory is on the GPU itself: shared memory that is available to a number of threads/warps within a compute unit/multiprocessor. This is faster than global memory but not directly accessible from the host. CUDA calls this shared memory; OpenCL calls it local memory. This is the best memory to use for random access to arrays. For CUDA there is typically 48 KB per multiprocessor; for OpenCL, 32 KB.

The third kind of memory is the GPU registers, called private memory in OpenCL or local memory in CUDA. Private memory is the fastest, but there is less of it available than local/shared memory.

The best strategy to optimize for random access to memory is to copy data between global and local/shared memory. So a GPU application will copy portions of its global memory to local/shared memory, do the work using local/shared memory, and copy the results back to global memory.

The pattern of copy to local, process using local, and copy back to global is an important one to understand in order to program GPUs well.
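Staying with OpenGL, the same tiling pattern can be sketched in a GLSL compute shader (GL 4.3+), where the `shared` qualifier maps to CUDA's shared / OpenCL's local memory tier (a hedged sketch; the SSBO layout and the in-tile reversal are illustrative, and the buffer size is assumed to be a multiple of 256):

```glsl
#version 430

layout(local_size_x = 256) in;

// Global memory: an SSBO visible to both host and GPU (binding is illustrative).
layout(std430, binding = 0) buffer Data { float g_data[]; };

// On-chip shared/local memory: one tile per work group.
shared float tile[256];

void main() {
    uint gid = gl_GlobalInvocationID.x;
    uint lid = gl_LocalInvocationID.x;

    // 1. Copy one tile from slow global memory into fast shared memory.
    tile[lid] = g_data[gid];
    barrier();   // wait until the whole tile is loaded

    // 2. Work on the tile with cheap random access: reverse it within the group.
    float v = tile[255u - lid];

    // 3. Copy the result back to global memory.
    g_data[gid] = v;
}
```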

Upkeep answered 3/3, 2012 at 3:18 Comment(1)
This answer seems to answer a different question, such as 'what kinds of memory types and caches are used by GPU programs'. Cavern
