Yet Another CUDA Texture Memory Thread. (Why should texture memory be faster on Fermi?)
There are quite a few Stack Overflow threads asking why a kernel using textures is no faster than one using global memory access. The answers and comments have always seemed a little esoteric to me.

The NVIDIA white paper on the Fermi architecture states black on white:

The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and unified L2 cache that services all operations (load, store and texture).

So why on earth should one expect any speed-up from using texture memory on Fermi devices, since every memory fetch (regardless of whether it's bound to a texture or not) goes through the same L2 cache? Actually, in most cases direct access to global memory should be faster, since it is also cached through L1, which a texture fetch is not. This is also reported in a few related questions here on Stack Overflow.

Can someone confirm this or show me what I'm missing?

Putout answered 13/9, 2014 at 8:3 Comment(2)
There is a texture cache on each Streaming Multiprocessor. This cache can better address data locality for 2D accesses, for example, for stencil calculations in finite difference approaches. Texture memory is indeed somewhat faster than global memory accesses, which are cached as well but with a different mechanism. For some timings, see my answer to Is 1D texture memory access faster than 1D global memory access?.Duley
@JackOLantern do you want to provide an answer? I would upvote. Any access speed-up from texturing comes about as a result of the texture cache, which OP seems to be ignoring.Overexcite
You are neglecting that each Streaming Multiprocessor has a texture cache (see the picture below illustrating a Streaming Multiprocessor for Fermi).

[Figure: Fermi Streaming Multiprocessor, showing the per-SM texture cache]

The texture cache serves a different purpose than the L1/L2 caches: it is optimized for data locality. Data locality applies whenever data at semantically (not physically) neighboring points of a regular Cartesian 1D, 2D, or 3D grid must be accessed. To better explain this concept, consider the following figure illustrating the stencil involved in 2D or 3D finite difference calculations.

[Figure: finite difference stencil, a red center point surrounded by blue neighboring points]

Calculating finite differences at the red point involves accessing the data associated with the blue points. Now, these data are not physical neighbors of the red point, since they will not be stored consecutively in global memory when the 2D or 3D array is flattened to 1D. However, they are semantic neighbors of the red point, and texture memory is very good at caching these values. On the other hand, the L1/L2 caches are good when the same datum or its physical neighbors must be frequently accessed.
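The access pattern described above can be sketched in Fermi-era CUDA using the legacy texture reference API; the kernel and texture names here are illustrative, not taken from the thread:

```cuda
#include <cuda_runtime.h>

// Texture reference bound (on the host) to the 2D input grid.
texture<float, cudaTextureType2D, cudaReadModeElementType> texIn;

// 5-point Laplacian stencil: the "red point" is (i, j), the "blue points"
// are its four grid neighbors.
__global__ void laplacian2D(float *out, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;

    // (i±1, j) are adjacent in the flattened array, but (i, j±1) sit a
    // full row apart in linear memory. The texture cache is organized to
    // capture exactly this 2D locality.
    float c = tex2D(texIn, i,     j);
    float n = tex2D(texIn, i,     j + 1);
    float s = tex2D(texIn, i,     j - 1);
    float e = tex2D(texIn, i + 1, j);
    float w = tex2D(texIn, i - 1, j);

    out[j * nx + i] = n + s + e + w - 4.0f * c;
}
```

A plain global-memory version of this kernel would issue the same five reads as `in[j * nx + i]` etc.; the two `j ± 1` reads are the ones that defeat row-oriented caching and benefit most from the texture cache.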

The other side of the coin is that the texture cache has a higher latency than the L1/L2 caches, so in some cases not using texture may not lead to a significant worsening of performance, simply thanks to the L1/L2 caching mechanism. From this point of view, texture was of top importance in the early CUDA architectures, when global memory reads were not cached. But, as demonstrated in Is 1D texture memory access faster than 1D global memory access?, texture memory on Fermi is still worth using.

Duley answered 13/9, 2014 at 20:43 Comment(3)
Thank you very much for your superb answer! Can you perhaps point me to where I can find such information? Prior to posting, I read the whitepaper on the Fermi architecture; there is no mention of the dedicated hardware texture cache, and the picture of the SM is cut off directly under 'Uniform Cache'. In the programming guide there is a fairly large section on how to use textures, but there is nothing about a dedicated hardware cache on Fermi cards. This, together with the phrase 'and unified L2 cache that services all operations (load, store AND TEXTURE)', led me to the obviously very wrong conclusion.Putout
OK, I read the linked whitepaper again and I must admit that I'm again not sure whether a hardware texture cache really exists on Fermi. Reading the whitepaper, it seems like textures use the L2 cache but with different memory access schemes. It would be nice if someone would comment on this. Please forgive me for not closing this question yet! Also, JackOLantern's answer is very good, but it does not address the linked whitepaper and the question of whether there is a true separate hardware cache on Fermi.Putout
Please excuse me for not believing. The bit of information that made me believe can be found in the current programming guide (6.5) on page 181.Putout
If the data being read via texture is 2D or 3D, the block linear layout of CUDA arrays usually gives better reuse than pitch-linear layouts, because cache lines contain 2D or 3D blocks of data instead of rows.
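As a sketch of what the block linear layout means in practice, the host-side setup below copies 2D data into a CUDA array and binds a texture to it; all names are illustrative, and error checking is omitted:

```cuda
#include <cuda_runtime.h>

// Texture reference fetched in kernels with tex2D(texArr, x, y).
texture<float, cudaTextureType2D, cudaReadModeElementType> texArr;

// Copy a 2D host grid into a CUDA array (block-linear layout) and bind
// the texture to it. Unlike a pitch-linear allocation, a cache line of a
// CUDA array covers a small 2D tile of the grid rather than part of a row.
void bindArray(const float *hostData, int nx, int ny)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, nx, ny);
    cudaMemcpy2DToArray(arr, 0, 0, hostData,
                        nx * sizeof(float),          // source pitch
                        nx * sizeof(float), ny,      // width (bytes), height
                        cudaMemcpyHostToDevice);

    cudaBindTextureToArray(texArr, arr, desc);
}
```

Textures can also be bound to ordinary pitch-linear global memory, but then the layout advantage described above is lost; only the texture cache itself still helps.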

But even for 1D data, it's possible for the texture cache to complement the other on-chip cache resources. If the kernel is only using global memory accesses with no texture loads, all of that memory traffic goes through the per-SM L1 cache. If some of the kernel's input data gets read through texture, the per-SM texture cache will relieve some pressure from the L1 and enable it to service memory traffic that would otherwise go to the L2.
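A minimal sketch of this traffic-splitting idea, with illustrative names: one input stream is routed through the texture cache while the remaining loads and stores stay on the L1/L2 path.

```cuda
#include <cuda_runtime.h>

// 1D texture bound (on the host, via cudaBindTexture) to the array 'a'.
texture<float, cudaTextureType1D, cudaReadModeElementType> texA;

// out[i] = alpha * a[i] + b[i], with the reads of 'a' going through the
// per-SM texture cache and the reads of 'b' (plus the stores to 'out')
// going through L1/L2. Each cache then serves only part of the traffic.
__global__ void axpy(const float *b, float *out, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float a = tex1Dfetch(texA, i);   // texture-cache path
    out[i]  = alpha * a + b[i];      // L1/L2 path
}
```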

When making these tradeoffs, it's important to pay attention to the decisions NVIDIA has made from one chip architecture to the next. The texture caches in Maxwell are shared with the L1, which makes reading from texture less desirable.

Litman answered 13/9, 2014 at 20:45 Comment(3)
+1. I would say that perhaps, already starting with the Kepler architecture and the introduction of the read-only cache, texture is less desirable. My (limited) experience is that nowadays texture is mainly for "backward compatibility", and perhaps already with Kepler, as mentioned, the game isn't worth the candle in many cases.Duley
I'm not sure what you mean by "read-only cache texture"; the texture cache has always been read-only. The LDG instructions added in SM 3.5 are a wrapper around reads-via-texture without the need to bind texture memory, but they still go through the texture cache. I agree that if you're not using core texturing features (format conversion, bilinear or trilinear interpolation, etc.), then reading via texture is not bound to be a win.Litman
Sorry, I think I used the wrong punctuation, missing a comma between "cache" and "texture" and forgetting to mention the binding. Anyway, I meant exactly, in a convoluted way, what you have more plainly explained in your comment, and my sentence was meant to be read as: "...and the introduction of the read-only cache, texture binding [in the sense of core texturing features] is less desirable".Duley
I would not disregard the use of texture memory. See e.g. the paper 'Communication-Minimizing 2D Convolution in GPU Registers' (http://parlab.eecs.berkeley.edu/publication/899), which compares different implementations of small 2D convolutions; according to the authors, the strategy of loading from texture memory directly into registers is a good one.

Glary answered 15/9, 2014 at 12:26 Comment(0)