There are quite a few Stack Overflow threads asking why a kernel using textures is not faster than one using global memory access. The answers and comments always seem a little esoteric to me.
The NVIDIA white paper on the Fermi architecture states in black and white:
The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and unified L2 cache that services all operations (load, store and texture).
So why on earth should one expect any speedup from using texture memory on Fermi devices, when every memory fetch (regardless of whether it is bound to a texture or not) goes through the same L2 cache? In fact, in most cases direct access to global memory should be faster, since it is also cached in L1, which a texture fetch is not. This is also reported in a few related questions here on Stack Overflow.
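To make the comparison concrete, here is a minimal sketch of the two access paths I mean. The kernel names, the `timeCopy` helper and the buffer size are made up for illustration. It assumes the legacy texture reference API (`texture<>`, `cudaBindTexture`, `tex1Dfetch`), which is the one usable on Fermi but which was removed in CUDA 12. It is only a plain coalesced copy, so it does not cover other access patterns.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Legacy file-scope texture reference (assumption: toolkit older than CUDA 12).
texture<float, 1, cudaReadModeElementType> texIn;

// Path 1: plain global-memory load (cached in L1 and L2 on Fermi by default).
__global__ void copyGlobal(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Path 2: the same read issued as a texture fetch (not cached in L1).
__global__ void copyTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(texIn, i);
}

// Hypothetical helper: launches one variant and returns its time in ms.
float timeCopy(bool useTexture, const float *in, float *out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int block = 256, grid = (n + block - 1) / block;

    cudaEventRecord(start);
    if (useTexture) copyTexture<<<grid, block>>>(out, n);
    else            copyGlobal<<<grid, block>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 22;  // arbitrary size, below the 2^27 linear texture limit
    float *in = NULL, *out = NULL;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    // Bind the input buffer to the texture reference for the texture path.
    cudaBindTexture(NULL, texIn, in, n * sizeof(float));

    // Warm-up launches so the timed runs exclude one-time start-up costs.
    timeCopy(false, in, out, n);
    timeCopy(true, in, out, n);

    printf("global:  %.3f ms\n", timeCopy(false, in, out, n));
    printf("texture: %.3f ms\n", timeCopy(true, in, out, n));

    cudaUnbindTexture(texIn);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Both kernels read exactly the same buffer, so any timing difference should come from the load path alone.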
Can someone confirm this or show me what I'm missing?