Note that there are actually two kinds of those. You have CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY and CL_MEM_READ_WRITE when allocating your buffers, but then you also have __read_only, __write_only and __read_write to decorate your image arguments in the kernel code with.
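To make the distinction concrete, here is a minimal sketch of where each kind appears (the context, n_size and the kernel are made-up placeholders):

cl_int n_err;
cl_mem h_src = clCreateBuffer(context, CL_MEM_READ_ONLY, n_size, NULL, &n_err);
// the flag above is a host-side promise about how the device will access the buffer;
// the access qualifiers go on image arguments inside the kernel source, e.g.:
// __kernel void Copy(__read_only image2d_t t_src, __write_only image2d_t t_dest) { ... }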
These qualifiers can be used both for optimization and for error checking. Let's look at the performance first. If a write-only buffer is encountered, the writes need not be cached (as in a write-through cache), saving more of the cache for the reads. This depends a lot on the GPU hardware, and at least NVIDIA hardware does have the instructions needed to actually implement this (the .cs and .lu modifiers; refer to their PTX ISA documentation). However, I haven't seen any evidence of the compiler actually performing this optimization, e.g.:
__kernel void Memset4(__global /*__write_only*/ unsigned int *p_dest,
    const unsigned int n_dword_num)
{
    unsigned int i = get_global_id(0);
    if(i < n_dword_num)
        p_dest[i] = 0; // this
}
gets compiled as:
st.global.u32 [%r10], %r11; // no cache operation specified
This makes sense, as CUDA does not have equivalents for those qualifiers, so the compiler most likely silently ignores them. But it does not hurt to put them there; we might get luckier in the future. In CUDA, some of this functionality is exposed through the __ldg intrinsic and through compiler flags that opt in or out of caching global memory transfers in L1 (-Xptxas -dlcm=cg). You can also always use inline asm if you find that bypassing the cache yields a major advantage.
As for error checking, writing to a read-only buffer is readily prevented by using the const specifier in the kernel declaration. Disallowing reading from a write-only buffer is not possible in pure "C", though.
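For instance (a made-up kernel, just to illustrate), any attempt to write through p_src here is a compile-time error:

__kernel void Copy4(__global const unsigned int *p_src, // const: writes through p_src won't compile
    __global unsigned int *p_dest, const unsigned int n_dword_num)
{
    unsigned int i = get_global_id(0);
    if(i < n_dword_num)
        p_dest[i] = p_src[i];
}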
Another possible optimization happens when mapping those buffers to host memory. When mapping a CL_MEM_READ_ONLY buffer, the mapped region could be left uninitialized, as the host will only write to that memory for the device to read. Similarly, when unmapping a CL_MEM_WRITE_ONLY buffer, the driver does not need to copy the (potentially modified by the host) contents from host memory back to device memory. I did not measure this, though.
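A minimal sketch of the first case (queue, h_buffer and n_size are placeholders; h_buffer was created with CL_MEM_READ_ONLY):

cl_int n_err;
unsigned int *p_mapped = (unsigned int*)clEnqueueMapBuffer(queue, h_buffer, CL_TRUE,
    CL_MAP_WRITE, 0, n_size, 0, NULL, NULL, &n_err);
// the host only writes here; a driver that knows the device will never write to
// this buffer is free to hand back an uninitialized region instead of copying it
for(size_t i = 0; i < n_size / sizeof(unsigned int); ++i)
    p_mapped[i] = (unsigned int)i;
clEnqueueUnmapMemObject(queue, h_buffer, p_mapped, 0, NULL, NULL);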
As a side note, I have tried using:
inline unsigned int n_StreamingLoad(__global /*__read_only*/ const unsigned int *p_src)
{
#ifdef NVIDIA
    unsigned int n_result;
    asm("ld.global.cs.u32 %r0, [%r1];" : "=r" (n_result) : "r" (p_src)); // streaming (evict-first) load
    return n_result;
#else // NVIDIA
    return *p_src; // generic
#endif // NVIDIA
}

inline void StreamingWrite(__global /*__write_only*/ unsigned int *p_dest, const unsigned int n_value)
{
#ifdef NVIDIA
    asm("st.global.cs.u32 [%r0], %r1;" : : "r" (p_dest), "r" (n_value) : "memory"); // streaming (evict-first) store
#else // NVIDIA
    *p_dest = n_value; // generic
#endif // NVIDIA
}
which gives you about 15 extra GB/sec even on a simple memcpy kernel with sm_35 devices (tested on a GTX 780 and a K40). I haven't seen a noticeable speedup on sm_30 (I'm not sure it is even meant to be supported there, although the instructions are not being stripped from the PTX). Note that you need to define NVIDIA yourself (or see Detect OpenCL device vendor in kernel code).
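The former can be done on the host when building the program, e.g. like this minimal sketch (device and program are placeholders; needs <string.h> besides the usual OpenCL headers):

char p_vendor_name[256] = {0};
clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(p_vendor_name), p_vendor_name, NULL);
const char *p_build_options = (strstr(p_vendor_name, "NVIDIA") != NULL)? "-DNVIDIA" : "";
clBuildProgram(program, 1, &device, p_build_options, NULL, NULL);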