I've recently ported my volume raycaster from OpenGL to OpenCL, which decreased the raycaster's performance by about 90 percent. I tracked the slowdown to OpenCL's image sampling functions, which are much slower than the corresponding OpenGL texture sampling functions. With the sampling calls removed from both versions, the two implementations ran at about the same speed. To benchmark the sampling functions easily on different hardware, and to rule out some silly mistake in the rest of my raycaster's code, I wrote a small benchmark that compares the OpenCL sampling speed to the OpenGL sampling speed and ran it on several machines, but OpenCL still reached only about 10% of OpenGL's performance.
The benchmark's OpenCL host code (at least the most important part of it):
void OGLWidget::OCLImageSampleTest()
{
    try
    {
        const int size = 8;
        float Values[4 * size * size * size] = {0}; // zero-init so the copied volume holds defined data

        cl::Kernel kernel(program, "ImageSampleTest", NULL);

        // 8x8x8 RGBA float source volume, copied from host memory
        cl::ImageFormat FormatA(CL_RGBA, CL_FLOAT);
        cl::Image3D CLImage(CLcontext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            FormatA, size, size, size, 0, 0, Values, NULL);

        // 1024x1024 RGBA8 target image the kernel writes to
        cl::ImageFormat FormatB(CL_RGBA, CL_UNSIGNED_INT8);
        cl::Image2D TempImage(CLcontext, CL_MEM_WRITE_ONLY, FormatB, 1024, 1024, 0, NULL, NULL);

        kernel.setArg(0, CLImage);
        kernel.setArg(1, TempImage);

        // normalized coordinates, repeat addressing, linear filtering
        cl::Sampler Samp(CLcontext, CL_TRUE, CL_ADDRESS_REPEAT, CL_FILTER_LINEAR);
        kernel.setArg(2, Samp);

        QTime BenchmarkTimer;
        BenchmarkTimer.start();
        cl::KernelFunctor func = kernel.bind(queue, cl::NDRange(1024, 1024), cl::NDRange(32, 32));
        func().wait();
        int Duration = BenchmarkTimer.elapsed();
        printf("OCLImageSampleTest: %d ms\n", Duration);
    }
    catch (cl::Error& err)
    {
        std::cerr << "An OpenCL error occurred: " << err.what()
                  << "\nError number: " << err.err() << "\n";
    }
}
OpenCL Kernel:
kernel void ImageSampleTest(read_only image3d_t CoordTexture,
                            write_only image2d_t FrameBuffer,
                            sampler_t smp)
{
    int Screenx = get_global_id(0);
    int Screeny = get_global_id(1);
    int2 PositionOnScreen = (int2)(Screenx, Screeny);

    float4 Testvec = (float4)(1, 1, 1, 1);
    for (int i = 0; i < 2000; i++)
    {
        // the dependence on i keeps the compiler from hoisting the read out of the loop
        Testvec += read_imagef(CoordTexture, smp, (float4)(0.00000001f * i, 0, 0, 0));
    }

    uint4 ToInt = (uint4)((uint)Testvec.x, (uint)Testvec.y, (uint)Testvec.z, 1);
    write_imageui(FrameBuffer, PositionOnScreen, ToInt);
}
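One thing worth double-checking about the benchmark design is that the tiny per-iteration offset really keeps every read inside the first texel, so all 2000 reads hit the same cache line and the loop measures pure sampling throughput. A plain C++ sanity check of that coordinate math (not part of the kernel; the function name is made up):

```cpp
// Returns true when `iterations` steps of `step` (in normalized coordinates)
// stay inside the first texel of a volume with the given edge length,
// i.e. every read in the benchmark loop addresses the same texel.
static bool StaysInFirstTexel(int size, double step, int iterations)
{
    double maxCoord = step * (iterations - 1); // largest coordinate the loop uses
    return maxCoord * size < 1.0;              // texel index = floor(coord * size)
}
```

For size = 8, step = 1e-8 and 2000 iterations the coordinate only reaches 2e-5, far below one texel, so both the CL and GL loops read a constant address.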
The OpenGL fragment shader, rendered on a fullscreen quad so that it has the same number of fragments as the OpenCL kernel has work items:
#version 150

uniform sampler3D Tex;
out vec4 FragColor;

void main()
{
    FragColor = vec4(0, 0, 0, 0);
    for (int i = 0; i < 2000; i++)
    {
        FragColor += texture(Tex, vec3(0.00000001 * i, 0, 0), 0);
    }
}
Furthermore, I've already tried the following to increase the performance:

- Changing the workgroup size: no performance increase.
- Different hardware: a GTX 280, a GTX 580 and a Fermi Tesla card; all of them showed the same abysmal OpenCL performance relative to OpenGL.
- Different texture formats (bytes instead of floats), different access patterns and different texture sizes: no increase.
- Using a buffer instead of an image for the data, with a hand-written trilinear interpolation function for the sampling in the CL kernel: roughly doubled the OpenCL performance.
- Using a 2D image/texture instead of a 3D image/texture: roughly doubled the OpenCL performance, although the OpenGL performance didn't change at all.
- Using "nearest" instead of "linear" interpolation: no performance change.
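For reference, the hand-written trilinear interpolation mentioned above can be sketched on the CPU roughly like this (a minimal single-channel version with clamp-to-edge addressing, not my exact kernel code; the kernel version reads the same way from a global float buffer, and the function names here are made up):

```cpp
#include <algorithm>
#include <cmath>

// Single-channel 3D volume of edge length `size`, stored x-fastest.
static float Fetch(const float* vol, int size, int x, int y, int z)
{
    // clamp-to-edge addressing keeps the sketch short
    // (the sampler in the benchmark uses repeat instead)
    x = std::clamp(x, 0, size - 1);
    y = std::clamp(y, 0, size - 1);
    z = std::clamp(z, 0, size - 1);
    return vol[(z * size + y) * size + x];
}

// Trilinear interpolation at an unnormalized coordinate (x, y, z),
// using the usual texel-center convention (texel i is sampled at i + 0.5).
static float SampleTrilinear(const float* vol, int size, float x, float y, float z)
{
    float fx = x - 0.5f, fy = y - 0.5f, fz = z - 0.5f;
    int x0 = (int)std::floor(fx), y0 = (int)std::floor(fy), z0 = (int)std::floor(fz);
    float wx = fx - x0, wy = fy - y0, wz = fz - z0;

    // weighted sum over the 8 surrounding texels
    float c = 0.0f;
    for (int k = 0; k < 2; ++k)
        for (int j = 0; j < 2; ++j)
            for (int i = 0; i < 2; ++i)
            {
                float w = (i ? wx : 1 - wx) * (j ? wy : 1 - wy) * (k ? wz : 1 - wz);
                c += w * Fetch(vol, size, x0 + i, y0 + j, z0 + k);
            }
    return c;
}
```

That this explicit 8-fetch version outperforms the dedicated filtering hardware is exactly what surprises me.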
This left me wondering: did I make a very stupid mistake that hurts the OpenCL performance? Why is the OpenCL sampling performance so low, even though it should use the same texture hardware as OpenGL? Why is my more complex software trilinear interpolation faster than the hardware implementation? And how can I increase the sampling performance in OpenCL so that it matches OpenGL's speed?