Abysmal OpenCL ImageSampling performance vs OpenGL TextureSampling
I recently ported my volume raycaster from OpenGL to OpenCL, which decreased its performance by about 90 percent. I tracked the decrease down to OpenCL's image sampling functions, which are much slower than the corresponding OpenGL texture sampling functions. With the image sampling calls removed from the OpenCL version and the texture sampling calls removed from the OpenGL version, both raycaster implementations ran at about the same speed. To benchmark the functions easily on different hardware, and to rule out silly mistakes in the rest of my code, I wrote a small benchmark that compares OpenCL sampling speed to OpenGL sampling speed. I tested it on different machines, but OpenCL still reached only about 10% of OpenGL's performance.

The benchmark's OpenCL host code (the most important part of it):

void OGLWidget::OCLImageSampleTest()
{
    try
    {
    const int size=8; // const so that the array bound below is a compile-time constant
    float Values[4*size*size*size];
    cl::Kernel kernel=cl::Kernel(program,"ImageSampleTest",NULL);
    cl::ImageFormat FormatA(CL_RGBA,CL_FLOAT);
    cl::Image3D CLImage(CLcontext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR ,FormatA,size,size,size,0,0,Values,NULL);


    cl::ImageFormat FormatB(CL_RGBA,CL_UNSIGNED_INT8);
    cl::Image2D TempImage(CLcontext, CL_MEM_WRITE_ONLY,FormatB,1024,1024,0,NULL,NULL );


    kernel.setArg(0, CLImage);
    kernel.setArg(1, TempImage);



    cl::Sampler Samp;
    Samp() = clCreateSampler( CLcontext(), CL_TRUE, CL_ADDRESS_REPEAT, CL_FILTER_LINEAR, NULL);
    kernel.setArg(2, Samp);

    QTime BenchmarkTimer=QTime();
    BenchmarkTimer.start();

    cl::KernelFunctor func = kernel.bind(queue, cl::NDRange(1024,1024), cl::NDRange(32,32));
    func().wait();

    int Duration =  BenchmarkTimer.elapsed();
    printf("OCLImageSampleTest: %d ms \n", Duration);
    }
    catch (cl::Error& err)
      {
        std::cerr << "An OpenCL error occurred, " << err.what()
                  << "\nError num of " << err.err() << "\n";
        return;
      }

}

OpenCL Kernel:

void kernel ImageSampleTest( read_only image3d_t CoordTexture, write_only image2d_t FrameBuffer, sampler_t smp)
{
    int Screenx = get_global_id(0);
    int Screeny = get_global_id(1);

    int2 PositionOnScreen=(int2)(Screenx,Screeny);

    float4 Testvec=(float4)(1,1,1,1);
    for(int i=0; i< 2000; i++)
    {
        Testvec+= read_imagef(CoordTexture,smp, (float4)(0+0.00000001*i,0,0,0)); // i makes sure that the compiler doesn't unroll the loop
    }

    uint4 ToInt=(uint4)( (uint) (Testvec.x), (uint) (Testvec.y) ,(uint)(Testvec.z),1);
    write_imageui( FrameBuffer, PositionOnScreen, ToInt );
}

OpenGL fragment shader for a fullscreen quad, which has the same number of fragments as the OpenCL kernel has work items:

#version 150
uniform sampler3D Tex;
out vec4 FragColor;

void main()
{
    FragColor=vec4(0,0,0,0);
    for(int i=0; i<2000; i++)
    {
        FragColor+= texture(Tex,vec3(0+0.00000001*i,0,0),0);
    }
}

Furthermore, I've already tried the following to increase the performance:

-Changing the work-group size: no performance increase

-Different hardware: GTX 280, GTX 580, and a Fermi Tesla card; all of them showed the same abysmal OpenCL performance compared to OpenGL

-Different texture formats (bytes instead of floats), different access patterns and different texture sizes: no increase

-Using a buffer instead of an image for the data, with a self-written trilinear interpolation function for the sampling in the CL kernel: increased the OpenCL performance by about 100 %

-Using a 2D image/texture instead of a 3D image/texture: this increased the OpenCL performance by 100 %, although the OpenGL performance didn't change at all

-Using "nearest" instead of "linear" interpolation: no performance change
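The buffer-based workaround in the list above relies on a hand-written trilinear interpolation. The question does not show that function, so the following is only a minimal CPU-side C++ sketch of the general technique, not the author's kernel code. The names `Volume`, `lerpf` and `sampleTrilinear` are hypothetical, the volume is assumed to be single-channel and stored x-fastest in a flat array, coordinates are assumed to be in voxel units with clamp-to-edge addressing, and the half-texel offset that `CL_FILTER_LINEAR` applies is deliberately left out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container: a single-channel volume stored x-fastest in a flat array.
struct Volume {
    int nx, ny, nz;
    std::vector<float> data;  // expected size: nx * ny * nz

    // Fetch one voxel, clamping each index to the valid range (clamp-to-edge).
    float at(int x, int y, int z) const {
        x = std::min(std::max(x, 0), nx - 1);
        y = std::min(std::max(y, 0), ny - 1);
        z = std::min(std::max(z, 0), nz - 1);
        return data[(static_cast<std::size_t>(z) * ny + y) * nx + x];
    }
};

// Linear interpolation between a and b with weight t in [0, 1].
inline float lerpf(float a, float b, float t) { return a + t * (b - a); }

// Trilinear sample at a position given in voxel units
// (assumption: voxel values sit at integer coordinates).
float sampleTrilinear(const Volume& v, float x, float y, float z) {
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const int z0 = static_cast<int>(std::floor(z));
    const float tx = x - static_cast<float>(x0);
    const float ty = y - static_cast<float>(y0);
    const float tz = z - static_cast<float>(z0);

    // Interpolate along x on the four edges of the surrounding cell ...
    const float c00 = lerpf(v.at(x0, y0,     z0),     v.at(x0 + 1, y0,     z0),     tx);
    const float c10 = lerpf(v.at(x0, y0 + 1, z0),     v.at(x0 + 1, y0 + 1, z0),     tx);
    const float c01 = lerpf(v.at(x0, y0,     z0 + 1), v.at(x0 + 1, y0,     z0 + 1), tx);
    const float c11 = lerpf(v.at(x0, y0 + 1, z0 + 1), v.at(x0 + 1, y0 + 1, z0 + 1), tx);

    // ... then along y, then along z.
    const float c0 = lerpf(c00, c10, ty);
    const float c1 = lerpf(c01, c11, ty);
    return lerpf(c0, c1, tz);
}
```

Each sample costs eight memory fetches and seven linear interpolations. That is the work the texture unit would normally do in hardware, which is why it is surprising that this approach came out faster in the measurements above.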

This left me wondering: Did I make a very stupid mistake that decreases the OpenCL performance? Why is the OpenCL sampling performance so low, although it should use the same texture hardware as OpenGL? Why is my complex trilinear interpolation function faster than its hardware implementation? How can I increase the sampling performance in OpenCL so that I get the same speed as in OpenGL?

Fiche answered 11/6, 2012 at 16:18 Comment(10)
Do you have the latest drivers? I'm sure the OpenGL bits haven't changed recently, but the OpenCL stuff should have! - Madaih
Yes, they are version 301.32, the same version Nvidia currently offers on its download page. - Fiche
Have you tried using CL/GL interop? (khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/…) I have used this in the past because I was doing a fair bit of rendering using OpenGL along with computes using OpenCL. This is probably not your final solution, but it might help shed light on the actual problem. - Madaih
No, unfortunately not yet, but I'll try it ASAP. - Fiche
It appears that the reads are all coming from the same location. Perhaps OpenGL does a better job of caching this result than OpenCL. Do you get similar performance if you read from unique locations (perhaps make the location a transformed version of the PositionOnScreen)? - Undetermined
I'm a bit rusty at OCL, but isn't the kernel doing a memcopy when it initializes, which is slow? I mean, is the image already in memory like the framebuffer? In my experience with OpenCL, the initial memory setup is the slowest part 9 times out of 10. OpenCL is therefore suited to many iterations over the same data, not to initializing kernels. - Toadinthehole
I have the feeling that OpenCL is not taking the liberty of swizzling your texture in memory (cf. en.wikipedia.org/wiki/Z-order_curve), which would result in very bad cache access performance, ESPECIALLY with 3D textures. - Fourflush
Another point I'd like to add here is that Nvidia is actively trying to squash OpenCL usage to promote CUDA (and hence vendor lock-in). I wouldn't depend on their OpenCL implementation outperforming the same program running on a (comparable) Intel or AMD GPU. - Madaih
Why are you prohibiting the compiler from unrolling your loop? The for loop will have many iterations and much of your time will be spent executing compare and jump instructions. IIRC graphics hardware doesn't have branch prediction, which would make it slow? - Frendel
@Fiche It's been 2.5 years since this question was originally posted. I'm curious: are you or other users still seeing OpenGL outperform OpenCL (for this test) with more recent versions of OpenCL and OpenGL on newer hardware? - Needs
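One of the comments above speculates about texture swizzling along a Z-order curve. To illustrate what such a layout means, here is a small C++ sketch that computes a 3D Morton (Z-order) index by interleaving the bits of the three voxel coordinates. It only illustrates the curve itself; the names `spreadBits3` and `mortonIndex3D` are hypothetical, coordinates are assumed to fit in 10 bits each, and nothing here is meant to describe how any particular driver actually lays out image memory.

```cpp
#include <cstdint>

// Spread the low 10 bits of v so that two zero bits separate
// each pair of neighbouring original bits (bit k moves to bit 3k).
inline std::uint32_t spreadBits3(std::uint32_t v) {
    v &= 0x000003FFu;
    v = (v | (v << 16)) & 0xFF0000FFu;
    v = (v | (v << 8))  & 0x0300F00Fu;
    v = (v | (v << 4))  & 0x030C30C3u;
    v = (v | (v << 2))  & 0x09249249u;
    return v;
}

// Interleave x, y and z (each below 1024) into one Z-order index.
// Voxels that are close together in 3D tend to get nearby indices,
// which is the property a swizzled layout uses to improve cache locality.
inline std::uint32_t mortonIndex3D(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return spreadBits3(x) | (spreadBits3(y) << 1) | (spreadBits3(z) << 2);
}
```

In a plain row-major layout, stepping one voxel in z jumps a whole slice (`nx * ny` elements) ahead in memory. With the index above, the eight voxels of an aligned 2x2x2 cell occupy eight consecutive indices, so the neighbours fetched by a trilinear lookup are more likely to share a cache line.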

I suspect there is an issue with OpenCL in the latest NVIDIA drivers on some video cards. Here and here are some reports about it. Try repeating the test on a GPU from another family.

Sabol answered 4/11, 2013 at 7:55 Comment(0)
