Can we benchmark how fast CUDA or OpenCL is compared to CPU performance?

Asked 24/11, 2010 at 15:2 Answered 15/9, 2016 at 16:34

How much faster can an algorithm run in CUDA or OpenCL code compared to a single general-purpose processor core (assuming the algorithm is written and optimized for both the CPU and the GPU target)?

I know it depends on both the graphics card and the CPU, but say, one of NVIDIA's fastest GPUs and a single core of an Intel i7 processor?

And I know it also depends on the type of algorithm.

I do not need a strict answer, but examples from experience, like: an image manipulation algorithm using double-precision floating point and 10 operations per pixel first took 5 minutes and now runs in x seconds using this hardware.

Tegan answered 24/11, 2010 at 15:2 Comment(3)

too many unknowns: fast, very fast.... – Blackstock 24/11, 2010 at 15:4

I have changed the question so it opens the possibility to say: "No, it's not possible" or "yes, there is a benchmark suite that does this kind of comparison", etc. – Tegan 25/11, 2010 at 6:45

Related: CPU vs GPU performance comparision with OpenCL – Seedy 13/6, 2020 at 13:44

Your question is overly broad and very difficult to answer. Moreover, only a small percentage of algorithms (the ones without much shared state) are feasible on GPUs.

But I do want to urge you to be critical of claims. I work in image processing and have read many articles on the subject, and quite often in the GPU case the time to upload input data to the GPU and download the results back to main memory is not included in the claimed speedup factor.

While there are a few cases where this doesn't matter (both are small or there is a second stage calculation that further reduces the result in size), usually one does have to transfer the results and initial data.

I've seen this turn a claimed gain into a loss, because the upload/download time alone was longer than the CPU would need to do the whole calculation.
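To make the point about transfer time concrete, here is a minimal Python sketch of the arithmetic. The function name and the timings in the usage lines are illustrative assumptions, not measurements from any article or device.

```python
def effective_speedup(cpu_s, kernel_s, upload_s=0.0, download_s=0.0):
    """Speedup of the GPU path over the CPU path, counting transfers.

    All arguments are wall-clock times in seconds. A result below 1.0
    means the GPU path is slower than the CPU once transfers are included.
    """
    gpu_total_s = upload_s + kernel_s + download_s
    if gpu_total_s <= 0:
        raise ValueError("total GPU time must be positive")
    return cpu_s / gpu_total_s


# Hypothetical timings, for illustration only.
kernel_only = effective_speedup(cpu_s=1.0, kernel_s=0.05)
with_transfers = effective_speedup(cpu_s=1.0, kernel_s=0.05,
                                   upload_s=0.75, download_s=0.45)
print(f"kernel only: {kernel_only:.1f}x, with transfers: {with_transfers:.2f}x")
```

With these made-up numbers, a large kernel-only factor drops below 1x once the copies are counted, which is the pattern described above.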

Pretty much the same thing applies to combining results of different GPU cards.

Update: Newer GPUs seem to be able to upload/download and calculate at the same time using ping-pong buffers. But the advice to check the conditions behind any claimed speedup thoroughly still stands. There is a lot of spin out there.
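As a rough way to see what overlapping transfers and computation can buy, the sketch below compares a fully serial schedule with an idealized pipeline. It assumes the data is split into equal chunks and that the device can upload, compute, and download different chunks at the same time; real hardware and drivers may overlap less. The function names and stage times are hypothetical.

```python
def serial_time(n_chunks, upload_s, compute_s, download_s):
    """Every chunk is uploaded, processed, and downloaded one after another."""
    return n_chunks * (upload_s + compute_s + download_s)


def overlapped_time(n_chunks, upload_s, compute_s, download_s):
    """Idealized three-stage pipeline.

    The first chunk pays for all three stages; every further chunk
    only adds the time of the slowest stage.
    """
    if n_chunks < 1:
        return 0.0
    slowest = max(upload_s, compute_s, download_s)
    return (upload_s + compute_s + download_s) + (n_chunks - 1) * slowest


# Hypothetical stage times in seconds per chunk.
print(serial_time(4, 1.0, 2.0, 1.0), overlapped_time(4, 1.0, 2.0, 1.0))
```

Even in this best case the total is bounded by the slowest stage, so when transfers take longer than the kernel, overlapping hides the computation rather than the transfer.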

Update 2: Quite often, using a GPU that is shared with video output for this is not optimal. Consider, e.g., adding a low-budget card for video and using the onboard video for GPGPU tasks.

Okoka answered 24/11, 2010 at 15:12 Comment(2)

Thanks, mentioning the up/downloading is valuable to know. And giving the answer that it is way too broad also. – Tegan 24/11, 2010 at 20:47

Yep, I can confirm that up/downloading can end up slower than processing on the CPU. But another thing to consider is that you can use OpenCL on a CPU device to utilize multiple processors and vector instructions (SSEx) in a pretty simple way. I've implemented some image processing functions in OpenCL and run them on the CPU, which works great. (Additional plus: use SSE in Java via OpenCL on CPU.) – Brachylogy 19/1, 2011 at 15:35

I think this video introduction to OpenCL gives a good answer to your question in the first or second episode (I do not remember which). I think it was at the end of the first episode...

In general it depends on how well you can "parallelize" the problem. The problem size itself is also a factor, because it costs time to copy the data to the graphics card.
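One common way to put a number on "how well you can parallelize" is Amdahl's law. The answer does not mention it by name; it is used here only as an illustrative model, and the function name and example values are assumptions.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only part of the run time is accelerated.

    parallel_fraction: share of the original run time that can be
        parallelized (0.0 to 1.0).
    parallel_speedup: how much faster that share runs on the GPU.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_speedup <= 0:
        raise ValueError("parallel_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


# Hypothetical example: 90% of the work parallelizes and runs 50x faster.
print(f"{amdahl_speedup(0.9, 50.0):.2f}x overall")
```

The serial share, which in practice includes the copy to and from the card, caps the overall gain no matter how fast the parallel part becomes.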

Sauveur answered 24/11, 2010 at 15:6 Comment(0)

Your question is, in general, hard to answer; there are simply too many variables to give answers that are either accurate or fair.

Notably, you are comparing 1) choice of algorithm, 2) relative performance of hardware, 3) compiler optimisation ability, 4) choice of implementation language, and 5) efficiency of the algorithm's implementation, all at the same time...

Note that, for example, different algorithms may be preferable on GPU vs CPU; and data transfers to and from GPU need to be accounted for in timings, too.

AMD has a case study (several, actually) on the performance of OpenCL code executing on the CPU and on the GPU. Here is one with performance results for sparse matrix-vector multiply.

Illdefined answered 24/11, 2010 at 15:10 Comment(0)

It depends very much on the algorithm and how efficient the implementation can be.

Overall, it's fair to say that GPUs are better at raw computation than CPUs. Thus, a rough upper bound on the speedup is the theoretical GFlops rating of a top-end GPU divided by that of a top-end CPU. You can do a similar computation for theoretical memory bandwidth.

For example, 1581.1 GFlops for a GTX580 vs. 107.55 GFlops for an i7 980XE is a ratio of about 14.7x. Note that the rating for the GTX580 is for single precision. I believe you need to cut that down by a factor of 4 for Fermi-class non-Tesla cards to get the double-precision rating. So in this instance, you might expect roughly 4x in double precision.
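The arithmetic behind these ratios can be written out as a short Python sketch. The two GFlops figures are the ones quoted above; the divisor of 4 is the answer's own guess rather than a verified specification, and the answer does not say which precision the CPU figure refers to, so it is used as given.

```python
GTX580_SP_GFLOPS = 1581.1        # single-precision rating quoted above
I7_980XE_GFLOPS = 107.55         # CPU rating quoted above
ASSUMED_SP_TO_DP_DIVISOR = 4     # the answer's guess for non-Tesla Fermi cards


def peak_ratio(gpu_gflops, cpu_gflops):
    """Ratio of theoretical peaks: an upper bound, not a measured speedup."""
    return gpu_gflops / cpu_gflops


sp_ratio = peak_ratio(GTX580_SP_GFLOPS, I7_980XE_GFLOPS)
dp_ratio = peak_ratio(GTX580_SP_GFLOPS / ASSUMED_SP_TO_DP_DIVISOR,
                      I7_980XE_GFLOPS)
print(f"single precision: {sp_ratio:.1f}x, double precision: {dp_ratio:.1f}x")
```

The same function can be reused with theoretical memory bandwidth figures in place of the GFlops ratings.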

Caveats on why you might do better (or see results which claim far bigger speedups):

GPUs have better memory bandwidth than CPUs once the data is on the card, so memory-bound algorithms can sometimes do well on the GPU.
Clever use of caches (texture memory, etc.) can let you do better than the advertised bandwidth.
As Marco says, the transfer time may not have been included. I always include that time in my own work, and I have found the biggest speedups in iterative algorithms where all the data fits on the GPU (I have personally seen over 300x going from a midrange CPU to a midrange GPU).
Apples-to-oranges comparisons. Comparing a top-end GPU with a low-end CPU is inherently unfair. The rebuttal is that a high-end CPU costs much more than a high-end GPU; once you compare GFlops/$ or GFlops/Watt, the result can look much more favorable to the GPU.

Chewning answered 28/2, 2011 at 15:50 Comment(0)

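__kernel void vecAdd(__global float* results)
{
   // Each work-item only looks up its global index and does no other work,
   // so timing this kernel approximates pure work-item launch overhead.
   int id = get_global_id(0);
}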

This kernel code can spawn 16M threads on a new $60 R7-240 GPU in 10 milliseconds.

This is equivalent to 16 thread creations or context switches every 10 nanoseconds. What is the timing for a $140 FX-8150 8-core CPU? It is 1 thread in 50 nanoseconds per core.

Every instruction added to this kernel is a win for the GPU until the kernel starts branching.
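The launch-rate comparison above can be checked with a few lines of Python. The figures are the ones quoted in this answer; reading "16M" as 16 million and assuming all 8 CPU cores create threads independently are assumptions made for this sketch.

```python
GPU_THREADS = 16_000_000            # "16M", read as 16 million (assumption)
GPU_LAUNCH_S = 10e-3                # 10 milliseconds for all GPU threads
CPU_NS_PER_THREAD_PER_CORE = 50.0   # quoted figure for one FX-8150 core
CPU_CORES = 8                       # assumes cores work independently


def ns_per_thread(threads, seconds):
    """Average cost of starting one thread, in nanoseconds."""
    return seconds * 1e9 / threads


gpu_ns = ns_per_thread(GPU_THREADS, GPU_LAUNCH_S)
cpu_ns = CPU_NS_PER_THREAD_PER_CORE / CPU_CORES
print(f"GPU: {gpu_ns:.3f} ns/thread, CPU (8 cores): {cpu_ns:.2f} ns/thread, "
      f"ratio: {cpu_ns / gpu_ns:.1f}x")
```

This only compares thread start-up cost for an empty kernel; it says nothing about how fast useful work runs once the threads exist.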

Phiz answered 15/9, 2016 at 16:34 Comment(0)

I've seen figures ranging from 2x to 400x. I also know that mid-range GPUs cannot compete with high-end CPUs in double-precision computation: MKL on an 8-core Xeon will be faster than CULA or CUBLAS on a $300 GPU.

OpenCL is anecdotally much slower than CUDA.

Cofferdam answered 24/11, 2010 at 15:4 Comment(2)

I’ve seen figures from 0.1x to 400x. It’s important to recognize that GPUs aren’t well-suited for every task and that even a well-optimized algorithm may actually be slower (low computational density, large data set, low locality of reference, large interdependence, divergent control flow). – Kluge 24/11, 2010 at 15:22

OpenCL usually performs quite on par with CUDA nowadays. It's not exactly a surprise, they are architecturally very similar and even the implementations share a lot (the PTX IR, for example). Please also consider that OpenCL favors correctness over performance more than CUDA by default. – Lissy 24/11, 2010 at 15:35

A new benchmark suite called SHOC (Scalable Heterogeneous Computing) from Oak Ridge National Lab and Georgia Tech has both OpenCL and CUDA implementations of many important kernels. You can download the suite from http://bit.ly/shocmarx. Enjoy.

Oscan answered 24/11, 2010 at 20:52 Comment(0)
