Aparapi GPU execution slower than CPU

I am trying to test the performance of Aparapi. I have seen blog posts whose results show that Aparapi does improve performance for data-parallel operations.

But I am not able to reproduce that in my tests. Here is what I did: I wrote two programs, one using Aparapi and the other using normal loops.

Program 1: using Aparapi

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class App 
{
    public static void main( String[] args )
    {
        final int size = 50000000;

        final float[] a = new float[size];
        final float[] b = new float[size];

        for (int i = 0; i < size; i++) {
           a[i] = (float) (Math.random() * 100);
           b[i] = (float) (Math.random() * 100);
        }

        final float[] sum = new float[size];

        Kernel kernel = new Kernel(){
           @Override public void run() {
              int gid = getGlobalId();
              sum[gid] = a[gid] + b[gid];
           }
        };
        long t1 = System.currentTimeMillis();
        kernel.execute(Range.create(size));
        long t2 = System.currentTimeMillis();
        System.out.println("Execution mode = "+kernel.getExecutionMode());
        kernel.dispose();
        System.out.println(t2-t1);
    }
}

Program 2: using loops

public class App2 {

    public static void main(String[] args) {

        final int size = 50000000;

        final float[] a = new float[size];
        final float[] b = new float[size];

        for (int i = 0; i < size; i++) {
           a[i] = (float) (Math.random() * 100);
           b[i] = (float) (Math.random() * 100);
        }

        final float[] sum = new float[size];
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < size; i++) {
            sum[i] = a[i] + b[i];
        }

        long t2 = System.currentTimeMillis();
        System.out.println(t2-t1);

    }
}

Program 1 takes around 330 ms whereas Program 2 takes only around 55 ms. Am I doing something wrong here? I did print out the execution mode in the Aparapi program, and it reports that the execution mode is GPU.

Moor answered 30/9, 2015 at 5:54 Comment(2)
Just an idea: doesn't your timing include the GPU initialisation time? Could you try again with 2 runs in a row inside your code (the first serving as a "warm-up" and the second being for real) and only timing the second? – Circumflex
@Circumflex Thanks for the suggestion. I ran each loop 4 times, and the result is still the same: 862 ms vs 216 ms. – Moor

You did not do anything wrong - except for the benchmark itself.

Benchmarking is always tricky, particularly in cases where a JIT is involved (as for Java) and for libraries where many nitty-gritty details are hidden from the user (as for Aparapi). In both cases, you should at least execute the code section that you want to benchmark multiple times.

For the Java version, one might expect the computation time for a single execution of the loop to decrease when the loop itself is executed multiple times, due to the JIT kicking in. There are many additional caveats to consider - for details, you should refer to this answer. In this simple test, the effect of the JIT may not really be noticeable, but in more realistic or complex scenarios, this will make a difference. Anyhow: when repeating the loop 10 times, the time for a single execution of the loop on my machine was about 70 milliseconds.
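
For illustration, a minimal sketch of such a repeated measurement for the plain Java loop (reusing size, a, b, and sum from the question; the run count of 10 is arbitrary):

for (int run = 0; run < 10; run++) {
    long t1 = System.currentTimeMillis();
    for (int i = 0; i < size; i++) {
        sum[i] = a[i] + b[i];
    }
    long t2 = System.currentTimeMillis();
    // later runs should get faster as the JIT kicks in
    System.out.println("Run " + run + ": " + (t2 - t1) + " ms");
}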

For the Aparapi version, the point of possible GPU initialization was already mentioned in the comments. And here, this is indeed the main problem: when running the kernel 10 times, the timings (in milliseconds) on my machine are

1248
72
72
72
73
71
72
73
72
72

You see that the initial call causes all the overhead. The reason for this is that, during the first call to Kernel#execute(), it has to do all the initialization (basically converting the bytecode to OpenCL, compiling the OpenCL code, etc.). This is also mentioned in the documentation of the KernelRunner class:

The KernelRunner is created lazily as a result of calling Kernel.execute().

The effect of this - namely, a comparatively large delay for the first execution - has led to this question on the Aparapi mailing list: A way to eagerly create KernelRunners. The only workaround suggested there was to create an "initialization call" like

kernel.execute(Range.create(1));

without a real workload, only to trigger the whole setup, so that the subsequent calls are fast. (This also works for your example).
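
In code, this warm-up could look as follows (a sketch reusing kernel and size from the question):

// Dummy call: triggers the bytecode-to-OpenCL conversion, the OpenCL
// compilation and the buffer setup, once, before the real measurement
kernel.execute(Range.create(1));

// Subsequent calls no longer pay the one-time initialization cost
long t1 = System.currentTimeMillis();
kernel.execute(Range.create(size));
long t2 = System.currentTimeMillis();
System.out.println(t2 - t1);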


You may have noticed that, even after the initialization, the Aparapi version is still not faster than the plain Java version. The reason for that is that the task of a simple vector addition like this is memory bound - for details, you may refer to this answer, which explains this term and some issues with GPU programming in general.
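
A rough back-of-the-envelope calculation illustrates this (the ~16 GB/s figure is an assumed PCIe 3.0 x16 bandwidth, not a measurement):

Bytes per element: 2 reads + 1 write, 4 bytes each = 12 bytes
Total traffic:     50,000,000 * 12 bytes           = 600 MB
Transfer time:     600 MB / ~16 GB/s               ≈ 37 ms

So just moving the data over the bus already takes roughly as long as the entire CPU loop, regardless of how fast the GPU performs the additions.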

As a deliberately contrived example of a case where you might benefit from the GPU, you can modify your test to create an artificial, compute-bound task: when you change the kernel to involve some expensive trigonometric functions, like this

Kernel kernel = new Kernel() {
    @Override
    public void run() {
        int gid = getGlobalId();
        sum[gid] = (float)(Math.cos(Math.sin(a[gid])) + Math.sin(Math.cos(b[gid])));
    }
};

and the plain Java loop version accordingly, like this

for (int i = 0; i < size; i++) {
    sum[i] = (float)(Math.cos(Math.sin(a[i])) + Math.sin(Math.cos(b[i])));
}

then you will see a difference. On my machine (GeForce 970 GPU vs. AMD K10 CPU) the timings are about 140 milliseconds for the Aparapi version, and a whopping 12000 milliseconds for the plain Java version - that's a speedup of nearly 90 through Aparapi!

Also note that even in CPU mode, Aparapi may offer an advantage compared to plain Java. On my machine, in CPU mode, Aparapi needs only 2300 milliseconds, because it still parallelizes the execution using a Java thread pool.
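
If you want to reproduce this comparison, the execution mode can be forced explicitly (a sketch; setExecutionMode is deprecated in newer Aparapi releases in favor of selecting a Device, but it works with the com.amd.aparapi version used here):

// Force the Java Thread Pool fallback instead of the GPU:
kernel.setExecutionMode(Kernel.EXECUTION_MODE.JTP);
kernel.execute(Range.create(size));
System.out.println("Execution mode = " + kernel.getExecutionMode());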

Epistyle answered 16/10, 2015 at 19:30 Comment(4)
Thanks Marco, I changed the code as per your suggestion, but now I am getting the warning "WARNING: Reverting to Java Thread Pool (JTP) for class JavaCL.App$1: FP64 required but not supported" and it is falling back to JTP. I changed the code back to the simple vector addition and the execution mode is GPU. My graphics card is an AMD Radeon R9 370 MX. Does that mean that my graphics card does not support FP64? – Moor
I couldn't figure it out from here: amd.com/en-us/products/graphics/notebook/r9-m200 or from here: graphics-cards.specout.com/l/6008/AMD-Radeon-R9-M370X – Moor
Yes, the message basically means that the graphics card does not support double computations, but only float. I'm pretty sure that it should be possible to do the sin/cos computations with float arguments only, so that the code can properly be translated to sinf/cosf calls in Aparapi, but I'd have to try this out or have another look at the source code. However, the main point was to show that with more computations, the GPU will be faster than the CPU. You might observe the same effect with something like result[gid] = a[gid]*b[gid]*a[gid]*b[gid]*a[gid]*b[gid]; or so (see the float-only sketch after these comments). – Epistyle
Just for comparison: a pipelined, host-code-optimized C++ OpenCL program using the same kernel (but with a float4) for 64M floats yields 190 milliseconds on an R7-240, 67 milliseconds on an FX8150, and 60 milliseconds using both. So Aparapi is absolutely doing its job. This is a PCIe-bandwidth-bottlenecked operation for a GPU, and CPU OpenCL simply isn't as fast as AVX (or SSE) code. Performance may vary by 10% to 90%, but code-writing time decreases by 1000% to 10000%. – Ineducable
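
Following up on the FP64 discussion above: a float-only variant of the compute-bound kernel, along the lines of Epistyle's suggestion, might look like this (a sketch, untested on that particular card):

Kernel kernel = new Kernel() {
    @Override
    public void run() {
        int gid = getGlobalId();
        // Plain float multiplications - no double precision (FP64) required:
        sum[gid] = a[gid] * b[gid] * a[gid] * b[gid] * a[gid] * b[gid];
    }
};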

Just add, before the main kernel execution loop,

kernel.setExplicit(true);
kernel.put(a);
kernel.put(b);

and

kernel.get(sum);

after it.

As the Aparapi documentation on explicit buffer handling explains (its example uses an array called hugeArray): although Aparapi does analyze the byte code of the Kernel.run() method (and any method reachable from Kernel.run()), Aparapi has no visibility into the call site. In that code there is no way for Aparapi to detect that hugeArray is not modified within the for loop body. Unfortunately, Aparapi must default to being 'safe' and copy the contents of hugeArray backwards and forwards to the GPU device.

https://github.com/aparapi/aparapi/blob/master/doc/ExplicitBufferHandling.md
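
Putting it together, a minimal sketch (reusing kernel, a, b, sum, and size from the question; the run count of 10 is arbitrary):

// Take over buffer management from Aparapi:
kernel.setExplicit(true);
kernel.put(a);   // copy the inputs to the device once
kernel.put(b);

for (int run = 0; run < 10; run++) {
    // No implicit host<->device transfers per call anymore:
    kernel.execute(Range.create(size));
}

kernel.get(sum); // copy the result back once, after the loop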

Cymry answered 21/7, 2018 at 21:21 Comment(1)
This is an underrated answer. However, I'm not seeing any significant performance improvements. This might be because of the reasons mentioned in the accepted answer. – Jyoti
