How to optimize matrix multiplication using OpenACC?

I am learning OpenACC (with PGI's compiler) and trying to optimize the matrix multiplication example. The fastest implementation I have come up with so far is the following:

void matrix_mul(float *restrict r, float *a, float *b, int N, int accelerate)
{
   // Copy a and b to the device, copy r back to the host; offload only if 'accelerate' is set
   #pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
   {
      #pragma acc region if(accelerate)
      {
         // Each (i, j) pair computes one element of the column-major result matrix
         #pragma acc loop independent vector(32)
         for (int j = 0; j < N; j++)
         {
            #pragma acc loop independent vector(32)
            for (int i = 0; i < N; i++)
            {
               float sum = 0;
               for (int k = 0; k < N; k++) {
                  sum += a[i + k * N] * b[k + j * N];
               }
               r[i + j * N] = sum;
            }
         }
      }
   }
}

This results in 32x32 thread blocks and gives me the best performance so far. Here are the benchmarks:

Matrix multiplication:
GPU: GeForce GT 650M, 64-bit Linux

Data size             : 1500 x 1500
Unaccelerated:
     matrix_mul() time    : 5873.255333 msec
Accelerated:
     matrix_mul() time    : 420.414700 msec

Data size             : 1750 x 1750
    matrix_mul() time    : 876.271200 msec
Data size             : 2000 x 2000
    matrix_mul() time    : 1147.783400 msec
Data size             : 2250 x 2250
    matrix_mul() time    : 1863.458100 msec
Data size             : 2500 x 2500
    matrix_mul() time    : 2516.493200 msec

Unfortunately, I realized that the generated CUDA code is quite primitive (e.g., it does not even use shared memory) and hence cannot compete with a hand-optimized CUDA program. As a reference implementation I took the ArrayFire library, with the following results:

Arrayfire 1500 x 1500 matrix mul
CUDA toolkit 4.2, driver 295.59
GPU0 GeForce GT 650M, 2048 MB, Compute 3.0 (single,double)
Memory Usage: 1932 MB free (2048 MB total)
af:  0.03166 seconds

Arrayfire 1750 x 1750 matrix mul
 af:  0.05042 seconds
Arrayfire 2000 x 2000 matrix mul
 af:  0.07493 seconds
Arrayfire 2250 x 2250 matrix mul
 af:  0.10786 seconds
Arrayfire 2500 x 2500 matrix mul
 af:  0.14795 seconds

I wonder if there are any suggestions on how to get better performance from OpenACC? Perhaps my choice of directives is not right?

Deegan asked 3/8, 2012 at 8:13 Comment(2)
This problem illustrates the different approaches of compiler directives vs. CUDA/OpenCL. CUDA/OpenCL is much closer to the hardware, so you can optimize and tweak for a specific platform. You could unroll the inner loop, computing 2, 4, or 8 sums at a time, thus reducing the number of inner-loop iterations (a sketch of this idea follows after these comments). – Shiverick
Huh, good idea, thanks. Yes, I know CUDA/OpenCL can be considered "low-level" APIs; I myself am from the old CUDA school. On the other hand, OpenACC has more potential in the future because it is not limited to GPUs only, and of course it lowers development costs. Still, it would be nice if OpenACC compilers could exploit the GPU's shared memory for computations: I know there is the OpenACC 'cache' directive, but I could not manage to make it work. – Deegan
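
A minimal sketch of the unrolling idea from the comment above, applied to the original kernel (hypothetical and untested; it assumes N is a multiple of 4):

void matrix_mul_unroll4(float *restrict r, float *a, float *b, int N, int accelerate)
{
   #pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
   {
      #pragma acc region if(accelerate)
      {
         #pragma acc loop independent vector(32)
         for (int j = 0; j < N; j++)
         {
            #pragma acc loop independent vector(32)
            for (int i = 0; i < N; i++)
            {
               // Four independent partial sums expose more instruction-level parallelism
               float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
               for (int k = 0; k < N; k += 4) {
                  sum0 += a[i + (k    ) * N] * b[(k    ) + j * N];
                  sum1 += a[i + (k + 1) * N] * b[(k + 1) + j * N];
                  sum2 += a[i + (k + 2) * N] * b[(k + 2) + j * N];
                  sum3 += a[i + (k + 3) * N] * b[(k + 3) + j * N];
               }
               r[i + j * N] = sum0 + sum1 + sum2 + sum3;
            }
         }
      }
   }
}

Whether this actually helps depends on how aggressively the compiler already unrolls the k loop; the independent accumulators mainly give it more room to overlap memory loads with arithmetic.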

You're getting right at a 14x speedup (5873 ms down to 420 ms), which is pretty good for PGI's compiler in my experience.

First off, are you compiling with -Minfo? That will give you a lot of feedback from the compiler regarding optimization choices.
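
For example, a typical PGI invocation might look something like this (the exact flags depend on your setup; matmul.c is just a placeholder file name):

pgcc -acc -ta=nvidia -Minfo=accel -o matmul matmul.c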

You are using a 32x32 thread block, but in my experience 16x16 thread blocks tend to get better performance. If you omit the vector(32) clauses, what scheduling does the compiler choose?
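
To try a 16x16 schedule you would just change the clause values, e.g. (a sketch; the inner k loop stays exactly as in your version):

#pragma acc loop independent vector(16)
for (int j = 0; j < N; j++)
{
   #pragma acc loop independent vector(16)
   for (int i = 0; i < N; i++)
   {
      /* inner k loop unchanged */
   }
}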

Declaring a and b with restrict might let the compiler generate better code.
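
For example, a sketch of the signature only (the body stays the same):

/* restrict asserts that r, a, and b never alias each other */
void matrix_mul(float *restrict r, float *restrict a, float *restrict b,
                int N, int accelerate);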

Just by looking at your code, I'm not sure that shared memory would help performance. Shared memory only helps improve performance if your code can store and reuse values there instead of going to global memory. In this case you're not reusing any part of a or b after reading it.

It's also worth noting that I've had bad experiences with PGI's compiler when it comes to shared memory usage. It will sometimes do funny stuff and cache the wrong values (this seems to happen mostly when a loop is iterated backward), generating wrong results. I actually have to compile my current application with the undocumented -ta=nvidia,nocache option, which bypasses shared memory usage altogether, to get it to work correctly.

Sible answered 13/8, 2012 at 11:54 Comment(1)
Yes, I tried the 16x16 case, but it actually runs slower. I assume this is precisely because no shared memory is used; hence, the more threads we get per block, the greater the effect of "caching" intermediate results in registers. Shared memory can indeed help performance here; have a look at the matrix multiplication example in the CUDA SDK. If I remove the vector(32) clauses, the compiler just vectorizes by rows of the matrix (not by 2D tiles) and the performance drops. Anyway, thanks for the good advice. – Deegan
