How to optimize matrix multiplication using OpenACC?

I am learning OpenACC (with PGI's compiler) and trying to optimize the matrix multiplication example. The fastest implementation I have come up with so far is the following:

void matrix_mul(float *restrict r, float *a, float *b, int N, int accelerate)
{
   // Copy a and b to the device, copy r back to the host; offload only if 'accelerate' is set
   #pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
   {
      #pragma acc region if(accelerate)
      {
         // Each (i, j) pair computes one element of the column-major result matrix
         #pragma acc loop independent vector(32)
         for (int j = 0; j < N; j++)
         {
            #pragma acc loop independent vector(32)
            for (int i = 0; i < N; i++)
            {
               float sum = 0;
               for (int k = 0; k < N; k++) {
                  sum += a[i + k * N] * b[k + j * N];
               }
               r[i + j * N] = sum;
            }
         }
      }
   }
}

This results in 32x32 thread blocks and gives me the best performance so far. Here are the benchmarks:

Matrix multiplication:
GPU: GeForce GT 650M, 64-bit Linux

Data size             : 1500 x 1500
Unaccelerated:
     matrix_mul() time    : 5873.255333 msec
Accelerated:
     matrix_mul() time    : 420.414700 msec

Data size             : 1750 x 1750
    matrix_mul() time    : 876.271200 msec
Data size             : 2000 x 2000
    matrix_mul() time    : 1147.783400 msec
Data size             : 2250 x 2250
    matrix_mul() time    : 1863.458100 msec
Data size             : 2500 x 2500
    matrix_mul() time    : 2516.493200 msec

Unfortunately, I realized that the generated CUDA code is quite primitive (e.g., it does not even use shared memory) and hence cannot compete with a hand-optimized CUDA program. As a reference implementation I took the ArrayFire library, with the following results:

Arrayfire 1500 x 1500 matrix mul
CUDA toolkit 4.2, driver 295.59
GPU0 GeForce GT 650M, 2048 MB, Compute 3.0 (single,double)
Memory Usage: 1932 MB free (2048 MB total)
af:  0.03166 seconds

Arrayfire 1750 x 1750 matrix mul
 af:  0.05042 seconds
Arrayfire 2000 x 2000 matrix mul
 af:  0.07493 seconds
Arrayfire 2250 x 2250 matrix mul
 af:  0.10786 seconds
Arrayfire 2500 x 2500 matrix mul
 af:  0.14795 seconds

I wonder if there are any suggestions on how to get better performance from OpenACC? Perhaps my choice of directives is not right?

Deegan asked 3/8, 2012 at 8:13 Comment(2)
This problem illustrates the different approaches of compiler directives vs. CUDA/OpenCL. CUDA/OpenCL is much closer to the hardware, so you can optimize and tweak for a specific platform. You could unroll the inner loop, computing 2, 4, or 8 sums at a time, thus reducing the number of inner-loop iterations (a sketch of this idea follows after these comments). – Shiverick
Huh, good idea, thanks. Yes, I know CUDA/OpenCL can be considered "low-level" APIs; I myself am from the old CUDA school. On the other hand, OpenACC has more potential in the future because it is not limited to GPUs only, and of course it lowers development costs. Still, it would be nice if OpenACC compilers could exploit the GPU's shared memory for computations: I know there is the OpenACC 'cache' directive, but I could not manage to make it work. – Deegan
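
A minimal sketch of the unrolling idea from the comment above, applied to the original kernel (hypothetical and untested; it assumes N is a multiple of 4):

void matrix_mul_unroll4(float *restrict r, float *a, float *b, int N, int accelerate)
{
   #pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
   {
      #pragma acc region if(accelerate)
      {
         #pragma acc loop independent vector(32)
         for (int j = 0; j < N; j++)
         {
            #pragma acc loop independent vector(32)
            for (int i = 0; i < N; i++)
            {
               // Four independent partial sums expose more instruction-level parallelism
               float sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
               for (int k = 0; k < N; k += 4) {
                  sum0 += a[i + (k    ) * N] * b[(k    ) + j * N];
                  sum1 += a[i + (k + 1) * N] * b[(k + 1) + j * N];
                  sum2 += a[i + (k + 2) * N] * b[(k + 2) + j * N];
                  sum3 += a[i + (k + 3) * N] * b[(k + 3) + j * N];
               }
               r[i + j * N] = sum0 + sum1 + sum2 + sum3;
            }
         }
      }
   }
}

Whether this actually helps depends on how aggressively the compiler already unrolls the k loop; the independent accumulators mainly give it more room to overlap memory loads with arithmetic.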

You're getting right at a 14x speedup (5873 ms down to 420 ms), which is pretty good for PGI's compiler in my experience.

First off, are you compiling with -Minfo? That will give you a lot of feedback from the compiler regarding optimization choices.
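
For example, a typical PGI invocation might look something like this (the exact flags depend on your setup; matmul.c is just a placeholder file name):

pgcc -acc -ta=nvidia -Minfo=accel -o matmul matmul.c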

You are using a 32x32 thread block, but in my experience 16x16 thread blocks tend to get better performance. If you omit the vector(32) clauses, what scheduling does the compiler choose?
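
To try a 16x16 schedule you would just change the clause values, e.g. (a sketch; the inner k loop stays exactly as in your version):

#pragma acc loop independent vector(16)
for (int j = 0; j < N; j++)
{
   #pragma acc loop independent vector(16)
   for (int i = 0; i < N; i++)
   {
      /* inner k loop unchanged */
   }
}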

Declaring a and b with restrict might let the compiler generate better code.
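
For example, a sketch of the signature only (the body stays the same):

/* restrict asserts that r, a, and b never alias each other */
void matrix_mul(float *restrict r, float *restrict a, float *restrict b,
                int N, int accelerate);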

Just by looking at your code, I'm not sure that shared memory would help performance. Shared memory only helps improve performance if your code can store and reuse values there instead of going to global memory. In this case you're not reusing any part of a or b after reading it.

It's also worth noting that I've had bad experiences with PGI's compiler when it comes to shared memory usage. It will sometimes do funny stuff and cache the wrong values (this seems to happen mostly when a loop is iterated backward), generating wrong results. I actually have to compile my current application with the undocumented -ta=nvidia,nocache option, which bypasses shared memory usage altogether, to get it to work correctly.

Sible answered 13/8, 2012 at 11:54 Comment(1)
Yes, I tried the 16x16 case, but it actually runs slower. I assume this is precisely because no shared memory is used; hence, the more threads we get per block, the greater the effect of "caching" intermediate results in registers. Shared memory can indeed help performance here; have a look at the matrix multiplication example in the CUDA SDK. If I remove the vector(32) clauses, the compiler just vectorizes by rows of the matrix (not by 2D tiles) and the performance drops. Anyway, thanks for the good advice. – Deegan
