How to measure overall performance of parallel programs (with PAPI)

I have been asking myself what the best way is to measure the performance (in FLOPS) of a parallel program. I read about PAPI_flops. This seems to work fine for a serial program, but I don't know how I can measure the overall performance of a parallel program.

I would like to measure the performance of a BLAS/LAPACK function, in my example below gemm. But I also want to measure other functions, especially functions where the number of operations is not known. (In the case of gemm the ops are known (ops(gemm) = 2*n^3), so I could calculate the performance from the number of operations and the execution time.) The library (I am using Intel MKL) spawns the threads automatically, so I can't measure the performance of each thread individually and then reduce it.
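
For reference, that calculation would look something like this for gemm - a minimal sketch (the function name dgemm_mflops is mine), using omp_get_wtime for wall time; it only works when the operation count is known:

#include "mkl.h"
#include "omp.h"

/* Sketch: derive MFLOPS for a square dgemm from its known operation
   count (2*n^3) and the measured wall-clock time. */
double dgemm_mflops(int n, const double *A, const double *B, double *C)
{
  double t0 = omp_get_wtime();
  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
              n, n, n, 1.0, A, n, B, n, 0.0, C, n);
  double t1 = omp_get_wtime();
  return 2.0*n*n*n / (t1 - t0) / 1e6;   /* MFLOPS */
}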

This is my example:

#include <stdlib.h>                                                              
#include <stdio.h>                                                               
#include <string.h>                                                             
#include "mkl.h"
#include "omp.h"
#include "papi.h"       

int main(int argc, char *argv[] )                                                
{                                                                                
  int i, j, l, k, n, m, idx, iter;
  int mat, mat_min, mat_max;
  int threads;
  double *A, *B, *C;
  double alpha =1.0, beta=0.0;

  float rtime1, rtime2, ptime1, ptime2, mflops;
  long long flpops;

  #pragma omp parallel
  {
    #pragma omp master
    threads = omp_get_num_threads();
  }

  if(argc < 4){                                                                  
    printf("pass me 3 arguments!\n");                                            
    return( -1 );                                                                
  }                                                                              
  else                                                                           
  {                                                                            
    mat_min = atoi(argv[1]);
    mat_max = atoi(argv[2]);
    iter = atoi(argv[3]);                                                         
  }                    

  m = mat_max;  n = mat_max;  k = mat_max;

  printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);

  A = (double *) malloc( m*k * sizeof(double) );
  B = (double *) malloc( k*n * sizeof(double) );
  C = (double *) malloc( m*n * sizeof(double) );

  printf (" Intializing matrix data \n\n");
  for (i = 0; i < (m*k); i++)
    A[i] = (double)(i+1);
  for (i = 0; i < (k*n); i++)
    B[i] = (double)(-i-1);
  memset(C,0,m*n*sizeof(double));

  // actual measurement
  for(mat=mat_min;mat<=mat_max;mat+=5)
  {
    m = mat;  n = mat; k = mat;

    for( idx=-1; idx<iter; idx++ ){
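      /* idx starts at -1, so the first pass presumably acts as a warm-up;
         PAPI_flops returns values accumulated since the previous call,
         and its counters only cover the calling thread */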
      PAPI_flops( &rtime1, &ptime1, &flpops, &mflops );
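      /* note: passing k, n, n as leading dimensions is only valid here
         because m == n == k in this loop */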
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, 
                    m, n, k, alpha, A, k, B, n, beta, C, n);
      PAPI_flops( &rtime2, &ptime2, &flpops, &mflops );
    }

    printf("%d threads: %d in %f sec, %f MFLOPS\n",threads,mat,rtime2-rtime1,mflops);fflush(stdout);
  }

  printf("Done\n");fflush(stdout);

  free(A);
  free(B);
  free(C);

  return 0;
}

This is one output (for matrix size 200):

1 threads: 200 in 0.001459 sec, 5570.258789 MFLOPS
2 threads: 200 in 0.000785 sec, 5254.993652 MFLOPS
4 threads: 200 in 0.000423 sec, 4919.640137 MFLOPS
8 threads: 200 in 0.000264 sec, 3894.036865 MFLOPS

We can see from the execution time that the function gemm scales. But the FLOPS I am measuring are only the performance of thread 0.

My question is: How can I measure the overall performance? I am grateful for any input.

Waterside answered 29/7, 2015 at 13:21 Comment(3)
Umm.. Measure flops for each thread and then add them together? – Middlebrooks
How can I do this? The BLAS library creates the threads, so the parallel region is inside the dgemm call and I don't have access to the individual threads. Of course I could recompile the BLAS library and measure the performance of each thread inside the parallel region (not possible in the case of MKL; okay, I could switch to OpenBLAS). But this is what I want to avoid. – Waterside
Could you show the number of flops? Maybe mflops is an average across all threads? – Collinsworth

First, I'm just curious - why do you need the FLOPS? Don't you just care how much time is taken, or maybe the time taken compared to other BLAS libraries?

PAPI is thread-based, so it's not much help on its own here.

What I would do is measure around the function call and see how the time changes with the number of threads it spawns. It should not spawn more threads than physical cores (hyper-threading is no good here). Then, if the matrix is big enough and the machine is not loaded, the time should simply divide by the number of threads. E.g., 10 seconds over 4 cores should become 2.5 seconds.
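
For example, such a timing-only check could look like this (a sketch; mkl_set_num_threads caps MKL's internal thread pool, and omp_get_wtime gives the wall clock):

#include <stdio.h>
#include "mkl.h"
#include "omp.h"

/* Sketch: time the same dgemm under growing MKL thread counts; with a
   big enough matrix the wall time should divide roughly by the count. */
void scaling_check(int n, const double *A, const double *B, double *C)
{
  int nt;
  for (nt = 1; nt <= 8; nt *= 2) {
    mkl_set_num_threads(nt);          /* cap MKL's thread pool */
    double t0 = omp_get_wtime();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t1 = omp_get_wtime();
    printf("%d threads: %f sec\n", nt, t1 - t0);
  }
}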

Other than that, there are two things you can do to really measure it:
1. Use whatever you use now, but inject your start/end measurement code around the BLAS code. One way to do that (in Linux) is by pre-loading a lib that defines pthread_create, with your own functions that call the originals but do some extra measurements (see the sketch after this list). Another way is to override the function pointer when the process is already running (= trampoline). In Linux it's in the GOT/PLT; in Windows it's more complicated - look for a library.
2. Use oprofile, or some other profiler, to report the number of instructions executed in the time you care about. Or better yet, report the number of floating-point instructions executed. A little problem with this is that SSE instructions multiply or add two or more doubles at a time, so you'd have to account for that. I guess you can assume they always use the maximum possible number of operands.
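
For variant 1, here is a minimal sketch of the pre-load library (assuming Linux/glibc; wrapped_start and wrap_arg are my placeholder names, and the PAPI calls are left as comments):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdlib.h>

/* Interpose pthread_create so every new thread runs measurement hooks
   around its real start routine. */
typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

struct wrap_arg {
  void *(*fn)(void *);   /* the thread's real start routine */
  void *arg;             /* its original argument */
};

static void *wrapped_start(void *p)
{
  struct wrap_arg w = *(struct wrap_arg *)p;
  free(p);
  /* start per-thread counters here, e.g. PAPI_flops(...) */
  void *ret = w.fn(w.arg);
  /* stop counters and record this thread's result here */
  return ret;
}

int pthread_create(pthread_t *t, const pthread_attr_t *attr,
                   void *(*fn)(void *), void *arg)
{
  static create_fn real_create;
  if (!real_create)   /* look up the libc version on first use */
    real_create = (create_fn) dlsym(RTLD_NEXT, "pthread_create");
  struct wrap_arg *w = malloc(sizeof *w);
  w->fn = fn;
  w->arg = arg;
  return real_create(t, attr, wrapped_start, w);
}

Build it with gcc -shared -fPIC shim.c -o shim.so -ldl and run the unmodified program with LD_PRELOAD=./shim.so; every thread the BLAS spawns then passes through the wrapper before its real start routine runs.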

Peggie answered 3/8, 2015 at 14:46 Comment(4)
First of all: thank you for your answer! Why do I want to measure performance and not just execution time? I am actually interested in analyzing LAPACK's dense eigensolver, which calls three functions: 1) reduction to tridiagonal form, 2) the tridiagonal eigensolver, 3) the back-transformation. To identify the bottlenecks of the dense eigensolver it is necessary to measure both time and performance. If I only have the execution time, I could e.g. see that I spend most of the time in the reduction, but I wouldn't know whether I am using the resources efficiently, so I cannot be sure that this is the bottleneck. – Waterside
You have suggested two variants for this problem. I like the first one. Overwriting pthread_create (and also pthread_join) seems to be the only way to make this work with PAPI. Overwriting the pointer at runtime makes sense in my case (I have a lot of correctness checking in my code, and I do not want to measure that part as well). – Waterside
I understand the theory, but I am not sure how to implement it. I would have to overwrite the function pointer to pthread_create; inside my version I would create the thread with the original pthread_create and then start the measuring. I am not really sure how to resolve the conflict between the overwritten pointer and the original one. My idea for this is macros - is that the best way? In general: do you have an example, or a recommended reading for learning more about this? Thanks! – Waterside
I'll post an example when I'm in front of a desktop computer - in two weeks. – Peggie
