How to measure overall performance of parallel programs (with PAPI)

I have been asking myself what the best way is to measure the performance (in FLOPS) of a parallel program. I read about PAPI_flops. This seems to work fine for a serial program, but I don't know how I can measure the overall performance of a parallel program.

I would like to measure the performance of a BLAS/LAPACK function, in my example below gemm. But I also want to measure other functions, especially functions where the number of operations is not known. (In the case of gemm the ops are known (ops(gemm) = 2*n^3), so I could calculate the performance from the number of operations and the execution time.) The library (I am using Intel MKL) spawns the threads automatically, so I can't measure the performance of each thread individually and then reduce it.
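
For reference, that calculation would look something like this for gemm - a minimal sketch (the function name dgemm_mflops is mine), using omp_get_wtime for wall time; it only works when the operation count is known:

#include "mkl.h"
#include "omp.h"

/* Sketch: derive MFLOPS for a square dgemm from its known operation
   count (2*n^3) and the measured wall-clock time. */
double dgemm_mflops(int n, const double *A, const double *B, double *C)
{
  double t0 = omp_get_wtime();
  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
              n, n, n, 1.0, A, n, B, n, 0.0, C, n);
  double t1 = omp_get_wtime();
  return 2.0*n*n*n / (t1 - t0) / 1e6;   /* MFLOPS */
}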

This is my example:

#include <stdlib.h>                                                              
#include <stdio.h>                                                               
#include <string.h>                                                             
#include "mkl.h"
#include "omp.h"
#include "papi.h"       

int main(int argc, char *argv[] )                                                
{                                                                                
  int i, j, l, k, n, m, idx, iter;
  int mat, mat_min, mat_max;
  int threads;
  double *A, *B, *C;
  double alpha =1.0, beta=0.0;

  float rtime1, rtime2, ptime1, ptime2, mflops;
  long long flpops;

  #pragma omp parallel
  {
    #pragma omp master
    threads = omp_get_num_threads();
  }

  if(argc < 4){                                                                  
    printf("pass me 3 arguments!\n");                                            
    return( -1 );                                                                
  }                                                                              
  else                                                                           
  {                                                                            
    mat_min = atoi(argv[1]);
    mat_max = atoi(argv[2]);
    iter = atoi(argv[3]);                                                         
  }                    

  m = mat_max;  n = mat_max;  k = mat_max;

  printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);

  A = (double *) malloc( m*k * sizeof(double) );
  B = (double *) malloc( k*n * sizeof(double) );
  C = (double *) malloc( m*n * sizeof(double) );

  printf (" Intializing matrix data \n\n");
  for (i = 0; i < (m*k); i++)
    A[i] = (double)(i+1);
  for (i = 0; i < (k*n); i++)
    B[i] = (double)(-i-1);
  memset(C,0,m*n*sizeof(double));

  // actual measurement
  for(mat=mat_min;mat<=mat_max;mat+=5)
  {
    m = mat;  n = mat; k = mat;

    for( idx=-1; idx<iter; idx++ ){
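      /* idx starts at -1, so the first pass presumably acts as a warm-up;
         PAPI_flops returns values accumulated since the previous call,
         and its counters only cover the calling thread */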
      PAPI_flops( &rtime1, &ptime1, &flpops, &mflops );
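      /* note: passing k, n, n as leading dimensions is only valid here
         because m == n == k in this loop */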
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, 
                    m, n, k, alpha, A, k, B, n, beta, C, n);
      PAPI_flops( &rtime2, &ptime2, &flpops, &mflops );
    }

    printf("%d threads: %d in %f sec, %f MFLOPS\n",threads,mat,rtime2-rtime1,mflops);fflush(stdout);
  }

  printf("Done\n");fflush(stdout);

  free(A);
  free(B);
  free(C);

  return 0;
}

This is one output (for matrix size 200):

1 threads: 200 in 0.001459 sec, 5570.258789 MFLOPS
2 threads: 200 in 0.000785 sec, 5254.993652 MFLOPS
4 threads: 200 in 0.000423 sec, 4919.640137 MFLOPS
8 threads: 200 in 0.000264 sec, 3894.036865 MFLOPS

We can see from the execution time that the function gemm scales. But the FLOPS I am measuring are only the performance of thread 0.

My question is: How can I measure the overall performance? I am grateful for any input.

Waterside answered 29/7, 2015 at 13:21 Comment(3)
Umm.. Measure flops for each thread and then add them together? – Middlebrooks
How can I do this? The BLAS library creates the threads, so the parallel region is inside the dgemm call and I don't have access to the individual threads. Of course I could recompile the BLAS library and measure the performance of each thread inside the parallel region (not possible in the case of MKL; okay, I could switch to OpenBLAS). But this is what I want to avoid. – Waterside
Could you show the number of flops? Maybe mflops is an average across all threads? – Collinsworth

First, I'm just curious - why do you need the FLOPS? Don't you just care how much time is taken, or maybe the time taken compared to other BLAS libraries?

PAPI is thread-based, so it's not much help on its own here.

What I would do is measure around the function call and see how the time changes with the number of threads it spawns. It should not spawn more threads than physical cores (hyper-threading is no good here). Then, if the matrix is big enough and the machine is not loaded, the time should simply divide by the number of threads. E.g., 10 seconds over 4 cores should become 2.5 seconds.
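
For example, such a timing-only check could look like this (a sketch; mkl_set_num_threads caps MKL's internal thread pool, and omp_get_wtime gives the wall clock):

#include <stdio.h>
#include "mkl.h"
#include "omp.h"

/* Sketch: time the same dgemm under growing MKL thread counts; with a
   big enough matrix the wall time should divide roughly by the count. */
void scaling_check(int n, const double *A, const double *B, double *C)
{
  int nt;
  for (nt = 1; nt <= 8; nt *= 2) {
    mkl_set_num_threads(nt);          /* cap MKL's thread pool */
    double t0 = omp_get_wtime();
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t1 = omp_get_wtime();
    printf("%d threads: %f sec\n", nt, t1 - t0);
  }
}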

Other than that, there are two things you can do to really measure it:
1. Use whatever you use now, but inject your start/end measurement code around the BLAS code. One way to do that (in Linux) is by pre-loading a lib that defines pthread_create, with your own functions that call the originals but do some extra measurements (see the sketch after this list). Another way is to override the function pointer when the process is already running (= trampoline). In Linux it's in the GOT/PLT; in Windows it's more complicated - look for a library.
2. Use oprofile, or some other profiler, to report the number of instructions executed in the time you care about. Or better yet, report the number of floating-point instructions executed. A little problem with this is that SSE instructions multiply or add two or more doubles at a time, so you'd have to account for that. I guess you can assume they always use the maximum possible number of operands.
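
For variant 1, here is a minimal sketch of the pre-load library (assuming Linux/glibc; wrapped_start and wrap_arg are my placeholder names, and the PAPI calls are left as comments):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdlib.h>

/* Interpose pthread_create so every new thread runs measurement hooks
   around its real start routine. */
typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

struct wrap_arg {
  void *(*fn)(void *);   /* the thread's real start routine */
  void *arg;             /* its original argument */
};

static void *wrapped_start(void *p)
{
  struct wrap_arg w = *(struct wrap_arg *)p;
  free(p);
  /* start per-thread counters here, e.g. PAPI_flops(...) */
  void *ret = w.fn(w.arg);
  /* stop counters and record this thread's result here */
  return ret;
}

int pthread_create(pthread_t *t, const pthread_attr_t *attr,
                   void *(*fn)(void *), void *arg)
{
  static create_fn real_create;
  if (!real_create)   /* look up the libc version on first use */
    real_create = (create_fn) dlsym(RTLD_NEXT, "pthread_create");
  struct wrap_arg *w = malloc(sizeof *w);
  w->fn = fn;
  w->arg = arg;
  return real_create(t, attr, wrapped_start, w);
}

Build it with gcc -shared -fPIC shim.c -o shim.so -ldl and run the unmodified program with LD_PRELOAD=./shim.so; every thread the BLAS spawns then passes through the wrapper before its real start routine runs.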

Peggie answered 3/8, 2015 at 14:46 Comment(4)
First of all: thank you for your answer! Why do I want to measure performance and not just execution time? I am actually interested in analyzing LAPACK's dense eigensolver, which calls three functions: 1) reduction to tridiagonal form, 2) the tridiagonal eigensolver, 3) the back-transformation. To identify the bottlenecks of the dense eigensolver it is necessary to measure both time and performance. If I only have the execution time, I could e.g. see that I spend most of the time in the reduction, but I wouldn't know whether I am using the resources efficiently, so I cannot be sure that this is the bottleneck. – Waterside
You have suggested two variants for this problem. I like the first one. Overwriting pthread_create (and also pthread_join) seems to be the only way to make this work with PAPI. Overwriting the pointer at runtime makes sense in my case (I have a lot of correctness checking in my code, and I do not want to measure that part as well). – Waterside
I understand the theory, but I am not sure how to implement it. I would have to overwrite the function pointer to pthread_create; inside my version I would create the thread with the original pthread_create and then start the measuring. I am not really sure how to resolve the conflict between the overwritten pointer and the original one. My idea for this is macros - is that the best way? In general: do you have an example, or a recommended reading for learning more about this? Thanks! – Waterside
I'll post an example when I'm in front of a desktop computer - in two weeks. – Peggie
