I am writing code that is computationally expensive but highly parallelisable. Once it is parallelised, I intend to run it on an HPC system; however, to keep the runtime within a week, the problem needs to scale well with the number of processors.
Below is a simple and ludicrous example of what I am trying to achieve, concise enough to compile and to demonstrate my problem:
#include <iostream>
#include <ctime>
#include "mpi.h"
using namespace std;

// Dummy workload: a fixed amount of floating-point work per call.
double int_theta(double E){
    double result = 0;
    for (int k = 0; k < 20000; k++)
        result += E*k;
    return result;
}

int main()
{
    int n = 3500000;
    int counter = 0;
    time_t timer;
    int start_time = time(&timer);   // whole seconds only
    int myid, numprocs;
    int k;
    double integrate, result;
    double end = 0.5;
    double start = -2.;
    double E;
    double factor = (end - start)/(n*1.);
    integrate = 0;

    MPI_Init(NULL,NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    // Trapezoidal rule over [start, end]; points are dealt out cyclically,
    // so rank myid handles k = myid, myid + numprocs, ...
    for (k = myid; k<n+1; k+=numprocs){
        E = start + k*(end-start)/n;
        if (( k == 0 ) || (k == n))
            integrate += 0.5*factor*int_theta(E);   // end points get half weight
        else
            integrate += factor*int_theta(E);
        counter++;
    }

    cout<<"process "<<myid<<" took "<<time(&timer)-start_time<<"s"<<endl;
    cout<<"process "<<myid<<" performed "<<counter<<" computations"<<endl;

    // Sum the partial integrals onto rank 0.
    MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        cout<<result<<endl;

    MPI_Finalize();
    return 0;
}
I compiled the program on my quad-core laptop with
mpiicc test.cpp -std=c++14 -O3 -DMKL_LP64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl
and I get the following output:
$ mpirun -np 4 ./a.out
process 3 took 14s
process 3 performed 875000 computations
process 1 took 15s
process 1 performed 875000 computations
process 2 took 16s
process 2 performed 875000 computations
process 0 took 16s
process 0 performed 875001 computations
-3.74981e+08
$ mpirun -np 3 ./a.out
process 2 took 11s
process 2 performed 1166667 computations
process 1 took 20s
process 1 performed 1166667 computations
process 0 took 20s
process 0 performed 1166667 computations
-3.74981e+08
$ mpirun -np 2 ./a.out
process 0 took 16s
process 0 performed 1750001 computations
process 1 took 16s
process 1 performed 1750000 computations
-3.74981e+08
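As a sanity check on the load balance, the per-process counts printed above can be reproduced from the loop bounds alone. This is a minimal sketch without MPI; `iterations_for_rank` and `iterations_per_rank` are hypothetical helper names, not part of the program above.

```cpp
#include <vector>

// Number of iterations that rank `myid` executes in
//   for (k = myid; k < n + 1; k += numprocs)
// i.e. how many of the n + 1 points k = 0..n land on that rank.
long iterations_for_rank(long n, int numprocs, int myid) {
    long points = n + 1;                       // k runs over 0..n inclusive
    if (myid >= points) return 0;              // more ranks than points
    return (points - myid + numprocs - 1) / numprocs;  // ceiling division
}

// Counts for all ranks 0..numprocs-1.
std::vector<long> iterations_per_rank(long n, int numprocs) {
    std::vector<long> counts;
    for (int r = 0; r < numprocs; ++r)
        counts.push_back(iterations_for_rank(n, numprocs, r));
    return counts;
}
```

For `n = 3500000` this gives 875001, 875000, 875000, 875000 on 4 processes, 1166667 each on 3, and 1750001, 1750000 on 2, matching the output, so the work itself is split evenly in every run.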
It appears to me that there must be a barrier somewhere that I am not aware of: the run with 2 processes finishes sooner than the run with 3, and no later than the run with 4. Can anybody offer any advice? Thanks
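One caveat about the numbers themselves: `time()` returns whole seconds, so each reported duration can be off by up to a second. Below is a minimal sketch of sub-second timing with `std::chrono::steady_clock`, assuming the same `int_theta` workload; `time_int_theta` is a hypothetical helper, and inside the MPI program `MPI_Wtime()` would serve the same purpose.

```cpp
#include <chrono>

// Same dummy workload as in the program above.
double int_theta(double E) {
    double result = 0;
    for (int k = 0; k < 20000; k++)
        result += E * k;
    return result;
}

// Hypothetical helper: wall-clock seconds (as a double) taken by `calls`
// evaluations of int_theta. The accumulated value is written to `sink`
// so that the compiler cannot discard the work.
double time_int_theta(int calls, double& sink) {
    auto t0 = std::chrono::steady_clock::now();
    double acc = 0;
    for (int i = 0; i < calls; ++i)
        acc += int_theta(-2.0 + i * 1e-6);
    auto t1 = std::chrono::steady_clock::now();
    sink = acc;
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling `time_int_theta(1000, sink)` and dividing the result by 1000 gives a rough cost per evaluation that can be compared between runs.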
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
I am using 64-bit Linux Mint. I believe the MPI processes are pinned, as I have not changed the default behaviour (which is pinned). Nothing else is running apart from the usual system processes. I thought something was odd because the runtime for 2 processes was equal to that for 4, and 3 performed the worst, but your response suggests that this is due to the hardware I'm using? – Garrulity
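One way to test the hardware explanation against the posted numbers is to compare per-process throughput, i.e. computations divided by seconds. This is only a rough sketch: it uses the counts and times from the output above, which have one-second resolution, and `throughput` is a hypothetical helper.

```cpp
// Evaluations of int_theta completed per second by one process.
double throughput(long computations, double seconds) {
    return static_cast<double>(computations) / seconds;
}

// How much faster process A ran than process B (ratio of throughputs).
double relative_speed(long comp_a, double sec_a, long comp_b, double sec_b) {
    return throughput(comp_a, sec_a) / throughput(comp_b, sec_b);
}
```

With 2 processes each one manages about 109,000 computations per second (1750000 / 16). With 3, process 2 reaches about 106,000 per second (1166667 / 11) while processes 0 and 1 reach about 58,000 (1166667 / 20). With 4, every process lands between roughly 55,000 and 62,500. That pattern is what one would expect if the machine has 2 physical cores with 2 hardware threads each, as the CPU listing suggests, and processes sharing a core each run at about half speed. It is an interpretation of coarse timings, not a measurement.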