How to load balance a simple loop using MPI in C++

I am writing some code which is computationally expensive but highly parallelisable. Once parallelised, I intend to run it on an HPC cluster; to keep the runtime down to within a week, the problem needs to scale well with the number of processors.

Below is a simple and contrived example of what I am attempting to achieve, which is concise enough to compile and demonstrate my problem:

#include <iostream>
#include <ctime>
#include "mpi.h"

using namespace std;

double int_theta(double E){
    double result = 0;
    for (int k = 0; k < 20000; k++)
        result += E*k;
    return result;
}

int main() 
{
    int n = 3500000;
    int counter = 0;
    time_t timer;
    int start_time = time(&timer);
    int myid, numprocs;
    int k;
    double integrate, result;
    double end = 0.5;
    double start = -2.;
    double E;
    double factor = (end - start)/(n*1.);
    integrate = 0;
    MPI_Init(NULL,NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    // cyclic (round-robin) distribution: rank myid handles samples
    // k = myid, myid+numprocs, myid+2*numprocs, ...
    for (k = myid; k<n+1; k+=numprocs){
        E = start + k*(end-start)/n;
        if (( k == 0 ) || (k == n))
            integrate += 0.5*factor*int_theta(E);  // end points get half weight (trapezoidal rule)
        else
            integrate += factor*int_theta(E);
        counter++;
    }
    cout<<"process "<<myid<<" took "<<time(&timer)-start_time<<"s"<<endl;
    cout<<"process "<<myid<<" performed "<<counter<<" computations"<<endl;
    MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        cout<<result<<endl;
    MPI_Finalize();
    return 0;
}

I have compiled the code on my quad-core laptop with

mpiicc test.cpp -std=c++14 -O3 -DMKL_LP64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl

and I get the following output:

$ mpirun -np 4 ./a.out
process 3 took 14s
process 3 performed 875000 computations
process 1 took 15s
process 1 performed 875000 computations
process 2 took 16s
process 2 performed 875000 computations
process 0 took 16s
process 0 performed 875001 computations
-3.74981e+08

$ mpirun -np 3 ./a.out 
process 2 took 11s
process 2 performed 1166667 computations
process 1 took 20s
process 1 performed 1166667 computations
process 0 took 20s
process 0 performed 1166667 computations
-3.74981e+08

$ mpirun -np 2 ./a.out 
process 0 took 16s
process 0 performed 1750001 computations
process 1 took 16s
process 1 performed 1750000 computations
-3.74981e+08

To me it appears that there must be a barrier somewhere that I am not aware of, since I get better performance with 2 processes than with 3. Can somebody please offer some advice? Thanks

Garrulity answered 7/4, 2019 at 19:45
In the code I do not see any obvious issue; it looks embarrassingly parallel. Does your laptop really have four hardware cores (instead of four logical cores, i.e. two hardware cores with hyperthreading)? Which operating system are you using? Are the four individual MPI processes pinned to their individual cores, or might they be moving around? Is there any other work going on in the background? – Borges
@Borges thanks for the quick response. To answer your questions: lscpu tells me CPU(s): 4; On-line CPU(s) list: 0-3; Thread(s) per core: 2; Core(s) per socket: 2; Socket(s): 1. I am using 64-bit Linux Mint. I believe the MPI processes are pinned, as I have not changed the default behaviour (which is pinned). There is no other work going on other than the usual system stuff. I thought something was odd, as the runtime for 2 processes was equal to that of 4 and 3 performed the worst, but your response suggests that's due to the hardware I'm using? – Garrulity

If I read the output of lscpu you gave correctly (e.g. with the help of https://unix.stackexchange.com/a/218081), you have 4 logical CPUs but only 2 hardware cores (1 socket x 2 cores per socket). Using cat /proc/cpuinfo you can find the make and model of the CPU to maybe find out more.

The four logical CPUs most likely result from hyperthreading, which means that some hardware resources (e.g. the floating-point units, but I am not an expert on this) are shared between the two hardware threads of each core. Thus, I would not expect any good parallel scaling beyond two processes.
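
If you want to check where the ranks actually end up, a minimal sketch along these lines (assuming Linux with glibc; sched_getcpu() is not part of MPI) lets each rank report the logical CPU it is currently running on:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE        // for sched_getcpu() in <sched.h> (glibc-specific)
#endif
#include <sched.h>
#include <iostream>
#include "mpi.h"

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int myid, numprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    // sched_getcpu() returns the logical CPU the calling thread is running on right now
    std::cout << "rank " << myid << " of " << numprocs
              << " on " << host
              << ", logical CPU " << sched_getcpu() << std::endl;

    MPI_Finalize();
    return 0;
}

Which logical CPUs are hyperthread siblings of the same core can be checked with lscpu -e; if two ranks land on siblings, they compete for the same core.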

For scalability tests, you should try to get your hands on a machine with maybe 6 or more hardware cores to get a better estimate.

From looking at your code, I would expect perfect scalability to any number of cores, at least as long as you do not include the time needed for process startup and the final MPI_Reduce. Those will certainly become slower as more processes are involved.
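
To measure only the compute phase, a minimal sketch of your loop (same cyclic work distribution, but timed with MPI_Wtime() after a barrier, so process startup and the final reduction are excluded) could look like this:

#include <iostream>
#include "mpi.h"

// stand-in for the expensive per-sample work from your code
double int_theta(double E){
    double result = 0;
    for (int k = 0; k < 20000; k++)
        result += E*k;
    return result;
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int myid, numprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    const int n = 3500000;
    const double start = -2., end = 0.5;
    const double factor = (end - start)/n;

    MPI_Barrier(MPI_COMM_WORLD);          // line all ranks up before timing
    double t0 = MPI_Wtime();

    double integrate = 0;
    for (int k = myid; k < n + 1; k += numprocs){
        double E = start + k*(end - start)/n;
        double w = (k == 0 || k == n) ? 0.5 : 1.0;   // trapezoidal end-point weight
        integrate += w*factor*int_theta(E);
    }

    double t1 = MPI_Wtime();              // per-rank compute time only
    std::cout << "rank " << myid << " compute time: " << (t1 - t0) << " s" << std::endl;

    double result = 0;
    MPI_Reduce(&integrate, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        std::cout << result << std::endl;

    MPI_Finalize();
    return 0;
}

With sub-second per-rank times you can also see any load imbalance directly, instead of relying on the 1 s granularity of time().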

Borges answered 8/4, 2019 at 13:17
Many thanks for your answer. I have since run the above code on a cluster and have found that I get good scalability on one node (with 32 processes). For some reason the code appears to hang when multiple nodes are used, but I have got in touch with the administrator in the hope that this is down to the code needing to be compiled in a certain way, or submitted to the bsub queue in a particular way. Presumably it's not the way I've written the code? – Garrulity
Yes, that might be a machine-specific issue. The reasons are difficult to guess, but your admin can surely help you with that. – Borges
