I am moving a program parallelized with OpenMP to a cluster. The cluster uses Lava 1.0 as its scheduler and has 8 cores per node. I used an MPI wrapper in the job script to run across multiple hosts.
Here is the job script:
#BSUB -q queue_name
#BSUB -x                      # exclusive use of the host
#BSUB -R "span[ptile=1]"      # at most one slot (MPI process) per host
#BSUB -n 1                    # total number of slots (MPI processes)
#BSUB -J n1p1o8
##BSUB -o outfile.email       # disabled
#BSUB -e err
export OMP_NUM_THREADS=8      # OpenMP threads per MPI process
date
/home/apps/bin/lava.openmpi.wrapper -bynode -x OMP_NUM_THREADS \
~/my_program ~/input.dat ~/output.out
date
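For reference, I submit it with the usual Lava/LSF submission command so the #BSUB directives are read from the script (the file name is just a placeholder):
bsub < jobscript.lsf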
I ran some experiments on ONE host exclusively. However, I can't explain some of the results.
1.
-n    OMP_NUM_THREADS    time
 1          4            21:12
 2          4            20:12
Does this mean MPI does not add any parallelism here? I thought that in the second case every MPI process would have 4 OpenMP threads, so the job should use 800% CPU, which should be faster than the first case.
More results pointing the same way:
-n    OMP_NUM_THREADS    time
 2          2            31:42
 4          2            30:47
They also have very close run times. (A sketch of how I would check the actual per-process thread usage on the node follows below.)
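To verify whether the OpenMP threads are actually being spawned, I could log in to the compute node while the job runs and check the thread count (NLWP) and CPU usage of each MPI process, for example with procps ps (assuming it is available on the node; my_program is the executable from the script above):
ps -C my_program -o pid,nlwp,pcpu,cmd    # one line per MPI process, NLWP = number of threads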
2.
In this case, if I want to parallelize this program on this cluster with reasonably good speed in a simple way, is it reasonable to put 1 MPI process on every host (i.e. tell LSF that I use one core per host), set OMP_NUM_THREADS=8, and run it exclusively? That way MPI would only handle the cross-node work and OpenMP would handle the work inside each node. (-n = number of hosts; ptile = 1; OMP_NUM_THREADS = max cores per host.) A sketch of the job script I have in mind is below.
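Something like this, assuming 4 hosts (the host count and job name are just examples; everything else is as in the script above):
#BSUB -q queue_name
#BSUB -x
#BSUB -R "span[ptile=1]"      # one MPI process per host
#BSUB -n 4                    # 4 MPI processes = 4 hosts
#BSUB -J n4p1o8
#BSUB -e err
export OMP_NUM_THREADS=8      # one OpenMP thread per core on each 8-core host
/home/apps/bin/lava.openmpi.wrapper -bynode -x OMP_NUM_THREADS \
~/my_program ~/input.dat ~/output.out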
UPDATE: The program is compiled with gfortran -fopenmp, without any MPI compiler wrapper (mpicc). MPI is only used to distribute copies of the executable.
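For reference, the build and a quick local check look roughly like this (the source file name is just a placeholder):
gfortran -fopenmp -O2 my_program.f90 -o my_program     # plain gfortran, no MPI wrapper
OMP_NUM_THREADS=4 ./my_program input.dat output.out    # quick local OpenMP run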
UPDATE (Mar. 3): Program memory usage, monitored locally
Local environment: OS X 10.8 / 2.9 GHz i7 / 8 GB memory
No OpenMP
- Real memory size: 8.4 MB
- Virtual memory size: 2.37 GB
- Shared Memory Size: 212 KB
- Private Memory Size: 7.8 MB
- Virtual Private Memory: 63.2 MB
With OpenMP (4 threads)
- Real memory size: 31.5 MB
- Virtual memory size: 2.52 GB
- Shared Memory Size: 212 KB
- Private Memory Size: 27.1 MB
- Virtual Private Memory: 210.2 MB
Brief cluster hardware info
Each host contains two quad-core chips (8 cores per node) and 8 GB of memory. The hosts in this cluster are connected by InfiniBand.