Infinite wait during an Open MPI run on a cluster of servers?

I have successfully set up passwordless SSH between the servers and my computer. There is a simple Open MPI program which runs well on a single computer. Unfortunately, when I try it on the cluster, I neither get a password prompt (as I have set up SSH authorization) nor does the execution move forward.

The hostfile looks like this:

# The Hostfile for Open MPI

# The master node, 'slots=8' is used because it has 8 cores
  localhost slots=8
# The following slave nodes are single processor machines:
  gautam@pcys13.grm.polymtl.ca slots=8
  gautam@srvgrm04 slots=160
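
To double-check the passwordless setup, I SSH to each host from the hostfile non-interactively (BatchMode makes ssh fail instead of prompting, so a hang or error here would point at the SSH setup):

ssh -o BatchMode=yes gautam@pcys13.grm.polymtl.ca hostname
ssh -o BatchMode=yes gautam@srvgrm04 hostname

Both commands print the remote hostname without asking for a password.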

I am running a hello world MPI program on the cluster:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);  /* total number of processes */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* rank of this process */
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
  MPI_Finalize();
  return 0;
}
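
For reference, I compile it with the Open MPI wrapper compiler (assuming the source file is named hello.c):

mpicc -o hello hello.c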

and I am running it like this: mpirun -np 16 --hostfile hostfile ./hello

When using the -d option, the log looks like this:

[gautam@pcys33:~/LTE/check ]% mpirun -np 16 --hostfile hostfile -d ./hello
[pcys33.grm.polymtl.ca:02686] procdir: /tmp/openmpi-sessions-gautam@pcys33.grm.polymtl.ca_0/60067/0/0
[pcys33.grm.polymtl.ca:02686] jobdir: /tmp/openmpi-sessions-gautam@pcys33.grm.polymtl.ca_0/60067/0
[pcys33.grm.polymtl.ca:02686] top: openmpi-sessions-gautam@pcys33.grm.polymtl.ca_0
[pcys33.grm.polymtl.ca:02686] tmp: /tmp
[srvgrm04:77812] procdir: /tmp/openmpi-sessions-gautam@srvgrm04_0/60067/0/1
[srvgrm04:77812] jobdir: /tmp/openmpi-sessions-gautam@srvgrm04_0/60067/0
[srvgrm04:77812] top: openmpi-sessions-gautam@srvgrm04_0
[srvgrm04:77812] tmp: /tmp
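
For what it's worth, I can gather more daemon-level detail with mpirun's debugging options (assuming my Open MPI build supports them):

mpirun -np 16 --hostfile hostfile --debug-daemons --mca plm_base_verbose 10 ./hello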

Can you make an inference from the logs?

Crosshead answered 11/7, 2013 at 22:22 Comment(6)
Maybe try passing -d to mpirun to get some idea of what's happening.Beacon
I edited the question to include the log from when I tried the -d option with the run!Crosshead
Are you sure that hello exists on all nodes and is located in the same filesystem path? Apparently the ORTE daemon is launching successfully on the second node, although the absence of pcys13.grm.polymtl.ca in the log could indicate a problem connecting to it (or is it an alias for srvgrm04?). BTW, you don't have to specify the usernames in the hostfile if they are the same as the one on the master host.Waisted
Since every node has the same file system with the same authentication, I think hello will exist on all of them. I have passwordless SSH enabled and can access the other computers via SSH. I have also tried a hostfile that does not include the username for the corresponding node.Crosshead
Am I supposed to change anything in the code for it to run on a cluster of servers? I used 32 processes on a single server and it works well. Or is there anything to be specified for load balancing between the nodes? Please help.Crosshead
I have reached some conclusions regarding the problem. Can you please have a look at that? #17820945Crosshead

An MPI job needs to open TCP connections between all the nodes, so a firewall that blocks those connections will make the launch hang. You just need to disable the firewall on each machine.
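
As a minimal sketch, assuming RHEL/CentOS-style nodes of that era (adapt the commands to your distribution), you could stop the firewall on every node and retest:

# run as root on each node; the iptables service name is an assumption for RHEL/CentOS
service iptables stop      # stop the firewall for the current session
chkconfig iptables off     # optional: keep it disabled across reboots

If disabling the firewall entirely is not acceptable, the alternative is to leave it running and open the TCP ports that the Open MPI daemons use between the cluster nodes.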

Crosscrosslet answered 17/12, 2013 at 5:40 Comment(1)
This should be a comment.Nutritionist
