Unable to execute MPICH2 on multiple machines on Ubuntu 12.04 (HYDU_sock_connect issue)

I am having difficulty running an MPI program on two machines. The OS is Ubuntu 12.04 and the MPI implementation is MPICH2.

ssh is working fine:

root@ubuntu:/home# ssh 192.168.1.9
root@gpuguy's password: 
Welcome to Ubuntu 12.04.3 LTS (GNU/Linux 3.8.0-29-generic i686)

 * Documentation:  https://help.ubuntu.com/

131 packages can be updated.
67 updates are security updates.

Last login: Thu Oct 24 17:36:25 2013 from ubuntu.local
root@gpuguy:~# 

But when I run my MPI program, it fails:

root@ubuntu:/home# mpiexec -f hosts.cfg -n 4 hello
[email protected]'s password:
[proxy:0:0@gpuguy] HYDU_sock_connect (./utils/sock/sock.c:171): unable to get host address for ubuntu (1)
[proxy:0:0@gpuguy] main (./pm/pmiserv/pmip.c:209): unable to connect to server ubuntu at port 42104 (check for firewalls!)

I have already disabled the firewall on both machines, which is why SSH connects successfully. How can I solve this issue?

My MPI code runs successfully on a single machine.

Rozele answered 24/10, 2013 at 12:20 Comment(0)

To launch MPICH (or most other MPI implementations) across machines over SSH, you need passwordless SSH set up between them, so that mpiexec is not stopped by a password prompt like the one in your output. Also, you really shouldn't need to be logged in as root to make this work. Staying logged in as root all of the time is generally a very bad idea.
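
One way to confirm that passwordless SSH is actually in place is to run ssh in batch mode, where it fails instead of asking for a password. The sketch below is only an illustration: the function names are hypothetical, and it assumes the OpenSSH client is installed and that the remote host key has already been accepted.

```python
import subprocess


def batch_ssh_command(host, user=None):
    """Build an ssh command line that fails instead of prompting for a password."""
    target = f"{user}@{host}" if user else host
    return [
        "ssh",
        "-o", "BatchMode=yes",     # never prompt; fail if a password would be needed
        "-o", "ConnectTimeout=5",  # give up quickly on unreachable hosts
        target,
        "true",                    # trivial remote command that just exits with 0
    ]


def passwordless_ssh_works(host, user=None):
    """Return True if the remote command ran without any interactive prompt."""
    result = subprocess.run(
        batch_ssh_command(host, user),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling passwordless_ssh_works("192.168.1.9") from the machine that runs mpiexec (using the address from the question) should return True once key-based login is set up. If it returns False, ssh would still have needed a password or could not connect.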

Farrar answered 24/10, 2013 at 14:22 Comment(2)
I have set up passwordless SSH, but when I run the mpirun command I get this error message: "[proxy:0:0@gauss-mic0] HYDU_sock_connect (./utils/sock/sock.c:264): unable to connect from "gauss-mic0" to "127.0.1.1" (Connection refused) [proxy:0:0@gauss-mic0] main (./pm/pmiserv/pmip.c:396): unable to connect to server 127.0.1.1 at port 42947 (check for firewalls!)" - Staggard
If you have another question, you'll need to post it separately rather than trying to do everything through the comments. - Farrar

In the /etc/hosts file, add the IP address and hostname of each server. Do this on all the servers, so that every machine can resolve the names of the others.

For example:

10.10.0.5    server1
10.10.0.6    server2
10.10.0.7    server3
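
To check whether a machine can already resolve the names it needs, something like the following can be run on each server. This is a minimal sketch, not part of MPICH: the function name unresolved_hosts is hypothetical, and the resolve parameter exists only so the lookup can be swapped out.

```python
import socket


def unresolved_hosts(names, resolve=socket.gethostbyname):
    """Return the names that the resolver cannot map to an address."""
    missing = []
    for name in names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing
```

With the hostnames from the question, unresolved_hosts(["ubuntu", "gpuguy"]) run on gpuguy would be expected to list "ubuntu" while the error persists, since the proxy on gpuguy reports that it is unable to get a host address for ubuntu.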

In the /etc/hosts file, separate the IP address and the hostname with spaces, and do not type a literal "\t" sequence between them.

This is wrong:

10.10.0.5 \t server1

This is correct:

10.10.0.5    server1
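
A rough way to catch this kind of mistake is to scan the file for malformed entries. The sketch below is a heuristic only: lint_hosts is a hypothetical helper, and the checks it performs (a literal "\t" sequence, a first field that is not an IP address, a missing hostname) are assumptions about what is worth flagging.

```python
import ipaddress


def lint_hosts(text):
    """Return (line_number, message) pairs for suspicious /etc/hosts lines."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and outer whitespace
        if not line:
            continue
        if "\\t" in line:  # a backslash followed by the letter t, not a real tab
            problems.append((number, "literal \\t sequence instead of whitespace"))
            continue
        fields = line.split()
        try:
            ipaddress.ip_address(fields[0])
        except ValueError:
            problems.append((number, "first field is not an IP address"))
            continue
        if len(fields) < 2:
            problems.append((number, "no hostname after the IP address"))
    return problems
```

It can be applied to the real file with lint_hosts(open("/etc/hosts").read()). An empty list means none of these particular problems were found.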

Be careful not to delete or modify existing lines in the /etc/hosts file. Only add new lines at the end of the file.

Also, you do not need to disable the firewall to fix this issue.

Unhand answered 21/9, 2021 at 12:28 Comment(0)
