R: making cluster in doParallel / snowfall hangs

I've got two servers on a LAN with fresh installs of CentOS 6.4 minimal and R 3.0.1. Both computers have the doParallel, snow, and snowfall packages installed.

The servers can ssh to each other fine.

When I attempt to make clusters in either direction, I get a prompt for a password, but after entering the password, it just hangs there indefinitely.

makePSOCKcluster("192.168.1.1",user="username")

How can I troubleshoot this?

edit:

I also tried calling makePSOCKcluster on the above-mentioned computer with a host that IS capable of being used as a slave (from other computers), but it still hangs. So, is it possible there is a firewall issue? I also tried using makePSOCKcluster with port 22:

> makePSOCKcluster("192.168.1.1",user="username",port=22)
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  :
  cannot open the connection
In addition: Warning message:
In socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  :
  port 22 cannot be opened

Here is my iptables configuration:

# Firewall configuration written by system-config-firewall
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
-A FORWARD -j REJECT --reject-with icmp-host-prohibited
COMMIT
Waiver answered 29/7, 2013 at 11:50 Comment(1)
You need to be root to bind to low-numbered ports, and you can't bind to ports that are already bound to another process such as sshd. - Alb

You could start by setting the "outfile" option to an empty string when creating the cluster object:

makePSOCKcluster("192.168.1.1",user="username",outfile="")

This allows you to see error messages from the workers in your terminal, which will hopefully provide a clue to the problem. If that doesn't help, I recommend using manual mode:

makePSOCKcluster("192.168.1.1",user="username",outfile="",manual=TRUE)

This bypasses ssh, and displays commands for you to execute in order to manually start each of the workers in separate terminals. This can uncover problems such as R packages that are not installed. It also allows you to debug the workers using whatever debugging tools you choose, although that takes a bit of work.

If makePSOCKcluster doesn't respond after you execute the specified command, it means that the worker wasn't able to connect to the master process. If the worker doesn't display any error message, it may indicate a networking problem, possibly due to a firewall blocking the connection. Since makePSOCKcluster uses a random port by default in R 3.X, you should specify an explicit value for port and configure your firewall to allow connections to that port.
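As a rough sketch of the fixed-port advice, assuming the worker address from the question and an arbitrarily chosen port 11234 (the helper name connect_worker is hypothetical, not part of the parallel package), something like the following could be used. The master's firewall would also need to admit new TCP connections on that port, for example with a line modelled on the existing port 22 rule ("--dport 11234" instead of "--dport 22") placed before the REJECT line.

```r
library(parallel)

# Assumed values: the worker address comes from the question,
# the port number is an arbitrary unprivileged port.
worker_host  <- "192.168.1.1"
cluster_port <- 11234L

# Hypothetical helper: start one worker on a fixed port, show worker
# output in the terminal, and ask the worker for its node name.
connect_worker <- function(host, port) {
  cl <- makePSOCKcluster(host, port = port, outfile = "")
  on.exit(stopCluster(cl), add = TRUE)
  clusterCall(cl, function() Sys.info()[["nodename"]])
}

# Not run here, since it needs a reachable worker host:
# connect_worker(worker_host, cluster_port)
```

If the call returns the worker's node name, the worker was able to connect back to the master on the chosen port.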

To test for networking or firewall problems, you could try connecting to the master process using "netcat". Execute makePSOCKcluster in manual mode, specifying the hostname of the desired worker host and the port on the local machine that should allow incoming connections:

> library(parallel)
> makePSOCKcluster("node03", port=11234, manual=TRUE)
Manually start worker on node03 with
   '/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=node01 PORT=11234 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

Now start a terminal session on "node03" and execute "nc" using the indicated values of "MASTER" and "PORT" as arguments:

node03$ nc node01 11234

The master process should immediately return with the message:

socket cluster with 1 nodes on host ‘node03’

while netcat should display no message, since it is quietly reading from the socket connection.

However, if netcat displays the message:

nc: getaddrinfo: Name or service not known

then you have a hostname resolution problem. If you can find a hostname that does work with netcat, you may be able to get makePSOCKcluster to work by specifying that name via the "master" option: makePSOCKcluster("node03", master="node01", port=11234).
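To make the role of the "master" option more concrete, here is a minimal sketch. By default, makePSOCKcluster tells workers to connect back to the master's node name as reported by Sys.info(), which the workers may not be able to resolve. The helper name start_cluster_with_ip and the addresses in the commented usage are illustrative assumptions taken from this thread.

```r
library(parallel)

# The default for `master` is the local node name; this is the value
# that appears as MASTER=... in the manual-mode command.
default_master <- Sys.info()[["nodename"]]
cat("Workers would be told MASTER =", default_master, "\n")

# Hypothetical helper: advertise an explicit address (for example a
# numeric IP) that the worker can definitely reach.
start_cluster_with_ip <- function(worker, master_ip, port = 11234L) {
  makePSOCKcluster(worker, master = master_ip, port = port, outfile = "")
}

# Not run here, since it needs a reachable worker host:
# cl <- start_cluster_with_ip("192.168.1.1", "192.168.1.2")
# stopCluster(cl)
```

If the printed name is not one that resolves on the worker host, passing a numeric IP address as "master" is worth trying.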

If netcat returns immediately, that may indicate that it wasn't able to connect to the specified port. If it returns after a minute or two, that may indicate that it wasn't able to communicate with the specified host at all. In either case, check netcat's return value to verify that it was an error:

node03$ echo $?
1

Hopefully that will give you enough information about the problem that you can get help from a network administrator.
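If netcat is not installed on the worker host, a similar check can be sketched in R itself using base R's socketConnection. The function name can_connect and the 5-second default timeout are assumptions for illustration only.

```r
# Hypothetical helper: try to open a client TCP connection to host:port.
# Returns TRUE if the connection opens, FALSE if it fails.
can_connect <- function(host, port, timeout = 5) {
  tryCatch({
    con <- suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    )
    close(con)
    TRUE
  }, error = function(e) FALSE)
}

# Example, run on the worker while the master waits in manual mode:
# can_connect("node01", 11234)
```

As with netcat, a successful connection makes the waiting master believe that a worker has started, so this is only a diagnostic for the network path and does not create a usable cluster.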

Alb answered 29/7, 2013 at 13:39 Comment(7)
Thanks. I've tried with passwordless ssh with no luck. When using makePSOCKcluster with manual=TRUE, it tells me to run '/usr/lib64/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=genome PORT=11494 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE, which I do on the slave, but nothing happens after that. Putting more clues in OP. - Waiver
When I disable my firewall (iptables) on both master and slave, I get the same result, even when manual=TRUE. - Waiver
Thanks, it looks like I'm getting somewhere! selinux and iptables are disabled on both master and slave. I ran 'makePSOCKcluster("192.168.1.1", port=11234)' on the master, which hangs UNTIL I execute 'nc 192.168.1.2 11234' on the slave. Immediately after calling nc on the slave, the following appears on the master: "socket cluster with 1 nodes on host '192.168.1.1'". (192.168.1.2 is the master, 1.1 is the slave.) So, does this mean the slave is not listening on the port until it's told to? How may I start socket clusters without using nc on the slave? Thanks again. - Waiver
@user1489048 Unfortunately, the actual worker process didn't start: you tricked the master into thinking that it started when you connected to it with nc. Question: does the master display "MASTER=192.168.1.2" when using manual mode? If not, setting "master=192.168.1.2" might help. - Alb
@user1489048 In other words, try running 'makePSOCKcluster("192.168.1.1", master="192.168.1.2", port=11234)'. - Alb
Defining master as an IP address rather than a hostname worked. Thank you so much! - Waiver
Forcing a port (11000) and making ITOps open it on both servers worked for me. Thanks. - Mehetabel
