I am experiencing a problem already raised before here and here, but in a different context.
Suppose I have an R
script file called psock.R
that contains the following
code:
cat("Setup cluster...\n")
cluster <- parallel::makePSOCKcluster(
rep("localhost", 2),
master = "localhost",
port = 11234,
manual = FALSE,
outfile = ""
)
cat("Sleep...\n")
Sys.sleep(1)
cat("Teardown cluster...\n")
parallel::stopCluster(cluster)
When I run the script via Rscript --vanilla psock.R
, everything works as
expected, and I see:
Setup cluster...
starting worker pid=11557 on localhost:11234 at 13:55:52.818
starting worker pid=11556 on localhost:11234 at 13:55:52.818
Sleep...
Teardown cluster...
However, when I try to do the same from within an external R
process, the
parallel::makePSOCKcluster
hangs. For example, suppose psock.R
now contains
the following code instead:
# Create a session.
session <- callr::r_session$new()
cat("Setup cluster...\n")
session$run(function() {
# Create a cluster in the `.GlobalEnv`.
cluster <<- parallel::makePSOCKcluster(
rep("localhost", 2),
master = "localhost",
port = 11234,
manual = FALSE,
outfile = ""
)
})
cat("Sleep...\n")
Sys.sleep(5)
cat("Teardown cluster...\n")
session$run(function() {
# Stop it.
parallel::stopCluster(cluster)
})
# Close session.
session$close()
Running psock.R
now hangs at Setup cluster...
for several minutes and then it exits with an error, i.e.:
Cluster setup failed. 1 worker of 2 failed to connect.
Furthermore, running htop --filter /exec/R
confirms that the first worker
manages to connect, but the second one doesn't. More specifically, I see the
following processes (i.e., numbers added by me):
(1) └─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore --vanilla --file=psock.R
(2) └─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-readline --slave --no-save --no-restore
(3) ├─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11234 OUT= SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=parallel
where (1) is the psock.R
script call, (2) is the external process created by
callr::r_session$new()
, and (3) is the first worker spawned by
parallel::makePSOCKcluster
that manages to connect.
Following Steve Weston's answers here and here, I tried to do some debugging.
I started by adjusting psock.R
to write the output on the external process to
a file and enabling manual mode by setting manual = TRUE
, i.e.:
session$run(function() {
# Connection.
connection <- file("/some/path/log.txt", open = "wt")
# Write anything to a log file.
sink(connection, append = TRUE)
sink(connection, append = TRUE, type = "message")
# Create a cluster in the `.GlobalEnv`.
cluster <<- parallel::makePSOCKcluster(
rep("localhost", 2),
master = "localhost",
port = 11234,
manual = TRUE,
outfile = ""
)
})
Running Rscript --vanilla psock.R
after the code change above logs the
following command to log.txt
:
Manually start worker on localhost with
'/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)()' MASTER=localhost PORT=11234 OUT='' SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=sequential
So far, no workers have been created, as can be seen from the output of htop --filter /exec/R
:
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore --vanilla --file=psock.R
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-readline --slave --no-save --no-restore
Now, I can manually run the command above, twice, to create the workers and connect to the main (i.e., external) process. This results in the following output in the terminal, confirming that the workers have been created and connected:
starting worker pid=13065 on localhost:11234 at 14:59:38.185
starting worker pid=13201 on localhost:11234 at 14:59:38.185
I can also verify this via htop --filter /exec/R
, which now shows the
following:
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore --vanilla --file=psock.R
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-readline --slave --no-save --no-restore
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11234 OUT= SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=sequential
└─ /Library/Frameworks/R.framework/Resources/bin/exec/R --no-echo --no-restore -e tryCatch(parallel:::.workRSOCK,error=function(e)parallel:::.slaveRSOCK)() --args MASTER=localhost PORT=11234 OUT= SETUPTIMEOUT=120 TIMEOUT=2592000 XDR=TRUE SETUPSTRATEGY=sequential
At this point I am confused because I would expect something to fail. The only
explanation I can think of is that the process I created to start the PSOCK
cluster behaves differently than manually running that command in the terminal.
Perhaps different permissions or missing environment variables? Is this to be
expected?
I also noticed in the command printed by the manual mode that
SETUPSTRATEGY=sequential
is used. However, when manual = FALSE
, the htop
process shows that SETUPSTRATEGY=parallel
is employed. This is, indeed, in
line with the documentation of parallel::makePSOCKcluster
which reads:
If "parallel" (default) workers will be started in parallel during cluster setup when this is possible, which is now for homogeneous "PSOCK" clusters with all workers started automatically (manual = FALSE) on the local machine. Workers will be started sequentially on other clusters, on all clusters with setup_strategy = "sequential" and on R 3.6.0 and older.
Out of curiosity, I tried running the command printed by the manual model, with
the parallel
strategy instead (i.e., SETUPSTRATEGY=parallel
). After
running it only once, the script resumes execution and stops at the
Sys.sleep(5)
part. This is also reflected in the htop
process list, where I can only
see one worker.
Everything works as expected if I disable the manual mode and add
setup_strategy = "sequential"
in the parallel::makePSOCKcluster
function
call. However, I really want to know why it fails with a parallel
setup
strategy. Furthermore, the code runs fine on a Windows machine, but hangs
just the same on a Debian-based system.
In terms of what I'm running, this is what my R
session looks like:
sessionInfo()
# R version 4.2.1 (2022-06-23)
# Platform: aarch64-apple-darwin20 (64-bit)
# Running under: macOS Monterey 12.6
#
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# loaded via a namespace (and not attached):
# [1] compiler_4.2.1 cli_3.3.0 jsonlite_1.8.0 rlang_1.0.5
And, since other answers have indicated host-related issues, my /etc/hosts
looks like:
───────┼─────────────────────────────────────────────────────────
1 │ ##
2 │ # Host Database
3 │ #
4 │ # localhost is used to configure the loopback interface
5 │ # when the system is booting. Do not change this entry.
6 │ ##
7 │ 127.0.0.1 localhost
8 │ 255.255.255.255 broadcasthost
9 │ ::1 localhost
───────┴─────────────────────────────────────────────────────────
Also, my .Rprofile
is empty.
Finally, not sure if this is relevant, but when running netstat -i
, the values
in the Address
column for my network interfaces are in the format of IPv6, or
rather IPv4-mapped.
Before I succumb to setup_strategy = "sequential"
, do you have any idea what
is going on?
Edit 1.
Following HenrikB's suggestion, I replaced the parallel::makePSOCKcluster
call with parallelly::makeClusterPSOCK
and before the call I enabled the debug logs options(parallelly.debug = TRUE)
. The code now looks like this:
session$run(function() {
# Enable logging for `parallelly`.
options(parallelly.debug = TRUE)
# Connection.
connection <- file("/some/path/log.txt", open = "wt")
# Write anything to a log file.
sink(connection, append = TRUE)
sink(connection, append = TRUE, type = "message")
# Create a cluster in the `.GlobalEnv`.
cluster <<- parallelly::makeClusterPSOCK(
rep("localhost", 2),
master = "localhost",
port = 11234,
manual = FALSE,
outfile = ""
)
})
The output in log.txt
contains the following:
[09:27:19.936] Set package option ‘parallelly.availableCores.methods’
[09:27:19.941] Environment variable ‘R_PARALLELLY_AVAILABLECORES_METHODS’ not set
[09:27:19.941] Set package option ‘parallelly.availableCores.fallback’
[09:27:19.941] Environment variable ‘R_PARALLELLY_AVAILABLECORES_FALLBACK’ not set
[09:27:19.941] Set package option ‘parallelly.availableCores.min’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_MIN’ not set
[09:27:19.942] Set package option ‘parallelly.availableCores.system’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_SYSTEM’ not set
[09:27:19.942] Set package option ‘parallelly.availableCores.logical’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_LOGICAL’ not set
[09:27:19.942] Set package option ‘parallelly.availableCores.omit’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLECORES_OMIT’ not set
[09:27:19.942] Set package option ‘parallelly.availableWorkers.methods’
[09:27:19.942] Environment variable ‘R_PARALLELLY_AVAILABLEWORKERS_METHODS’ not set
[09:27:19.942] Set package option ‘parallelly.fork.enable’
[09:27:19.942] Environment variable ‘R_PARALLELLY_FORK_ENABLE’ not set
[09:27:19.942] Set package option ‘parallelly.supportsMulticore.unstable’
[09:27:19.942] Environment variable ‘R_PARALLELLY_SUPPORTSMULTICORE_UNSTABLE’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.setup_strategy’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_SETUP_STRATEGY’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.validate’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_VALIDATE’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.connectTimeout’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_CONNECTTIMEOUT’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.timeout’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_TIMEOUT’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.useXDR’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_USEXDR’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.socketOptions’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_SOCKETOPTIONS’ not set
[09:27:19.943] Set package option ‘parallelly.makeNodePSOCK.rshcmd’
[09:27:19.943] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_RSHCMD’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.rshopts’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_RSHOPTS’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.tries’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_TRIES’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.tries.delay’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_TRIES_DELAY’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.rscript_label’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_RSCRIPT_LABEL’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.sessionInfo.pkgs’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_SESSIONINFO_PKGS’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.autoKill’
[09:27:19.944] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_AUTOKILL’ not set
[09:27:19.944] Set package option ‘parallelly.makeNodePSOCK.master.localhost.hostname’
[09:27:19.945] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_MASTER_LOCALHOST_HOSTNAME’ not set
[09:27:19.945] Set package option ‘parallelly.makeNodePSOCK.port.increment’
[09:27:19.945] Environment variable ‘R_PARALLELLY_MAKENODEPSOCK_PORT_INCREMENT’ not set
[09:27:19.945] parallelly-specific environment variables:
[09:27:19.961] [local output] Workers: [n = 2] ‘localhost’, ‘localhost’
[09:27:19.962] [local output] Base port: 11234
[09:27:19.962] [local output] Getting setup options for 2 cluster nodes ...
[09:27:19.963] [local output] - Node 1 of 2 ...
[09:27:19.963] [local output] localMachine=TRUE => revtunnel=FALSE
[09:27:19.963] Testing if worker's PID can be inferred: ‘'/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")'’
[09:27:19.978] - Possible to infer worker's PID: FALSE
[09:27:19.978] [local output] Rscript port: 11234
[09:27:19.978] [local output] - Node 2 of 2 ...
[09:27:19.979] [local output] localMachine=TRUE => revtunnel=FALSE
[09:27:19.979] Testing if worker's PID can be inferred: ‘'/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")'’
[09:27:19.995] - Possible to infer worker's PID: FALSE
[09:27:19.995] [local output] Rscript port: 11234
[09:27:19.995] [local output] Getting setup options for 2 cluster nodes ... done
[09:27:19.995] [local output] - Parallel setup requested for some PSOCK nodes
[09:27:19.996] [local output] Setting up PSOCK nodes in parallel
[09:27:19.996] List of 20
[09:27:19.996] $ local_cmd : chr "'/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,st"| __truncated__
[09:27:19.996] $ worker : chr "localhost"
[09:27:19.996] ..- attr(*, "localhost")= logi TRUE
[09:27:19.996] $ rank : int 1
[09:27:19.996] $ rshlogfile : NULL
[09:27:19.996] $ port : int 11234
[09:27:19.996] $ connectTimeout: num 120
[09:27:19.996] $ timeout : num 2592000
[09:27:19.996] $ useXDR : logi FALSE
[09:27:19.996] $ pidfile : chr "/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid"
[09:27:19.996] $ setup_strategy: chr "parallel"
[09:27:19.996] $ outfile : chr ""
[09:27:19.996] $ rshcmd_label : NULL
[09:27:19.996] $ rsh_call : NULL
[09:27:19.996] $ cmd : chr "'/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,st"| __truncated__
[09:27:19.996] $ localMachine : logi TRUE
[09:27:19.996] $ manual : logi FALSE
[09:27:19.996] $ dryrun : logi FALSE
[09:27:19.996] $ quiet : logi FALSE
[09:27:19.996] $ rshcmd : NULL
[09:27:19.996] $ revtunnel : logi FALSE
[09:27:19.996] - attr(*, "class")= chr [1:2] "makeNodePSOCKOptions" "makeNodeOptions"
[09:27:20.001] [local output] System call to launch all workers:
[09:27:20.001] [local output] '/Library/Frameworks/R.framework/Resources/bin/Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11234 OUT= TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=parallel
[09:27:20.001] [local output] Starting PSOCK main server
[09:27:20.003] [local output] Workers launched
[09:27:20.003] [local output] Waiting for workers to connect back
[09:27:20.003] [local output] 0 workers out of 2 ready
[09:27:20.120] [local output] 0 workers out of 2 ready
[09:27:20.121] [local output] 1 workers out of 2 ready
[09:29:20.126] [local output] 1 workers out of 2 ready
[09:31:20.133] [local output] 1 workers out of 2 ready
Error in parallelly::makeClusterPSOCK(rep("localhost", 2), master = "localhost", :
Cluster setup failed. 1 worker of 2 failed to connect.
In addition: Warning messages:
1: In system(test_cmd, intern = TRUE, input = input) :
running command ''/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=1.parallelly.parent=3664.e504deb9152.pid")'' had status 2
2: In system(test_cmd, intern = TRUE, input = input) :
running command ''/Library/Frameworks/R.framework/Resources/bin/Rscript' -e 'try(suppressWarnings(cat(Sys.getpid(),file="/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")), silent = TRUE)' -e 'file.exists("/var/folders/v9/yftkgjms78qdx9vz7mll88mw0000gn/T//RtmpRVw8vP/worker.rank=2.parallelly.parent=3664.e502491c675.pid")'' had status 2
It looks like the Rscript
command has an exit status of 2
, which from the documentation sounds like something cryptic happened:
Error status
2
is used forR
'suicide', that is a catastrophic failure, and other small numbers are used by specific ports for initialization failures.
R --vanilla
andRscript --vanilla script.R
. – Rumpflocalhost
being an issue, try using127.0.0.1
for the worker hostnames. – Rumpfparallel::makePSOCKcluster()
withparallelly::makeClusterPSOCK()
as it gives more informative error messages. You can also setoptions(parallelly.debug = TRUE)
to get detailed output on the what's taking place internally. – Rumpf127.0.0.1
for the worker hostnames (i.e., or both for the workers and the master), no workers are created. With thelocalhost
hostname, one gets started, and the other one hangs. Does this suggest anything to you? I will update in a moment my answer to include the debug logs fromoptions(parallelly.debug = TRUE)
. – Ted