Can sparklyr be used with Spark deployed on a YARN-managed Hadoop cluster?

Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark, it is possible by doing:

# set R environment variables
Sys.setenv(YARN_CONF_DIR=...)
Sys.setenv(SPARK_CONF_DIR=...)
Sys.setenv(LD_LIBRARY_PATH=...)
Sys.setenv(SPARKR_SUBMIT_ARGS=...)

sparkr_lib_dir <- ... # install specific
library(SparkR, lib.loc = c(sparkr_lib_dir, .libPaths()))
sc <- sparkR.init(master = "yarn-client")

However, when I swapped the last two lines above with

library(sparklyr)
sc <- spark_connect(master = "yarn-client")

I get the following error:

Error in start_shell(scon, list(), jars, packages) : 
  Failed to launch Spark shell. Ports file does not exist.
    Path: /usr/hdp/2.4.2.0-258/spark/bin/spark-submit
    Parameters: '--packages' 'com.databricks:spark-csv_2.11:1.3.0,com.amazonaws:aws-java-sdk-pom:1.10.34' '--jars' '<path to R lib>/3.2/sparklyr/java/rspark_utils.jar'  sparkr-shell /tmp/RtmpT31OQT/filecfb07d7f8bfd.out

Ivy Default Cache set to: /home/mpollock/.ivy2/cache
The jars for the packages stored in: /home/mpollock/.ivy2/jars
:: loading settings :: url = jar:file:<path to spark install>/lib/spark-assembly-1.6.1.2.4.2.0-258-hadoop2.7.1.2.4.2.0-258.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
:: resolution report :: resolve 480ms :: artifacts dl 0ms
    :: modules in use:
    -----------------------------------------

Is sparklyr an alternative to SparkR or is it built on top of the SparkR package?

Merous answered 29/6, 2016 at 14:42 Comment(2)
Looking at the sparkapi readme, the answer to the last question is clearly "it is an alternative to SparkR". Still not sure how to use master='yarn-client', though. – Merous
Related question: #38486663 - seems that the issue keeps popping up in different OS & configurations. – Anuradhapura

Yes, sparklyr can be used against a YARN-managed cluster. In order to connect to YARN-managed clusters one needs to do the following (a minimal sketch follows the list):

  1. Set the SPARK_HOME environment variable to point to the right Spark home directory.
  2. Connect to the Spark cluster using the appropriate master location, for instance: sc <- spark_connect(master = "yarn-client")
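
Putting the two steps together, a minimal sketch (the SPARK_HOME path below is only an assumption; substitute your cluster's Spark home directory):

library(sparklyr)

# Assumed location of the cluster's Spark installation -- adjust to your environment
Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")

# Connect through YARN in client mode
sc <- spark_connect(master = "yarn-client")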

See also: http://spark.rstudio.com/deployment.html

Loisloise answered 29/6, 2016 at 18:41 Comment(4)
I tried setting SPARK_HOME, which took, but the ports file issue remains. It is not clear to me exactly what spark_connect is looking for or where it is looking. Is it necessary to pull out names and ports from yarn-site.xml? – Merous
Currently, sparklyr is an alternative to SparkR; I have not tried using them both side by side, since this is currently unsupported. Could you confirm that you are running your script without the SparkR library loaded? If that still does not work, could you dump your system information (OS, version, x86/x64, Spark distribution, etc.) for us to take a look and reproduce this? It would also be appreciated if you opened this issue at github.com/rstudio/sparklyr to have more people help unblock this. – Loisloise
I finally got things working by adding config=list() to the inputs of spark_connect(). It seems that the error message is a bit misleading. Is the real issue around getting the Spark packages installed? – Merous
In older versions of sparklyr, spark_connect() specified a CSV package that Spark would download from Spark's online package repository; therefore, spark_connect() required internet connectivity unless config = list() was specified to override adding this CSV package. Newer versions of sparklyr embed the CSV package to avoid requiring internet connectivity, so config = list() is no longer required for offline clusters. – Loisloise
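
For reference, a rough sketch of the workaround discussed in these comments, using the Spark home path from the question (only the empty config is essential; it made older sparklyr versions skip the online package download):

library(sparklyr)

# Spark home path taken from the question -- adjust to your install
Sys.setenv(SPARK_HOME = "/usr/hdp/2.4.2.0-258/spark")

# An empty config suppresses the default --packages (spark-csv) download,
# which older sparklyr versions attempted at connect time
sc <- spark_connect(master = "yarn-client", config = list())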

Yes, it can, but there is one catch that everything else written here misses, and it is very elusive in the blog literature: configuring the resources.

The key is this: when you run in local mode you do not have to configure the resources declaratively, but when you run on a YARN cluster you absolutely do have to declare those resources. It took me a long time to find the article that shed some light on this issue, but once I tried it, it worked.

Here's an (arbitrary) example with the key reference:

# The kind of resource settings that need to be declared explicitly when running on YARN:
config <- spark_config()
config$spark.driver.cores <- 32
config$spark.executor.cores <- 32
config$spark.executor.memory <- "40g"

# Full example, starting from a fresh spark_config():
library(sparklyr)

Sys.setenv(SPARK_HOME = "/usr/local/spark")
Sys.setenv(HADOOP_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')
Sys.setenv(YARN_CONF_DIR = '/usr/local/hadoop/etc/hadoop/conf')

config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"

sc <- spark_connect(master = "yarn-client", config = config, version = '2.1.0')

R Bloggers Link to Article

Napoleon answered 17/3, 2017 at 22:48 Comment(0)

Are you possibly using Cloudera Hadoop (CDH)?

I am asking as I had the same issue when using the CDH-provided Spark distro:

Sys.getenv('SPARK_HOME')
[1] "/usr/lib/spark"  # CDH-provided Spark
library(sparklyr)
sc <- spark_connect(master = "yarn-client")
Error in sparkapi::start_shell(master = master, spark_home = spark_home,  : 
      Failed to launch Spark shell. Ports file does not exist.
        Path: /usr/lib/spark/bin/spark-submit
        Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', --packages, 'com.databricks:spark-csv_2.11:1.3.0','com.amazonaws:aws-java-sdk-pom:1.10.34', sparkr-shell, /tmp/Rtmp6RwEnV/file307975dc1ea0.out

Ivy Default Cache set to: /home/oracle/.ivy2/cache
The jars for the packages stored in: /home/oracle/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/lib/spark-assembly-1.6.0-cdh5.7.0-hadoop2.6.0-cdh5.7.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.11;1.3.0 in central
    found org.apache.commons#commons-csv;1.1 in central
    found com.univocity#univocity-parsers;1.5.1 in central
    found com.

However, after I downloaded a pre-built version from Databricks (Spark 1.6.1, Hadoop 2.6) and pointed SPARK_HOME there, I was able to connect successfully:

Sys.setenv(SPARK_HOME = '/home/oracle/spark-1.6.1-bin-hadoop2.6') 
sc <- spark_connect(master = "yarn-client") # OK
library(dplyr)
iris_tbl <- copy_to(sc, iris)
src_tbls(sc)
[1] "iris"

Cloudera does not yet include SparkR in its distribution, and I suspect that sparklyr may still have some subtle dependency on SparkR. Here are the results when trying to work with the CDH-provided Spark, but using the config=list() argument, as suggested in this thread from the sparklyr issues on GitHub:

sc <- spark_connect(master='yarn-client', config=list()) # with CDH-provided Spark
Error in sparkapi::start_shell(master = master, spark_home = spark_home,  : 
  Failed to launch Spark shell. Ports file does not exist.
    Path: /usr/lib/spark/bin/spark-submit
    Parameters: --jars, '/u01/app/oracle/product/12.1.0.2/dbhome_1/R/library/sparklyr/java/sparklyr.jar', sparkr-shell, /tmp/Rtmpi9KWFt/file22276cf51d90.out

Error: sparkr.zip does not exist for R application in YARN mode.

Also, if you check the rightmost part of the Parameters part of the error (both yours and mine), you'll see a reference to sparkr-shell...

(Tested with sparklyr 0.2.28, sparkapi 0.3.15, R session from RStudio Server, Oracle Linux)

Anuradhapura answered 20/7, 2016 at 16:20 Comment(1)
Thanks much. I am, however, on an HDP cluster with Spark 1.6.1, so the under-the-hood R methods should be available in Spark. The issue seems to be that I lack a certain ports config file that apparently is not needed for anything else. – Merous

An upgrade to sparklyr version 0.2.30 or newer is recommended for this issue. Upgrade using devtools::install_github("rstudio/sparklyr"), then restart the R session.
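
A short sketch of that upgrade path (it assumes the devtools package is already installed):

# Install the development version of sparklyr from GitHub
devtools::install_github("rstudio/sparklyr")

# Restart the R session, then reconnect
library(sparklyr)
sc <- spark_connect(master = "yarn-client")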

Loisloise answered 26/7, 2016 at 7:18 Comment(2)
Thanks for following up, but updating (to 0.2.31) did not resolve the ports file issue. The Spark installation on my cluster does not seem to have the config file that is expected. sparklyr tried to call .../spark/bin/spark-submit, but the config files are in .../spark/conf, which has things like hive-site.xml and spark-defaults.conf but no "ports" file. – Merous
I should note that this Spark installation has been heavily used with both pyspark and SparkR without issue. – Merous
