Connect sparklyr to remote spark connection

I would like to connect my local desktop RStudio session to a remote Spark session via sparklyr. When you go to add a new connection in the sparklyr UI tab in RStudio and choose Cluster, it says that you have to be running on the cluster, or have a high-bandwidth connection to the cluster.

Can anyone shed light on how to create that kind of connection? I am not sure how to create a reproducible example of this, but in general what I would like to do is:

library(sparklyr)
sc <- spark_connect(master = "spark://ip-[MY_PRIVATE_IP]:7077", spark_home = "/home/ubuntu/spark-2.0.0", version="2.0.0")

from a remote server. I understand that there will be latency, especially when trying to pass data between the remotes. I also understand that it would be better to have RStudio Server on the actual cluster, but that is not always possible, and I am looking for a sparklyr option for interacting between my server and my desktop RStudio session. Thanks.

Cyclopropane answered 30/9, 2016 at 19:28 Comment(1)
Is it throwing an error when you try to use spark_connect? – Arlindaarline

As of sparklyr version 0.4, connecting from the RStudio desktop to a remote Spark cluster is not supported. Instead, as you mention, the recommended approach is to install RStudio Server within the Spark cluster.

That said, the livy branch of sparklyr is exploring integration with Livy, which would enable the RStudio desktop to connect to a remote Spark cluster through Livy.

Beeman answered 1/11, 2016 at 17:38 Comment(0)

With a more recent version of sparklyr (version 0.9.2, for example), it's possible to connect to a remote Spark cluster.

Here is an example connecting to a Spark standalone cluster, version 2.3.1. See the Master URLs documentation for other master URL schemes.
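For reference, the master URL schemes mentioned above follow a small set of patterns. A quick sketch (the authoritative list is in the Spark "Master URLs" documentation):

```r
# Common master URL forms accepted by spark_connect(master = ...):
#   "local[*]"              - run Spark locally, using all available cores
#   "spark://HOST:7077"     - Spark standalone cluster (as in this answer)
#   "yarn"                  - YARN cluster (resolved via HADOOP_CONF_DIR)
#   "mesos://HOST:5050"     - Mesos cluster
```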

#install.packages("sparklyr")
library(sparklyr)

# The same Spark version must be installed locally (on the driver, where RStudio runs)
spark_v <- "2.3.1"
cat("Installing Spark in the directory:", spark_install_dir())
spark_install(version = spark_v)

sc <- spark_connect(spark_home = spark_install_find(version=spark_v)$sparkVersionDir, 
                    master = "spark://ip-[MY_PRIVATE_IP]:7077")

sc$master
# "spark://ip-[MY_PRIVATE_IP]:7077"
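Once connected, a quick way to confirm that the cluster is actually doing the work is to copy a small table over and run a query against it. A minimal sketch, assuming the `sc` connection above succeeded:

```r
library(dplyr)

# Copy a small local data frame to the cluster; the query below then
# runs as a Spark job, and collect() brings the (small) result back
# into the local R session.
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

iris_tbl %>%
  count(Species) %>%
  collect()

# Close the connection when done.
spark_disconnect(sc)
```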

I've written a post on this topic.

Teresitateressa answered 27/11, 2018 at 15:49 Comment(0)

I finally managed to connect my local R session to a cloud Spark cluster (HDInsight, in my case) using Livy.

Within sparklyr's spark_connect there is an option to connect via Livy (method = "livy"):

sc <- spark_connect(master = "https://<clustername>.azurehdinsight.net/livy/",
                    method = "livy",
                    config = livy_config(
                      username = "<admin>",
                      password = rstudioapi::askForPassword("Livy password:")))
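The resulting connection behaves like any other sparklyr connection, just with higher latency, since every call is proxied over HTTP through Livy. A small sketch, assuming the connection above succeeded:

```r
# sparklyr connections implement the DBI interface, so plain SQL works too;
# the query executes on the remote cluster and only the result comes back.
DBI::dbGetQuery(sc, "SELECT 1 AS ok")

# Close the Livy session when finished to free cluster resources.
spark_disconnect(sc)
```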
Equipage answered 24/1, 2019 at 5:14 Comment(0)
