If I connect to a Spark cluster, copy some data to it, and disconnect, ...
library(dplyr)
library(sparklyr)
sc <- spark_connect("local")
copy_to(sc, iris)
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)
then the next time I connect to Spark, the data is not there.
sc <- spark_connect("local")
src_tbls(sc)
## character(0)
spark_disconnect(sc)
This is different from working with a database, where the data is just there regardless of how many times you connect and disconnect.
How do I persist data in the Spark cluster between connections?
I thought sdf_persist() might be what I want, but it appears not.
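For reference, a minimal sketch of what I mean (iris_tbl is just the tbl_spark returned by copy_to() above): as far as I can tell, sdf_persist() only caches the DataFrame within the current connection, so whatever it stores is gone once the session ends.

sc <- spark_connect("local")
iris_tbl <- copy_to(sc, iris)

# Cache the Spark DataFrame to disk for the lifetime of this connection
iris_cached <- sdf_persist(iris_tbl, storage.level = "DISK_ONLY")

spark_disconnect(sc)
# The cache belongs to the Spark session, so disconnecting discards it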
Comments:

sdf_persist(storage.level = "DISK_ONLY")? I'm not sure that it will work though. I have never tried that with Spark, to be honest. – Honeyman

"local" mode. But to connect to a remote cluster, you'll need RStudio Server installed on the cluster as well. – Acrogen

sdf_persist(storage.level = "DISK_ONLY") doesn't work; it still connects to an empty session. – Verniavernice

spark_write_parquet() and spark_read_parquet() (much faster than copy_to()). – Verniavernice
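Following the last comment, a minimal sketch of that approach (the /tmp/iris_parquet path is just an example; any location visible to the cluster works): write the table out as Parquet before disconnecting, then read it back and re-register it in the next session.

library(dplyr)
library(sparklyr)

# First session: copy the data in and write it out as Parquet files
sc <- spark_connect("local")
iris_tbl <- copy_to(sc, iris)
spark_write_parquet(iris_tbl, path = "/tmp/iris_parquet")
spark_disconnect(sc)

# Later session: read the Parquet files back and register them as "iris"
sc <- spark_connect("local")
iris_tbl <- spark_read_parquet(sc, name = "iris", path = "/tmp/iris_parquet")
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)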