How to store data in a Spark cluster using sparklyr?

If I connect to a Spark cluster, copy some data to it, and disconnect, ...

library(dplyr)
library(sparklyr)
sc <- spark_connect("local")
copy_to(sc, iris)
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)

then the next time I connect to Spark, the data is not there.

sc <- spark_connect("local")
src_tbls(sc)
## character(0)
spark_disconnect(sc)

This is different from working with a database, where the data is there regardless of how many times you connect.

How do I persist data in the Spark cluster between connections?

I thought sdf_persist() might be what I want, but it appears not.

Verniavernice answered 23/2, 2017 at 13:40 Comment(7)
It's because data doesn't persist across different Spark sessions, which is what happens if you disconnect and then reconnect.Acrogen
@Acrogen Thanks. So there is no way to keep a session alive when you disconnect?Verniavernice
Can you try sdf_persist(storage.level = "DISK_ONLY")? I'm not sure that it will work, though; I have never tried that with Spark, to be honest.Honeyman
@RichieCotton Probably only an issue in "local" mode. But to connect to a remote cluster, you'll need RStudio Server installed on the cluster as well.Acrogen
@Honeyman Sorry, sdf_persist(storage.level = "DISK_ONLY") doesn't work; it still connects to an empty session.Verniavernice
@RichieCotton Did you learn something new about this problem?Inimical
@Inimical There is no persistence between connections. People seem to just keep clusters running indefinitely, or save/reload their datasets using spark_write_parquet() and spark_read_parquet() (much faster than copy_to()).Verniavernice

Spark is an engine that runs on a computer or cluster to execute tasks; it is not a database or a file system. You can save your data to a file system when you are done and load it back in during your next session.

https://en.wikipedia.org/wiki/Apache_Spark
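
For example, here is a minimal sketch of that save-and-reload workflow using sparklyr's Parquet functions (the local master and the /tmp path are illustrative assumptions):

library(dplyr)
library(sparklyr)

# First session: copy the data in once and write it out to disk as Parquet.
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris")
spark_write_parquet(iris_tbl, path = "file:///tmp/iris_parquet")  # hypothetical path
spark_disconnect(sc)

# Next session: no copy_to() needed; read the Parquet files back instead.
sc <- spark_connect(master = "local")
iris_tbl <- spark_read_parquet(sc, name = "iris", path = "file:///tmp/iris_parquet")
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)

As noted in the comments above, reading Parquet back is typically much faster than re-running copy_to().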

Trot answered 27/4, 2017 at 13:56 Comment(2)
Yeah, this seems about right. But is there a workaround for this? Some way to integrate Spark more tightly with a database or filesystem so that the data, once loaded, is always available every time you fire up Spark? Of course you can always load the data up during the next session, but at least in my experience, copying the data to Spark is time consuming.Tortricid
Good question; I haven't seen anything like that. What I typically do is save my datasets in iterations as parquet files and load them as needed. So if you have a large set of data that takes a long time to process, load it, do an initial set of work, save that work, and when you start later, load in that intermediate file.Trot
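
A rough sketch of that checkpointing pattern (the table names, paths, and transformation steps are made up for illustration):

library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

# Expensive initial work: read the raw data and do a first round of transformations.
raw_tbl <- spark_read_parquet(sc, name = "raw", path = "file:///tmp/raw_parquet")
cleaned_tbl <- raw_tbl %>%
  filter(!is.na(value)) %>%           # illustrative cleaning step
  mutate(value_scaled = value / 100)

# Checkpoint the intermediate result so later sessions can skip the work above.
spark_write_parquet(cleaned_tbl, path = "file:///tmp/cleaned_parquet")
spark_disconnect(sc)

# Later session: start from the checkpoint instead of re-doing the raw processing.
sc <- spark_connect(master = "local")
cleaned_tbl <- spark_read_parquet(sc, name = "cleaned", path = "file:///tmp/cleaned_parquet")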
