How to store data in a Spark cluster using sparklyr?

If I connect to a Spark cluster, copy some data to it, and disconnect, ...

library(dplyr)
library(sparklyr)
sc <- spark_connect("local")
copy_to(sc, iris)
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)

then the next time I connect to Spark, the data is not there.

sc <- spark_connect("local")
src_tbls(sc)
## character(0)
spark_disconnect(sc)

This is different from working with a database, where the data is there regardless of how many times you connect.

How do I persist data in the Spark cluster between connections?

I thought sdf_persist() might be what I want, but it appears not.

Verniavernice answered 23/2, 2017 at 13:40 Comment(7)
It's because data doesn't persist across different Spark sessions, which is what happens if you disconnect and then reconnect.Acrogen
@Acrogen Thanks. So there is no way to keep a session alive when you disconnect?Verniavernice
Can you try sdf_persist(storage.level = "DISK_ONLY")? I'm not sure that it will work, though; I have never tried that with Spark, to be honest.Honeyman
@RichieCotton Probably only an issue in "local" mode. But to connect to a remote cluster, you'll need RStudio Server installed on the cluster as well.Acrogen
@Honeyman Sorry, sdf_persist(storage.level = "DISK_ONLY") doesn't work; it still connects to an empty session.Verniavernice
@RichieCotton Did you learn something new about this problem?Inimical
@Inimical There is no persistence between connections. People seem to just keep clusters running indefinitely, or save/reload their datasets using spark_write_parquet() and spark_read_parquet() (much faster than copy_to()).Verniavernice

Spark is an engine that runs on a computer or cluster to execute tasks; it is not a database or a file system. You can save your data to a file system when you are done and load it back in during your next session.

https://en.wikipedia.org/wiki/Apache_Spark
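
For example, here is a minimal sketch of that save-and-reload workflow using sparklyr's Parquet functions (the local master and the /tmp path are illustrative assumptions):

library(dplyr)
library(sparklyr)

# First session: copy the data in once and write it out to disk as Parquet.
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, "iris")
spark_write_parquet(iris_tbl, path = "file:///tmp/iris_parquet")  # hypothetical path
spark_disconnect(sc)

# Next session: no copy_to() needed; read the Parquet files back instead.
sc <- spark_connect(master = "local")
iris_tbl <- spark_read_parquet(sc, name = "iris", path = "file:///tmp/iris_parquet")
src_tbls(sc)
## [1] "iris"
spark_disconnect(sc)

As noted in the comments above, reading Parquet back is typically much faster than re-running copy_to().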

Trot answered 27/4, 2017 at 13:56 Comment(2)
Yeah, this seems about right. But is there a workaround for this? Some way to integrate Spark more tightly with a database or filesystem so that the data, once loaded, is always available every time you fire up Spark? Of course you can always load the data up during the next session, but at least in my experience, copying the data to Spark is time consuming.Tortricid
Good question; I haven't seen anything like that. What I typically do is save my datasets in iterations as parquet files and load them as needed. So if you have a large set of data that takes a long time to process, load it, do an initial set of work, save that work, and when you start later, load in that intermediate file.Trot
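
A rough sketch of that checkpointing pattern (the table names, paths, and transformation steps are made up for illustration):

library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

# Expensive initial work: read the raw data and do a first round of transformations.
raw_tbl <- spark_read_parquet(sc, name = "raw", path = "file:///tmp/raw_parquet")
cleaned_tbl <- raw_tbl %>%
  filter(!is.na(value)) %>%           # illustrative cleaning step
  mutate(value_scaled = value / 100)

# Checkpoint the intermediate result so later sessions can skip the work above.
spark_write_parquet(cleaned_tbl, path = "file:///tmp/cleaned_parquet")
spark_disconnect(sc)

# Later session: start from the checkpoint instead of re-doing the raw processing.
sc <- spark_connect(master = "local")
cleaned_tbl <- spark_read_parquet(sc, name = "cleaned", path = "file:///tmp/cleaned_parquet")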
