Original title: Besides HDFS, what other DFSs does Spark support (and which are recommended)?
I am happily using Spark and Elasticsearch (with the elasticsearch-hadoop driver) with several gigantic clusters.
From time to time, I would like to pull an entire cluster's worth of data out, process each document, and put all of it into a different Elasticsearch (ES) cluster (yes, data migration too).
Currently, there is no way to read ES data from one cluster into RDDs and write those RDDs into a different cluster with Spark + elasticsearch-hadoop, because that would involve swapping out the SparkContext an RDD is bound to. So I would like to write the RDDs out as object files and later read them back into RDDs under a different SparkContext.
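For reference, a minimal sketch of that object-file round trip might look like the following. This is my assumption of how the two-stage approach would go, not tested migration code: the path /shared/lovelydata-dump is hypothetical and would have to be storage visible to every executor (a shared mount, or any filesystem Spark can reach).

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Stage 1: dump the source cluster to object files on shared storage.
val confA = new SparkConf().setAppName("ES Dump").set("es.nodes", "from.escluster.com")
val scA = new SparkContext(confA)
scA.esRDD("some/lovelydata").saveAsObjectFile("/shared/lovelydata-dump")
scA.stop() // only one SparkContext may be active at a time

// Stage 2: read the dump back with a fresh context pointed at the target cluster.
val confB = new SparkConf().setAppName("ES Load").set("es.nodes", "to.escluster.com")
val scB = new SparkContext(confB)
val restored = scB.objectFile[(String, scala.collection.Map[String, AnyRef])]("/shared/lovelydata-dump")
restored.saveToEsWithMeta("clone/lovelydata")
scB.stop()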
However, here comes the problem: I then need a DFS (distributed file system) to share the big files across my entire Spark cluster. The most popular solution is HDFS, but I would very much like to avoid introducing Hadoop into my stack. Is there any other recommended DFS that Spark supports?
Update Below
Thanks to @Daniel Darabos's answer below, I can now read and write data from/into different Elasticsearch clusters using the following Scala code:
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf().setAppName("Spark Migrating ES Data")
conf.set("es.nodes", "from.escluster.com") // read from the source cluster
val sc = new SparkContext(conf)
// esRDD yields (documentId, document) pairs, so IDs survive the migration
val allDataRDD = sc.esRDD("some/lovelydata")
// per-write settings override the SparkConf value, pointing this save at the target cluster
val cfg = Map("es.nodes" -> "to.escluster.com")
allDataRDD.saveToEsWithMeta("clone/lovelydata", cfg)
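A note on why this works: esRDD returns (documentId, document) pairs, and saveToEsWithMeta treats each pair's key as the document metadata, so the original IDs carry over to the target index. The cfg map overrides es.nodes from the SparkConf for that one write only, which is what lets a single SparkContext talk to both clusters with no DFS in between.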