How do you read from and write to different Elasticsearch clusters using Spark and elasticsearch-hadoop?

Original title: Besides HDFS, what other DFS does Spark support (and which are recommended)?

I am happily using Spark and Elasticsearch (with the elasticsearch-hadoop driver) with several gigantic clusters.

From time to time, I would like to pull an entire cluster's worth of data out, process each document, and push all of it into a different Elasticsearch (ES) cluster (yes, data migration too).

Currently, there is no way to read ES data from one cluster into RDDs and write those RDDs into a different cluster with Spark + elasticsearch-hadoop, because that would involve swapping the SparkContext the RDDs belong to. So I would like to write the RDDs out as object files and later read them back into RDDs with different SparkContexts.

However, here comes the problem: I then need a DFS (distributed file system) to share the big files across my entire Spark cluster. The most popular solution is HDFS, but I would very much like to avoid introducing Hadoop into my stack. Is there any other recommended DFS that Spark supports?

Update Below

Thanks to @Daniel Darabos's answer below, I can now read data from and write data into different Elasticsearch clusters using the following Scala code:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // brings esRDD and saveToEsWithMeta into scope

// Point the SparkContext's default ES settings at the source cluster.
val conf = new SparkConf().setAppName("Spark Migrating ES Data")
conf.set("es.nodes", "from.escluster.com")

val sc = new SparkContext(conf)

// esRDD yields (document id, document) pairs from the source index.
val allDataRDD = sc.esRDD("some/lovelydata")

// Override es.nodes per write so the docs (with their ids) go to the destination cluster.
val cfg = Map("es.nodes" -> "to.escluster.com")
allDataRDD.saveToEsWithMeta("clone/lovelydata", cfg)
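
A small variant, in case it helps others: the source cluster can also be chosen per read instead of through SparkConf. This is an untested sketch that assumes your elasticsearch-hadoop version has the esRDD(resource, cfg) overload:

// Hypothetical variant: select the source cluster per read rather than via SparkConf.
val readCfg = Map("es.nodes" -> "from.escluster.com")
val sourceRDD = sc.esRDD("some/lovelydata", readCfg)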
Flit answered 12/3, 2015 at 1:2 Comment(0)

Spark uses the hadoop-common library for file access, so whatever file systems Hadoop supports will work with Spark. I've used it with HDFS, S3 and GCS.
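
For instance, the object-file round trip from the question could target any of those filesystems just by choosing the corresponding URI scheme. A minimal sketch, assuming an S3 bucket (the bucket name is a placeholder) and that the s3a connector (hadoop-aws) is on the classpath and configured with credentials:

// Dump the RDD as object files on a Hadoop-supported filesystem.
allDataRDD.saveAsObjectFile("s3a://my-bucket/es-dump/lovelydata")

// Later, possibly from a different application with its own SparkContext:
val restored = sc.objectFile[(String, scala.collection.Map[String, AnyRef])]("s3a://my-bucket/es-dump/lovelydata")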

I'm not sure I understand why you don't just use elasticsearch-hadoop. You have two ES clusters, so you need to access them with different configurations. sc.newAPIHadoopFile and rdd.saveAsHadoopFile take hadoop.conf.Configuration arguments. So you can use two ES clusters with the same SparkContext without any problems.
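
As a rough, untested sketch of that idea (host and index names are placeholders), using the EsInputFormat/EsOutputFormat classes that elasticsearch-hadoop ships in its Map/Reduce layer, with one configuration per cluster inside a single SparkContext:

import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.hadoop.mapred.JobConf
import org.elasticsearch.hadoop.mr.{EsInputFormat, EsOutputFormat}

// Configuration for the source cluster.
val readConf = new JobConf(sc.hadoopConfiguration)
readConf.set("es.nodes", "from.escluster.com")
readConf.set("es.resource", "some/lovelydata")

// Configuration for the destination cluster.
val writeConf = new JobConf(sc.hadoopConfiguration)
writeConf.set("es.nodes", "to.escluster.com")
writeConf.set("es.resource", "clone/lovelydata")
writeConf.setOutputFormat(classOf[EsOutputFormat])
writeConf.setSpeculativeExecution(false)

// Read (doc id, document) pairs from the source cluster ...
val docs = sc.hadoopRDD(readConf,
  classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])

// ... and write the documents to the destination cluster.
docs.saveAsHadoopDataset(writeConf)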

Byroad answered 12/3, 2015 at 12:17 Comment(3)
thanks for the response. I am using the new read (sc.esRDD) and write (rdd.saveToEs) functions provided by elasticsearch-hadoop. There is no way to read from and write to different clusters this way. Thanks for bringing rdd.saveAsHadoopFile up; I am looking into it to see if I can somehow go from there.Flit
Ah, I didn't know about these methods (esRDD and saveToEs). I see they take a cfg: Map[String, String] argument. Could that not be used to provide different configurations for reading and writing?Byroad
thanks for the update. I was such an idiot to ignore the cfg. It's working!! I will add an update to the question for everybody. Thanks again.Flit