I build an RDD from a list of URLs, and then try to fetch data with some async HTTP calls. I need all the results before doing further calculations. Ideally, the HTTP calls should run on different nodes for scaling reasons.
I did something like this:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import org.apache.spark.SparkContext

//init spark
val sparkContext = new SparkContext(conf)
val datas = Seq[String]("url1", "url2")
//create rdd
val rdd = sparkContext.parallelize[String](datas)
//httpCall returns Future[String]
val requests = rdd.map((url: String) => httpCall(url))
//await all results (Future.sequence may be better)
val responses = requests.map(r => Await.result(r, 10.seconds))
//print responses
responses.collect().foreach((s: String) => println(s))
//stop spark
sparkContext.stop()
This works, but the Spark job never finishes!
So I wonder what the best practices are for dealing with Future in Spark (or Future[RDD]).
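Following the Future.sequence note in the comment above, one variant I was considering is to batch the calls per partition and block once per batch instead of once per future. This is only a sketch, reusing the imports above and assuming httpCall is serializable and an ExecutionContext is available on the executors (the 10.minutes timeout is arbitrary):

//sketch only: fire all requests of a partition, then wait once for the whole batch
import scala.concurrent.ExecutionContext.Implicits.global

val responsesPerPartition = rdd.mapPartitions { urls =>
  val futures = urls.map(url => httpCall(url)).toList
  //wait for the partition's whole batch instead of one future at a time
  Await.result(Future.sequence(futures), 10.minutes).iterator
}

I don't know whether blocking inside mapPartitions like this is considered acceptable either.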
I think this use case is pretty common, but I haven't found any answer yet.
Best regards