Hadoop DistributedCache functionality in Spark

I am looking for functionality in Spark similar to Hadoop's distributed cache. I need a relatively small data file (with some index values) to be present on all nodes in order to perform some calculations. Is there any approach that makes this possible in Spark?

My workaround so far consists of distributing and reducing the index file as a normal processing step, which takes around 10 seconds in my application. After that, I collect the file's contents and register them as a broadcast variable, as follows:

JavaRDD<String> indexFile = ctx.textFile("s3n://mybucket/input/indexFile.txt", 1);
// collect() returns a java.util.List, so copy it into an ArrayList for broadcasting
final ArrayList<String> localIndex = new ArrayList<String>(indexFile.collect());

final Broadcast<ArrayList<String>> globalIndex = ctx.broadcast(localIndex);
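
Tasks then read the cached copy through Broadcast.value(). A minimal sketch of what that looks like, assuming a hypothetical dataRdd to process (the RDD and variable names are illustrative, not from my actual job):

import org.apache.spark.api.java.function.Function;

JavaRDD<String> matches = dataRdd.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String record) {
        // value() returns the executor-local copy of the index
        return globalIndex.value().contains(record);
    }
});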

This makes the contents of globalIndex available to every node. So far this patch works for me, but I suspect it is not the best solution. Would it still be effective with a considerably bigger data set or a large number of variables?

Note: I am using Spark 1.0.0 running on a Standalone cluster spread across several EC2 instances.

Happen answered 2/9, 2014 at 14:20 Comment(4)
Can you not cache the file? Basically, store it as an RDD; that would be scalable.Gaur
I think broadcasting the variable is the same as caching it. My question is mostly about whether a direct caching method exists, without having to 'process' the file first.Happen
Personally I think a broadcast var is better than the distcache in terms of usability, but is there a reason you couldn't just use Hadoop's distributed cache?Supererogate
@Gaur this is not an equivalent solution: distcache is for storing the same data on multiple nodes, whereas an RDD stores different data on each node and does not scale for this purpose.Supererogate

Please have a look at the SparkContext.addFile() method. I guess that is what you were looking for.

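A minimal sketch of how that could look for the file in the question (SparkFiles.get() resolves the node-local copy of a file registered with addFile()):

import org.apache.spark.SparkFiles;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// On the driver: ship the file to every node
ctx.addFile("s3n://mybucket/input/indexFile.txt");

// Inside a task on any node: resolve and read the local copy
String localPath = SparkFiles.get("indexFile.txt");
List<String> index = Files.readAllLines(Paths.get(localPath), StandardCharsets.UTF_8);
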
Roughandtumble answered 19/2, 2016 at 0:49 Comment(0)

As long as we use broadcast variables, this approach should be effective with larger datasets as well.

From the Spark documentation: "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner."

Williemaewillies answered 28/1, 2015 at 13:19 Comment(0)
