I am working on pair RDDs. My aim is to calculate jaccard similarity between the set of rdd values and cluster them according to the jaccard similarity threshold value.Structure of my RDD is :
val a= [Key,Set(String)] //Pair RDD
For example:-
India,[Country,Place,....]
USA,[Country,State,..]
Berlin,[City,Popluatedplace,..]
After finding jaccard similarity, I will cluster the similar entities into one cluster. In the above example, India and USA will be cluster into one cluster based on some threshold value whereas Berlin will be in the other cluster.
So I took the Cartesian product of rdd a
val filterOnjoin = a.cartesian(a).filter(f =>
(!f._1._1.toString().contentEquals(f._2._1.toString())))
//Cartesianproduct of rdd a and filtering rows with same key at both
//the position.
//e.g. ((India,Set[Country,Place,....]),(USA,Set[Country,State,..]))
and compare the set of values with the help of jaccard similarity.
val Jsim = filterOnjoin.map(f => (f._1._1, (f._2._1,
Similarity.sim(f._1._2, f._2._2)))) //calculating jaccard similarity.
//(India,USA,0.8)
The code is running fine on smaller dataset. As the size of dataset is increased, Cartesian product is taking too much time. For 100 MB data(size of rdd "a"), its doing data shuffle read around 25 GB. For 3.5 GB data, its in TB.
I have gone through various links. Like spark tuning methods and some on stack overflow. But most of the post it is written that broadcast the smaller RDD. But here the size of both the rdd is the same and its big.
Links which I followed :-
Spark: produce RDD[(X, X)] of all possible combinations from RDD[X] of-all-possible-combinations-from-rddx
Spark repartition is slow and shuffles too much data
Map key, value pair based on similarity of their value in Spark
I am new to Spark and Scala. I am unable to think beyond Cartesian product which is bottleneck here. Is it possible to solve this problem without Cartesian product.