Jaccard Similarity of an RDD with the help of Spark and Scala without Cartesian?

// Cartesian product of RDD a with itself, filtering out pairs that have the same key in both positions.
// e.g. ((India,Set[Country,Place,....]),(USA,Set[Country,State,..]))
val filterOnjoin = a.cartesian(a).filter(f => (!f._1._1.toString().contentEquals(f._2._1.toString())))

Since the Cartesian product is an expensive operation on an RDD, I solved the problem above by using the HashingTF and MinHashLSH classes from Spark MLlib to find the Jaccard similarity. Steps to find the Jaccard similarity in the RDD "a" mentioned in the question:

Convert the RDD into a DataFrame

 import sparkSession.implicits._  
 val dfA = a.toDF("id", "values")

Create the feature vector with the help of HashingTF

import org.apache.spark.ml.feature.HashingTF

val hashingTF = new HashingTF()
  .setInputCol("values")
  .setOutputCol("features")
  .setNumFeatures(1048576) // 2^20 hash buckets

Feature transformation

val featurizedData = hashingTF.transform(dfA) //Feature Transformation

Create the MinHash tables. Increasing the number of hash tables improves accuracy, but it also increases communication cost and run time.
```
import org.apache.spark.ml.feature.MinHashLSH

val mh = new MinHashLSH()
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")
```
Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.
```
val model = mh.fit(featurizedData)
// Approximate self-join of featurizedData, keeping pairs with
// Jaccard distance smaller than 0.45
val dffilter = model.approxSimilarityJoin(featurizedData, featurizedData, 0.45)
```
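Because this is a self-join, the result contains each row paired with itself and every match in both orders. The sketch below shows one way to keep a single row per unordered pair. It relies on the documented output layout of approxSimilarityJoin (struct columns `datasetA` and `datasetB` plus a distance column named `distCol` by default) and assumes the `id` values are unique and comparable. The object and method names (`JoinCleanup`, `distinctPairs`) are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col

object JoinCleanup {
  // joined: the Dataset returned by model.approxSimilarityJoin(df, df, threshold).
  // Assumption: the distance column still has its default name "distCol".
  def distinctPairs(joined: Dataset[_]): DataFrame =
    joined.toDF()
      // Keeping only idA < idB drops self-pairs and one of each mirrored pair.
      .filter(col("datasetA.id") < col("datasetB.id"))
      .select(
        col("datasetA.id").as("idA"),
        col("datasetB.id").as("idB"),
        col("distCol").as("jaccardDistance"))
}
```

With the names used above, this would be called as `JoinCleanup.distinctPairs(dffilter)`.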

Spark leaves some optimization to the user, such as setting the number of partitions and the persistence level, so I tuned these parameters as well.

Changing the storage level from persist() to persist(StorageLevel.MEMORY_AND_DISK) helped me get rid of the OOM error.
Also, for the join operation I repartitioned the data according to the RDD size. On a 16.6 GB data set I was using 200 partitions for a simple join; increasing that to 600 also solved my OOM problem.
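A minimal sketch of these two settings follows. The helper names (`Tuning`, `prepare`, `setShufflePartitions`) are hypothetical, and the partition count is whatever suits your data. The second helper is an assumption about where the 200 came from: for DataFrame joins, Spark takes the shuffle partition count from `spark.sql.shuffle.partitions`, whose default is 200.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object Tuning {
  // Spread the RDD over more partitions and let cached blocks spill to disk
  // instead of failing when they do not fit in memory.
  def prepare[T](rdd: RDD[T], numPartitions: Int): RDD[T] =
    rdd.repartition(numPartitions).persist(StorageLevel.MEMORY_AND_DISK)

  // For DataFrame/Dataset joins, the number of shuffle partitions is a session setting.
  def setShufflePartitions(spark: SparkSession, numPartitions: Int): Unit =
    spark.conf.set("spark.sql.shuffle.partitions", numPartitions.toString)
}
```

For the data set described here, that would be `Tuning.prepare(a, 600)` before the join.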

PS: The constant parameters setNumFeatures(1048576) and setNumHashTables(3) were chosen while experimenting on the 16.6 GB data set. You can increase or decrease these values according to your data set. The number of partitions also depends on your data set size. With these optimizations, I got my desired results.
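When tuning these values, it can help to spot-check a few returned pairs against the exact value. The threshold passed to approxSimilarityJoin is a Jaccard distance, that is 1 minus the Jaccard similarity, so 0.45 keeps pairs whose similarity is above 0.55. Below is a plain-Scala sketch of the exact computation. The names `JaccardCheck`, `jaccardSimilarity` and `jaccardDistance` are mine, and treating two empty sets as identical is an assumption.

```scala
object JaccardCheck {
  // |A ∩ B| / |A ∪ B|; two empty sets are treated as identical (assumption).
  def jaccardSimilarity[T](a: Set[T], b: Set[T]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 1.0 else (a intersect b).size.toDouble / unionSize
  }

  // The distance compared against the join threshold.
  def jaccardDistance[T](a: Set[T], b: Set[T]): Double =
    1.0 - jaccardSimilarity(a, b)
}
```

For example, with the made-up sets `Set("Country", "Place", "Asia")` and `Set("Country", "State", "Place")`, two of the four distinct elements are shared. The similarity and the distance are both 0.5, so this pair would not pass a 0.45 distance threshold.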

Useful links:
[https://spark.apache.org/docs/2.2.0/ml-features.html#locality-sensitive-hashing]
[https://eng.uber.com/lsh/]
[https://data-flair.training/blogs/limitations-of-apache-spark/]
