Spark LSH gets stuck forever at approxSimilarityJoin() function

MinHashLSH mh = new MinHashLSH()
    .setNumHashTables(3)
    .setInputCol("features")
    .setOutputCol("hashes");
MinHashLSHModel model = mh.fit(dataset);
Dataset<Row> approxSimilarityJoin = model
    .approxSimilarityJoin(dataset, dataset, config.getJaccardLimit(), "JaccardDistance");
approxSimilarityJoin.show();

It will finish if you leave it long enough, but there are some things you can do to speed it up. Reviewing the source code, you can see that the algorithm:

1. hashes the inputs,
2. joins the two datasets on the hashes,
3. computes the Jaccard distance using a UDF, and
4. filters the dataset with your threshold.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala
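
To make those four steps concrete, here is a minimal plain-Java sketch of the same idea, with no Spark involved. The class name MinHashJoinSketch, its method names, and the sample rows are all made up for illustration, and it assumes every row has at least one non-zero index. It is not Spark's implementation, only the same shape: hash, group by hash, compute distance, filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

class MinHashJoinSketch {
    // Large prime used as the hash modulus (the value Spark's MinHashLSH uses).
    static final int PRIME = 2038074743;

    // Step 1: one min-hash value per hash table for a set of non-zero indices.
    static long[] minHash(Set<Integer> indices, long[][] coeffs) {
        long[] out = new long[coeffs.length];
        for (int t = 0; t < coeffs.length; t++) {
            long min = Long.MAX_VALUE;
            for (int i : indices) {
                long h = ((1L + i) * coeffs[t][0] + coeffs[t][1]) % PRIME;
                min = Math.min(min, h);
            }
            out[t] = min;
        }
        return out;
    }

    // Step 3: Jaccard distance = 1 - |A intersect B| / |A union B|.
    static double jaccardDistance(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        int union = a.size() + b.size() - inter.size();
        return 1.0 - (double) inter.size() / union;
    }

    // Self-join: returns index pairs {i, j} with i < j whose distance is below the threshold.
    static List<int[]> similarityJoin(List<Set<Integer>> rows, int numTables,
                                      double threshold, long seed) {
        Random rnd = new Random(seed);
        long[][] coeffs = new long[numTables][2];
        for (int t = 0; t < numTables; t++) {
            coeffs[t][0] = 1 + rnd.nextInt(PRIME - 1);
            coeffs[t][1] = rnd.nextInt(PRIME - 1);
        }

        // Step 2: the "join on the hashes". Rows that share a (table, hash) bucket
        // become candidate pairs. In Spark this grouping is what triggers the shuffle.
        Map<String, List<Integer>> buckets = new HashMap<>();
        for (int r = 0; r < rows.size(); r++) {
            long[] h = minHash(rows.get(r), coeffs);
            for (int t = 0; t < numTables; t++) {
                buckets.computeIfAbsent(t + ":" + h[t], k -> new ArrayList<>()).add(r);
            }
        }

        Set<Long> seen = new HashSet<>();      // de-duplicates pairs found in several tables
        List<int[]> result = new ArrayList<>();
        for (List<Integer> bucket : buckets.values()) {
            for (int x = 0; x < bucket.size(); x++) {
                for (int y = x + 1; y < bucket.size(); y++) {
                    int i = bucket.get(x);
                    int j = bucket.get(y);
                    if (!seen.add((long) i * rows.size() + j)) {
                        continue;
                    }
                    // Steps 3 and 4: exact distance, then the threshold filter.
                    if (jaccardDistance(rows.get(i), rows.get(j)) < threshold) {
                        result.add(new int[]{i, j});
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<Integer>> rows = Arrays.<Set<Integer>>asList(
            new HashSet<>(Arrays.asList(1, 2, 3)),
            new HashSet<>(Arrays.asList(1, 2, 3)),
            new HashSet<>(Arrays.asList(100, 200)));
        for (int[] p : similarityJoin(rows, 3, 0.5, 42L)) {
            System.out.println(p[0] + " ~ " + p[1]);
        }
    }
}
```

Note how the candidate generation in step 2 grows with the size of each bucket. If many rows hash to the same value, the number of pairs to score grows quadratically, which is one reason the join can appear to hang.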

The join is probably the slow part here as the data is shuffled. So some things to try:

1. Change the partitioning of your input DataFrame.
2. Change spark.sql.shuffle.partitions (the default gives you 200 partitions after a join).
3. Your dataset looks small enough that you could use org.apache.spark.sql.functions.broadcast(dataset) for a map-side join.
4. Are these vectors sparse or dense? The algorithm works better with sparse vectors.

Of these four options, 2 and 3 have worked best for me, always with sparse vectors.
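
For reference, here is a rough sketch of how the four options could look together in Java. It needs spark-sql and spark-mllib on the classpath. The class name LshJoinTuning, the toy rows, the partition count of 8, and the 0.6 threshold are assumptions chosen for illustration, not values from the question, so tune them against your own data.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.broadcast;

public class LshJoinTuning {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("lsh-join-tuning")
            .master("local[*]")
            // Option 2: fewer shuffle partitions than the default 200 (8 is an assumed value).
            .config("spark.sql.shuffle.partitions", "8")
            .getOrCreate();

        // Option 4: build the feature column from sparse vectors (toy data).
        List<Row> rows = Arrays.asList(
            RowFactory.create(0, Vectors.sparse(6, new int[]{0, 1, 2}, new double[]{1.0, 1.0, 1.0})),
            RowFactory.create(1, Vectors.sparse(6, new int[]{1, 2, 3}, new double[]{1.0, 1.0, 1.0})),
            RowFactory.create(2, Vectors.sparse(6, new int[]{4, 5}, new double[]{1.0, 1.0})));
        StructType schema = new StructType(new StructField[]{
            new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
            new StructField("features", new VectorUDT(), false, Metadata.empty())
        });

        // Option 1: set the input partitioning explicitly (8 is an assumed value).
        Dataset<Row> dataset = spark.createDataFrame(rows, schema).repartition(8);

        MinHashLSHModel model = new MinHashLSH()
            .setNumHashTables(3)
            .setInputCol("features")
            .setOutputCol("hashes")
            .fit(dataset);

        // Option 3: hint that one side is small enough to broadcast,
        // so the planner can choose a map-side join instead of a shuffle.
        Dataset<?> joined = model.approxSimilarityJoin(
            dataset, broadcast(dataset), 0.6, "JaccardDistance");
        joined.show();

        spark.stop();
    }
}
```

Calling joined.explain() before show() should tell you whether the planner actually picked a broadcast join for your data.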
