Spark LSH gets stuck forever in the approxSimilarityJoin() function

I am trying to use Spark's LSH implementation to find the nearest neighbours of each user in a large dataset of 50,000 rows with ~5,000 features per row. Here is the relevant code:

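    import org.apache.spark.ml.feature.MinHashLSH;
    import org.apache.spark.ml.feature.MinHashLSHModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    MinHashLSH mh = new MinHashLSH().setNumHashTables(3).setInputCol("features")
                        .setOutputCol("hashes");

    MinHashLSHModel model = mh.fit(dataset);

    Dataset<Row> approxSimilarityJoin = model.approxSimilarityJoin(
            dataset, dataset, config.getJaccardLimit(), "JaccardDistance");

    approxSimilarityJoin.show();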

The job gets stuck in the approxSimilarityJoin() call and never gets past it. How can I solve this?

Collyrium answered 22/2, 2018 at 12:17 Comment(0)

It will finish if you leave it long enough, but there are some things you can do to speed it up. Reviewing the source code, you can see that the algorithm:

  1. hashes the inputs,
  2. joins the two datasets on the hashes,
  3. computes the Jaccard distance using a UDF, and
  4. filters the result with your threshold.
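
To make the four steps concrete, here is a minimal plain-Java sketch of the same idea, without Spark. It is not Spark's implementation: the class name `MinHashJoinSketch`, its method names, and the way rows are represented as sets of active feature indices are assumptions made for illustration, and the hash family is only loosely modeled on the one in the linked source. It also assumes every row has at least one active feature.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration class, not part of Spark.
public class MinHashJoinSketch {
    static final int PRIME = 2038074743; // a large prime used by the hash family

    // Step 1: one min-hash value per hash table for a set of active feature indices.
    static long[] minHashes(Set<Integer> items, long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int t = 0; t < a.length; t++) {
            long min = Long.MAX_VALUE;
            for (int x : items) {
                min = Math.min(min, ((1L + x) * a[t] + b[t]) % PRIME);
            }
            out[t] = min;
        }
        return out;
    }

    // Step 3: Jaccard distance = 1 - |A intersect B| / |A union B| (inputs must be non-empty).
    static double jaccardDistance(Set<Integer> x, Set<Integer> y) {
        Set<Integer> inter = new HashSet<>(x);
        inter.retainAll(y);
        int union = x.size() + y.size() - inter.size();
        return 1.0 - (double) inter.size() / union;
    }

    // Steps 2 and 4: pair up rows that share a (table, hash) bucket, then filter by distance.
    static List<int[]> similarityJoin(List<Set<Integer>> rows, int numTables,
                                      double threshold, long seed) {
        Random rnd = new Random(seed);
        long[] a = new long[numTables];
        long[] b = new long[numTables];
        for (int t = 0; t < numTables; t++) {
            a[t] = 1 + rnd.nextInt(PRIME - 1);
            b[t] = rnd.nextInt(PRIME - 1);
        }
        // Bucket key "table:hash" -> ids of the rows that landed in that bucket.
        Map<String, List<Integer>> buckets = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            long[] h = minHashes(rows.get(i), a, b);
            for (int t = 0; t < numTables; t++) {
                buckets.computeIfAbsent(t + ":" + h[t], k -> new ArrayList<>()).add(i);
            }
        }
        Set<Long> seen = new HashSet<>(); // avoids checking the same pair twice
        List<int[]> result = new ArrayList<>();
        for (List<Integer> ids : buckets.values()) {
            for (int i : ids) {
                for (int j : ids) {
                    if (i < j && seen.add((long) i * rows.size() + j)
                            && jaccardDistance(rows.get(i), rows.get(j)) < threshold) {
                        result.add(new int[] {i, j});
                    }
                }
            }
        }
        return result;
    }
}
```

The nested loop over each bucket shows why the join can dominate the runtime: a bucket holding n rows yields on the order of n² candidate pairs, and each one still needs a distance computation before the threshold filter can discard it.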

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala

The join is probably the slow part here, since the data is shuffled. Some things to try:

  1. Change the partitioning of your input dataframe.
  2. Change spark.sql.shuffle.partitions (the default gives you 200 partitions after a join).
  3. Your dataset looks small enough that you could use org.apache.spark.sql.functions.broadcast(dataset) for a map-side join.
  4. Are these vectors sparse or dense? The algorithm works better with sparse vectors.

Of these four options, 2 and 3 have worked best for me, always with sparse vectors.
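
As a rough sketch of what options 1 to 3 could look like in the asker's Java code: the class and method names (`LshJoinTuning`, `tunedJoin`) are hypothetical, and the partition count of 400 is an arbitrary placeholder that you would need to tune for your cluster. Whether the broadcast hint actually results in a broadcast join may depend on your Spark version and on how large the hashed dataset becomes, so check the physical plan with explain().

```java
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.broadcast;

// Hypothetical helper class, shown only to illustrate the tuning options.
public class LshJoinTuning {

    static Dataset<Row> tunedJoin(SparkSession spark, MinHashLSHModel model,
                                  Dataset<Row> dataset, double threshold) {
        // Option 2: change the number of partitions produced by shuffles
        // (400 is a placeholder, not a recommendation).
        spark.conf().set("spark.sql.shuffle.partitions", "400");

        // Option 1: repartition the input before it is hashed and joined.
        Dataset<Row> left = dataset.repartition(400);

        // Option 3: hint that the right-hand side is small enough to broadcast,
        // so the join can happen map-side instead of through a shuffle.
        Dataset<Row> joined = model
                .approxSimilarityJoin(left, broadcast(dataset), threshold, "JaccardDistance")
                .toDF();

        // Print the physical plan to see which join strategy was chosen.
        joined.explain();
        return joined;
    }
}
```

Calling `tunedJoin(spark, model, dataset, config.getJaccardLimit()).show()` would then replace the original approxSimilarityJoin call.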

Actinoid answered 27/3, 2018 at 15:6 Comment(2)
My job was stuck for the same reason. Option 2 works for me. @Actinoid - Butts
I have a similar use case with one dataset that is a ~320MB .parquet file. The second dataset is pretty small, but when I do a fuzzy match it takes ~150ms per record of the second dataset. I tried #2 above and it didn't help. - Acacia
