I am trying to apply the BucketedRandomProjectionLSH model's method model.approxNearestNeighbors(df, key, n)
to every row of a dataframe, in order to approximately find the top n most similar items for every item. My dataframe has 1 million rows.
My problem is that the computation has to finish within a reasonable time (no more than 2 hours). I've read about the function approxSimilarityJoin(df, df, threshold),
but it takes far too long and doesn't return the number of rows I expect: with a dataframe of 100,000 rows and a very high (permissive) threshold, I get back fewer than 10% as many rows as the dataframe has.
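One possible reason for the low row count, offered as a sketch rather than a diagnosis of this dataset: as far as I understand the implementation, approxSimilarityJoin only computes distances for pairs whose hash values match in at least one hash table, so pairs that never share a bucket are dropped before the threshold is applied, however permissive it is. The pure-Python toy below imitates that behavior; it is not the Spark API. make_hash_fns and similarity_join are hypothetical helper names, and the data, seed, bucket_length and num_tables values are made up for illustration (they stand in for the bucketLength and numHashTables parameters).

```python
import math
import random


def make_hash_fns(dim, num_tables, bucket_length, seed=0):
    """Build hash functions h(x) = floor(x . v / bucket_length), v a random unit vector."""
    rng = random.Random(seed)
    fns = []
    for _ in range(num_tables):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        # Bind v as a default argument so each function keeps its own vector.
        fns.append(
            lambda x, v=v: math.floor(sum(a * b for a, b in zip(x, v)) / bucket_length)
        )
    return fns


def similarity_join(points, fns, threshold):
    """Return (i, j, dist) for i < j that share a bucket in some table and have dist < threshold."""
    candidates = set()
    for h in fns:
        # Group point indices by bucket for this hash table.
        buckets = {}
        for i, p in enumerate(points):
            buckets.setdefault(h(p), []).append(i)
        # Only points in the same bucket become candidate pairs.
        for members in buckets.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    candidates.add((members[a], members[b]))
    result = []
    for i, j in sorted(candidates):
        d = math.dist(points[i], points[j])
        if d < threshold:
            result.append((i, j, d))
    return result


# Made-up data: 200 random points in 5 dimensions.
rng = random.Random(42)
points = [[rng.uniform(0.0, 100.0) for _ in range(5)] for _ in range(200)]
total_pairs = len(points) * (len(points) - 1) // 2

# One hash table, small buckets, and a threshold that accepts any distance.
fns = make_hash_fns(dim=5, num_tables=1, bucket_length=2.0)
found = similarity_join(points, fns, threshold=float("inf"))
print(len(found), "pairs returned out of", total_pairs, "possible pairs")
```

In this toy, only a fraction of all pairs comes back even with an infinite threshold. If the same effect is at work in Spark, a larger bucketLength or more hash tables would be the knobs to try, at the cost of more candidate pairs to compare.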
So I'm thinking about calling approxNearestNeighbors
on every row instead, so that the computation time is almost linear in the number of rows.
How do you apply that function to every row of a dataframe? I can't use a UDF, since the function needs both the model and a dataframe as inputs.
Do you have any suggestions?