Pyspark LSH Followed by Cosine Similarity

I have many users, each with an associated feature vector, and I would like to compute the cosine similarity between every pair of users. Doing this exactly is prohibitive given the size of the data. LSH seems like a good approximation step: as I understand it, it maps users to buckets such that users landing in the same bucket are similar with high probability. In PySpark, for example:

from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

ss = SparkSession.builder.getOrCreate()

dataA = [(0, Vectors.dense([1.0, 1.0]),),
         (1, Vectors.dense([1.0, -1.0]),),
         (4, Vectors.dense([1.0, -1.0]),),
         (5, Vectors.dense([1.1, -1.0]),),
         (2, Vectors.dense([-1.0, -1.0]),),
         (3, Vectors.dense([-1.0, 1.0]),)]
dfA = ss.createDataFrame(dataA, ["id", "features"])

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", bucketLength=1.0, numHashTables=3)
model = brp.fit(dfA)
model.transform(dfA).show(truncate=False)


+---+-----------+-----------------------+
|id |features   |hashes                 |
+---+-----------+-----------------------+
|0  |[1.0,1.0]  |[[-1.0], [0.0], [-1.0]]|
|1  |[1.0,-1.0] |[[-2.0], [-2.0], [1.0]]|
|4  |[1.0,-1.0] |[[-2.0], [-2.0], [1.0]]|
|5  |[1.1,-1.0] |[[-2.0], [-2.0], [1.0]]|
|2  |[-1.0,-1.0]|[[0.0], [-1.0], [0.0]] |
|3  |[-1.0,1.0] |[[1.0], [1.0], [-2.0]] |
+---+-----------+-----------------------+

Any pointers on how best to set bucketLength and numHashTables would be appreciated.

Assuming I have the above with 3 hash tables, how do I determine which buckets to use for the comparisons, given that each row has more than one hash value? I assumed the use of LSH for this task is to group by the value in the "hashes" column and only perform pairwise similarity within each bucket. Is this correct?
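For concreteness, the sketch below is roughly what I had in mind; bucket_key and cosine_sim are hypothetical helpers I would write myself, not Spark functions, and I am not sure whether treating the whole "hashes" value as a single bucket key is the right thing to do when there is more than one hash table:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType

# Hypothetical helper: collapse the list of hash vectors into a single string key.
@F.udf(StringType())
def bucket_key(hashes):
    return "|".join(str(h) for h in hashes)

# Hypothetical helper: cosine similarity between two (non-zero) dense vectors.
@F.udf(DoubleType())
def cosine_sim(v1, v2):
    return float(v1.dot(v2) / (v1.norm(2) * v2.norm(2)))

hashed = model.transform(dfA).withColumn("bucket", bucket_key("hashes"))

# Compare only pairs of rows that share the full bucket key.
pairs = (hashed.alias("a")
         .join(hashed.alias("b"),
               (F.col("a.bucket") == F.col("b.bucket")) & (F.col("a.id") < F.col("b.id")))
         .select(F.col("a.id").alias("idA"),
                 F.col("b.id").alias("idB"),
                 cosine_sim(F.col("a.features"), F.col("b.features")).alias("cos_sim")))
pairs.show()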

Turbo asked 10/6, 2022 at 20:56

I assumed the use of LSH for this task is to group by the value in the "hashes" column and only perform pairwise similarity within each bucket. Is this correct?

Yes. LSH reduces dimensionality while approximately preserving similarity: it hashes your data into buckets, and only items that end up in the same bucket are then compared (i.e. have their distance calculated).

The magic is in tuning the number of buckets and hash functions to reduce the number of false positives and false negatives. There isn't a set number; it depends on your data.

Here r is your bucket size and b is the number of hash functions (or, equivalently, the number of buckets you will be using to detect matches).

The following is from an article that helped me understand what was happening:

Let’s say your signature matrix has 100 rows. Consider 2 cases:

b1 = 10 → r = 10

b2 = 20 → r = 5

In the 2nd case, there is a higher chance for two [vectors] to appear in the same bucket at least once, since they have more opportunities (20 vs. 10) and fewer elements of the signature are compared each time (5 vs. 10).
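To put numbers on that intuition: under the standard banding analysis for signature-based LSH (this is the textbook formula, not anything Spark computes for you), a pair whose signatures agree with probability s collides in at least one of b bands of r rows with probability 1 - (1 - s^r)^b:

# Standard banding analysis: probability that a pair with signature
# similarity s lands in the same bucket in at least one of b bands of r rows.
def collision_probability(s, r, b):
    return 1 - (1 - s ** r) ** b

for s in (0.5, 0.7, 0.9):
    p1 = collision_probability(s, r=10, b=10)  # case 1: 10 bands of 10 rows
    p2 = collision_probability(s, r=5, b=20)   # case 2: 20 bands of 5 rows
    print(f"s={s}: (b=10, r=10) -> {p1:.3f}   (b=20, r=5) -> {p2:.3f}")

For every similarity level, the (b=20, r=5) configuration gives a higher chance of colliding at least once, which is the point the quote is making.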

If you need to join, you can use approxSimilarityJoin and set an acceptable distance threshold. (This is another parameter you need to tune: the distance here is the distance between vectors that have fallen into at least one of the same hash buckets, which makes them likely to be close to each other.)

distance = 300

model.approxSimilarityJoin(df, df2, distance, distCol="EuclideanDistance").select(
    col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("EuclideanDistance")).show()

You can get an idea of what is reasonable for the distance between vectors by reviewing the data (from the join) or by playing around with approxNearestNeighbors. If you want the 10 nearest neighbors, here's how you can find their distances:

NumberOfNeighbors = 10
CandidateVector = Vectors.dense([1.0, 2.0])
model.approxNearestNeighbors(df2, CandidateVector, NumberOfNeighbors).collect()
[Row(id=4, features=DenseVector([2.0, 2.0]), hashes=[DenseVector([1.0])], distCol=1.0)]
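
Since the original goal was cosine similarity, one option (a sketch, not a built-in Spark feature) is to let approxSimilarityJoin generate the candidate pairs using Euclidean distance and then re-score those pairs with your own cosine UDF; the cosine_sim helper below is hypothetical:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Hypothetical helper: cosine similarity between two (non-zero) dense vectors.
@F.udf(DoubleType())
def cosine_sim(v1, v2):
    return float(v1.dot(v2) / (v1.norm(2) * v2.norm(2)))

candidates = model.approxSimilarityJoin(df, df2, distance, distCol="EuclideanDistance")

(candidates
 .select(F.col("datasetA.id").alias("idA"),
         F.col("datasetB.id").alias("idB"),
         cosine_sim(F.col("datasetA.features"),
                    F.col("datasetB.features")).alias("cosine"))
 .filter(F.col("idA") < F.col("idB"))  # if df and df2 are the same users table, keep each pair once
 .show())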
Storied answered 14/6, 2022 at 20:40. Comments (9):
Turbo: In Spark, there is a bucket length and a number of hash tables. Are these the same as 'r' and 'b', respectively? Any insights on how to group by the multiple hash values to decide which comparisons to make, given that there is more than one?
Storied: Respectively, yes. I've updated my answer to try and address your additional questions.
Turbo: Thanks! I was hoping to compute a cosine similarity. My thought was to run LSH and then do so (with a UDF) between all pairs of rows within a bucket. It seems, though, that with multiple hashes there is no single bucket definition to group by before calculating the similarity for all pairs.
Storied: Updated the answer with further explanation.
Storied: Did I help answer your question? Could you mark it correct when you have time, or ask further questions so I can help?
Turbo: Sorry for the delay - yes, except for how to deal with the fact that there are multiple buckets to compare. I guess maybe for a given value of dataA we need to see which buckets it is in and check each?
Storied: model.approxSimilarityJoin does the work for you. Any match in any bucket is worth examining as a close target.
Turbo: It relies on Euclidean distance though, right?
Storied: Yes: spark.apache.org/docs/2.2.3/…
