Pyspark LSH Followed by Cosine Similarity

I have many users, each with an associated feature vector, and I would like to compute the cosine similarity between every pair of users. Doing this exactly is prohibitive given the size of the data. LSH seems like a good approximation step: as I understand it, it maps users to buckets such that users landing in the same bucket are similar with high probability. In PySpark, for example:

from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

ss = SparkSession.builder.getOrCreate()

dataA = [(0, Vectors.dense([1.0, 1.0]),),
         (1, Vectors.dense([1.0, -1.0]),),
         (4, Vectors.dense([1.0, -1.0]),),
         (5, Vectors.dense([1.1, -1.0]),),
         (2, Vectors.dense([-1.0, -1.0]),),
         (3, Vectors.dense([-1.0, 1.0]),)]
dfA = ss.createDataFrame(dataA, ["id", "features"])

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes", bucketLength=1.0, numHashTables=3)
model = brp.fit(dfA)
model.transform(dfA).show(truncate=False)


+---+-----------+-----------------------+
|id |features   |hashes                 |
+---+-----------+-----------------------+
|0  |[1.0,1.0]  |[[-1.0], [0.0], [-1.0]]|
|1  |[1.0,-1.0] |[[-2.0], [-2.0], [1.0]]|
|4  |[1.0,-1.0] |[[-2.0], [-2.0], [1.0]]|
|5  |[1.1,-1.0] |[[-2.0], [-2.0], [1.0]]|
|2  |[-1.0,-1.0]|[[0.0], [-1.0], [0.0]] |
|3  |[-1.0,1.0] |[[1.0], [1.0], [-2.0]] |
+---+-----------+-----------------------+

Any pointers on how best to set bucketLength and numHashTables would be appreciated.

Assuming I have the above with 3 hash tables, how do I determine which buckets to use for the comparisons, given that each row has more than one hash value? I assumed the use of LSH for this task is to group by the value in the "hashes" column and only perform pairwise similarity within each bucket. Is this correct?
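For concreteness, the sketch below is roughly what I had in mind; bucket_key and cosine_sim are hypothetical helpers I would write myself, not Spark functions, and I am not sure whether treating the whole "hashes" value as a single bucket key is the right thing to do when there is more than one hash table:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType

# Hypothetical helper: collapse the list of hash vectors into a single string key.
@F.udf(StringType())
def bucket_key(hashes):
    return "|".join(str(h) for h in hashes)

# Hypothetical helper: cosine similarity between two (non-zero) dense vectors.
@F.udf(DoubleType())
def cosine_sim(v1, v2):
    return float(v1.dot(v2) / (v1.norm(2) * v2.norm(2)))

hashed = model.transform(dfA).withColumn("bucket", bucket_key("hashes"))

# Compare only pairs of rows that share the full bucket key.
pairs = (hashed.alias("a")
         .join(hashed.alias("b"),
               (F.col("a.bucket") == F.col("b.bucket")) & (F.col("a.id") < F.col("b.id")))
         .select(F.col("a.id").alias("idA"),
                 F.col("b.id").alias("idB"),
                 cosine_sim(F.col("a.features"), F.col("b.features")).alias("cos_sim")))
pairs.show()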

Turbo asked 10/6, 2022 at 20:56

I assumed the use of LSH for this task is to group by the value in the "hashes" column and only perform pairwise similarity within each bucket. Is this correct?

Yes. LSH reduces dimensionality while approximately preserving similarity: it hashes your data into buckets, and only items that end up in the same bucket are then compared (i.e. have their distance calculated).

The magic is in tuning the number of buckets and hash functions to reduce the number of false positives and false negatives. There isn't a set number; it depends on your data.

Here r is your bucket size and b is the number of hash functions (or, equivalently, the number of buckets you will be using to detect matches).

The following is from an article that helped me understand what was happening:

Let’s say your signature matrix has 100 rows. Consider 2 cases:

b1 = 10 → r = 10

b2 = 20 → r = 5

In the 2nd case, there is a higher chance for two [vectors] to appear in the same bucket at least once, since they have more opportunities (20 vs. 10) and fewer elements of the signature are compared each time (5 vs. 10).
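To put numbers on that intuition: under the standard banding analysis for signature-based LSH (this is the textbook formula, not anything Spark computes for you), a pair whose signatures agree with probability s collides in at least one of b bands of r rows with probability 1 - (1 - s^r)^b:

# Standard banding analysis: probability that a pair with signature
# similarity s lands in the same bucket in at least one of b bands of r rows.
def collision_probability(s, r, b):
    return 1 - (1 - s ** r) ** b

for s in (0.5, 0.7, 0.9):
    p1 = collision_probability(s, r=10, b=10)  # case 1: 10 bands of 10 rows
    p2 = collision_probability(s, r=5, b=20)   # case 2: 20 bands of 5 rows
    print(f"s={s}: (b=10, r=10) -> {p1:.3f}   (b=20, r=5) -> {p2:.3f}")

For every similarity level, the (b=20, r=5) configuration gives a higher chance of colliding at least once, which is the point the quote is making.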

If you need to join, you can use approxSimilarityJoin and set an acceptable distance threshold. (This is another parameter you need to tune: the distance here is the distance between vectors that have fallen into at least one of the same hash buckets, which makes them likely to be close to each other.)

distance = 300

model.approxSimilarityJoin(df, df2, distance, distCol="EuclideanDistance").select(
    col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("EuclideanDistance")).show()

You can get an idea of what is reasonable for the distance between vectors by reviewing the data (from the join) or by playing around with approxNearestNeighbors. If you want the 10 nearest neighbors, here's how you can find their distances:

NumberOfNeighbors = 10
CandidateVector = Vectors.dense([1.0, 2.0])
model.approxNearestNeighbors(df2, CandidateVector, NumberOfNeighbors).collect()
[Row(id=4, features=DenseVector([2.0, 2.0]), hashes=[DenseVector([1.0])], distCol=1.0)]
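
Since the original goal was cosine similarity, one option (a sketch, not a built-in Spark feature) is to let approxSimilarityJoin generate the candidate pairs using Euclidean distance and then re-score those pairs with your own cosine UDF; the cosine_sim helper below is hypothetical:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Hypothetical helper: cosine similarity between two (non-zero) dense vectors.
@F.udf(DoubleType())
def cosine_sim(v1, v2):
    return float(v1.dot(v2) / (v1.norm(2) * v2.norm(2)))

candidates = model.approxSimilarityJoin(df, df2, distance, distCol="EuclideanDistance")

(candidates
 .select(F.col("datasetA.id").alias("idA"),
         F.col("datasetB.id").alias("idB"),
         cosine_sim(F.col("datasetA.features"),
                    F.col("datasetB.features")).alias("cosine"))
 .filter(F.col("idA") < F.col("idB"))  # if df and df2 are the same users table, keep each pair once
 .show())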
Storied answered 14/6, 2022 at 20:40. Comments (9):
Turbo: In Spark, there is a bucket length and a number of hash tables. Are these the same as 'r' and 'b', respectively? Any insights on how to group by the multiple hash values to decide which comparisons to make, given that there is more than one?
Storied: Respectively, yes. I've updated my answer to try and address your additional questions.
Turbo: Thanks! I was hoping to compute a cosine similarity. My thought was to run LSH and then do so (with a UDF) between all pairs of rows within a bucket. It seems, though, that with multiple hashes there is no single bucket definition to group by before calculating the similarity for all pairs.
Storied: Updated the answer with further explanation.
Storied: Did I help answer your question? Could you mark it correct when you have time, or ask further questions so I can help?
Turbo: Sorry for the delay - yes, except for how to deal with the fact that there are multiple buckets to compare. I guess maybe for a given value of dataA we need to see which buckets it is in and check each?
Storied: model.approxSimilarityJoin does the work for you. Any match in any bucket is worth examining as a close target.
Turbo: It relies on Euclidean distance though, right?
Storied: Yes: spark.apache.org/docs/2.2.3/…
