I have many users, each with an associated feature vector, and I would like to compute the cosine similarity between every pair of users. Computing this exhaustively is prohibitive given the number of users. LSH seems like a good approximation step: as I understand it, it maps users to buckets such that users landing in the same bucket are similar with high probability. In PySpark I have the following example:
from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

ss = SparkSession.builder.getOrCreate()

dataA = [(0, Vectors.dense([1.0, 1.0]),),
         (1, Vectors.dense([1.0, -1.0]),),
         (4, Vectors.dense([1.0, -1.0]),),
         (5, Vectors.dense([1.1, -1.0]),),
         (2, Vectors.dense([-1.0, -1.0]),),
         (3, Vectors.dense([-1.0, 1.0]),)]
dfA = ss.createDataFrame(dataA, ["id", "features"])

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=1.0, numHashTables=3)
model = brp.fit(dfA)
model.transform(dfA).show(truncate=False)
+---+-----------+-----------------------+
|id |features |hashes |
+---+-----------+-----------------------+
|0 |[1.0,1.0] |[[-1.0], [0.0], [-1.0]]|
|1 |[1.0,-1.0] |[[-2.0], [-2.0], [1.0]]|
|4 |[1.0,-1.0] |[[-2.0], [-2.0], [1.0]]|
|5 |[1.1,-1.0] |[[-2.0], [-2.0], [1.0]]|
|2 |[-1.0,-1.0]|[[0.0], [-1.0], [0.0]] |
|3 |[-1.0,1.0] |[[1.0], [1.0], [-2.0]] |
+---+-----------+-----------------------+
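As a related aside, I see that the fitted model exposes approxSimilarityJoin, which joins two datasets and keeps pairs whose Euclidean distance (not cosine similarity) falls below a threshold. A minimal sketch, where the threshold of 2.0 is just an arbitrary value for this toy data:

from pyspark.sql.functions import col

# Self-join dfA against itself; LSH buckets are used internally to prune
# candidate pairs before the exact Euclidean distance filter is applied.
pairs = model.approxSimilarityJoin(dfA, dfA, 2.0, distCol="EuclideanDistance")
(pairs.filter(col("datasetA.id") < col("datasetB.id"))  # drop self/mirrored pairs
      .select(col("datasetA.id").alias("idA"),
              col("datasetB.id").alias("idB"),
              "EuclideanDistance")
      .show())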
Any pointers on how best to set bucketLength and numHashTables would be appreciated.
Assuming I have the above with 3 hash tables: given that there is more than one table, how do I determine the buckets within each table over which to calculate the cosine similarity? I assumed that the point of LSH for this task is to group by the values in the "hashes" column and perform pairwise similarity only within each group. Is this correct?
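To make my assumption concrete, here is roughly the grouping I had in mind (a sketch, assuming Spark >= 3.0 for vector_to_array; since numHashTables=3, each row's "hashes" is an array of three one-element vectors, as in the output above):

from pyspark.sql.functions import posexplode, col
from pyspark.ml.functions import vector_to_array

hashed = model.transform(dfA)
# One row per (user, hash table); pull the scalar bucket id out of the
# one-element hash vector so it can serve as a join key.
buckets = (hashed
           .select("id", posexplode("hashes").alias("table", "h"))
           .select("id", "table", vector_to_array("h")[0].alias("bucket")))
# Candidate pairs = users sharing a bucket in ANY of the 3 tables, deduplicated;
# exact cosine similarity would then be computed only for these pairs
# rather than for all n*(n-1)/2 combinations.
cand = (buckets.alias("a")
        .join(buckets.alias("b"), on=["table", "bucket"])
        .filter(col("a.id") < col("b.id"))
        .select(col("a.id").alias("idA"), col("b.id").alias("idB"))
        .distinct())
cand.show()

Is this per-table grouping the intended way to use the multiple hash tables, or should the tables be combined differently?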