Apache Spark Python Cosine Similarity over DataFrames

About

Asked 11/5, 2017 at 17:2 Answered 11/5, 2017 at 17:46

python apache-spark pyspark apache-spark-sql cosine-similarity

For a Recommender System, I need to compute the cosine similarity between all the columns of a whole Spark DataFrame.

In Pandas I used to do this:

import sklearn.metrics as metrics
import pandas as pd

df= pd.DataFrame(...some dataframe over here :D ...)
metrics.pairwise.cosine_similarity(df.T,df.T)

That generates the Similarity Matrix between the columns (since I used the transposition)

Is there any way to do the same thing in Spark (Python)?

(I need to apply this to a matrix made of tens of millions of rows, and thousands of columns, so that's why I need to do it in Spark)

Beeck answered 11/5, 2017 at 17:2 Comment(0)

You can use the built-in columnSimilarities() method on a RowMatrix, that can both calculate the exact cosine similarities, or estimate it using the DIMSUM method, which will be considerably faster for larger datasets. The difference in usage is that for the latter, you'll have to specify a threshold.

Here's a small reproducible example:

from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])

# Convert to RowMatrix
mat = RowMatrix(rows)

# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)

# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
 MatrixEntry(1, 2, 0.998441152599),
 MatrixEntry(0, 1, 0.997463284056)]

Jangro answered 11/5, 2017 at 17:46 Comment(5)

How can I do over the rows instead of the columns? – Freestone 12/10, 2017 at 11:0

@Jangro Do you know how to implement the same in Scala #47010626 – Dualistic 30/10, 2017 at 7:53

Can you interpret the results of the matrixEntry? like what is 0 and 2? – Palatinate 8/1, 2018 at 15:46

First 2 elements of MatrixEntry correspond to row index and column index. spark.apache.org/docs/2.1.0/api/java/org/apache/spark/mllib/… – Michaels 16/8, 2019 at 1:33

@Freestone in order to do it over rows, you need to get a transpose of the RowMatrix. – Carnegie 27/7, 2020 at 11:57

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags