When using the linear_kernel or the cosine_similarity for TfIdfVectorizer I get the error "Kernel died, restarting"

When using the linear_kernel or the cosine_similarity for TfIdfVectorizer, I get the error "Kernel died, restarting".

I am running scikit-learn's TfidfVectorizer and its fit_transform method on some text data, as in the example below, but when I try to calculate the similarity matrix, I get the error "Kernel died, restarting".

This happens whether I use the cosine_similarity or the linear_kernel function:

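from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

tf = TfidfVectorizer(analyzer='word', stop_words='english')
tfidf_matrix = tf.fit_transform(products['ProductDescription'])

# Either of the following calls kills the kernel:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)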

Maybe the problem is the size of my data?

My tfidf matrix has shape (178350, 143529), which should generate a (178350, 178350) cosine_sim matrix.

Headstand answered 10/3, 2018 at 20:52 Comment(1)
If you're trying to find cosine similarity, then why not just do tfidf_matrix * tfidf_matrix.T? - Kizzie

As far as I understand, you want to calculate an N x N similarity table.

In that case (the CSR matrix is quite large), it is hard to calculate the whole table at once. My approach was to call cosine_similarity(tfidf_matrix[index], tfidf_matrix) N times, once per row.
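To see why a single call is unlikely to fit in memory, here is a rough estimate of the size of the dense result. It assumes the output is a dense float64 array (8 bytes per value), which I believe is the default for both cosine_similarity and linear_kernel, so treat the number as an approximation:

```python
# Back-of-the-envelope size of a dense N x N similarity matrix.
n_docs = 178_350        # number of rows in tfidf_matrix, taken from the question
bytes_per_value = 8     # float64; an assumption about the output dtype

dense_bytes = n_docs * n_docs * bytes_per_value
dense_gb = dense_bytes / 1e9
print(f"{n_docs} x {n_docs} float64 matrix: about {dense_gb:.0f} GB")
```

That is roughly 254 GB for the result alone, far more than a typical notebook machine has, which would explain the kernel dying.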

I actually ran this with PySpark:

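def calculate_one_to_all_similarity(index):
    ...
    # Similarity of row `index` against every row; the result has shape (1, N)
    return cosine_similarity(tfidf_matrix[index], tfidf_matrix)

# Apply the per-row computation to every record in the RDD
rdd.map(lambda r: calculate_one_to_all_similarity(r2index[r]))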
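If Spark is not available, the same row-by-row idea can be sketched on a single machine by processing rows in batches. This is only a minimal sketch under one assumption that is not in the question: that you need just the k most similar rows per document rather than the full N x N table. The names top_k_similar, k, batch_size and the tiny corpus are hypothetical, chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_k_similar(tfidf_matrix, k=5, batch_size=1000):
    """Return (indices, scores) of the k most similar rows for every row.

    Only one dense (batch_size, N) block is held in memory at a time.
    """
    n_rows = tfidf_matrix.shape[0]
    k = min(k, n_rows - 1)
    top_idx = np.empty((n_rows, k), dtype=np.int64)
    top_sim = np.empty((n_rows, k), dtype=np.float64)

    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        # Dense block of similarities: rows start..stop against all rows.
        block = cosine_similarity(tfidf_matrix[start:stop], tfidf_matrix)
        # Exclude each row's similarity with itself.
        block[np.arange(stop - start), np.arange(start, stop)] = -1.0
        # Column indices sorted by descending similarity, truncated to k.
        order = np.argsort(-block, axis=1)[:, :k]
        top_idx[start:stop] = order
        top_sim[start:stop] = np.take_along_axis(block, order, axis=1)

    return top_idx, top_sim


# Tiny stand-in corpus (hypothetical data, not from the question).
docs = [
    "red cotton shirt",
    "blue cotton shirt",
    "stainless steel kettle",
    "electric steel kettle",
]
tfidf = TfidfVectorizer(analyzer="word", stop_words="english").fit_transform(docs)
idx, sim = top_k_similar(tfidf, k=1, batch_size=2)
print(idx.ravel())
```

With 178350 rows, each block of 1000 rows is about 1.4 GB as float64, so batch_size would need tuning to the memory you have. The stored result is only N x k values instead of N x N.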
Gaeta answered 9/10, 2020 at 4:7 Comment(0)
