How to efficiently retrieve the top-k most similar documents by cosine similarity in Python?
I am handling one hundred thousand (100,000) documents (mean document length is about 500 terms). For each document, I want to get the top k (e.g. k = 5) most similar documents by cosine similarity. How can I do this efficiently in Python?

Here is what I did:

  1. for each document, do text segmentation, remove stop words, and count term frequencies (tf)
  2. this gives a tf matrix of about 100,000 docs * 600,000 terms
  3. do 1 - pairwise_distances(tf_matrix, metric = "cosine")
  4. for each document, take the top k most similar documents (see the sketch just below this list)
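
Steps 3 and 4 in code look roughly like this (a simplified sketch rather than my exact code; the variable names are just for illustration and tf_matrix is the matrix from step 2):

import numpy as np
from sklearn.metrics import pairwise_distances

# full similarity matrix: n_docs x n_docs, i.e. 100,000 x 100,000 floats
sim = 1 - pairwise_distances(tf_matrix, metric="cosine")
np.fill_diagonal(sim, -1)                  # exclude each document itself
top_k = np.argsort(-sim, axis=1)[:, :5]    # indices of the 5 most similar docs per document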

I ran my code on a 2.5 GHz i5; 12 hours have passed and it is still running. So I want to know how to optimize my code or my procedure.

Here is my thought:

  1. for each document, do feature selection and keep only terms with tf > 1
  2. do clustering first, then compute cosine similarity within each cluster
  3. since I only need the top k similar documents, do I need to compute all pairwise cosine similarities? (a sketch of what I mean is after this list)
  4. Python GPU programming or parallel programming?
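
For point 3, something like scikit-learn's NearestNeighbors over a sparse matrix is what I have in mind (an untested sketch; it assumes the tf matrix from step 2 is stored as a scipy.sparse matrix called tf_sparse):

from sklearn.neighbors import NearestNeighbors

# brute-force cosine search on a sparse matrix; ask for k+1 neighbours because
# each document is (usually) its own nearest neighbour
nn = NearestNeighbors(n_neighbors=6, metric="cosine", algorithm="brute")
nn.fit(tf_sparse)
distances, indices = nn.kneighbors(tf_sparse)
top_k_idx = indices[:, 1:]          # drop the document itself
top_k_sim = 1 - distances[:, 1:]    # cosine similarity = 1 - cosine distance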

So, do you have any good ideas?

Many thanks.


I know there is a similar question, but that's not what I want.


UPDATE1

Thanks to @orange. After profiling, I found that step 2 was the bottleneck! Here is the sample code:

import pandas as pd

def construct_dt_matrix():
    dt_matrix = pd.DataFrame(columns=['docid'])
    docid = 0
    for f in files:
        # text segmentation for f
        # remove stop words
        # word counts stored in cleaned_dict = {'word': tf}
        dt_matrix.loc[docid] = [0] * dt_matrix.shape[1]  # add one row, init all 0
        dt_matrix.set_value(docid, 'docid', docid)
        for key, value in cleaned_dict.items():
            if key not in dt_matrix.columns.values:
                dt_matrix[key] = 0  # add one column, init all 0
            dt_matrix.set_value(docid, key, value)  # bottleneck
        docid += 1

So, the bottleneck is adding new rows and columns to the pandas DataFrame. Any ideas?

Startle answered 24/12, 2015 at 3:44 Comment(6)
Have you tried it on a smaller dataset and perhaps used a profiler to find and optimise the hotspot in your code? Have a look at RunSnakeRun.Ruction
@Ruction Thanks for your advice, I found the bottleneck and have updated the description. Any ideas?Startle
self.dt_matrix.set_value(docid, key, value) looks like a bug. This sets the same value over and over again (to index docid, which is only incremented after cleaned_dict has been iterated over, and column key).Ruction
Perhaps read some tutorials on Pandas. Your understanding of it may not be accurate (many of them explain how it works and why it's fast, which I think is required).Ruction
Sorry, the code was extracted from a class; I have deleted self. The loop is correct: I first add a new row filled with all 0s, and for each key, fill the key column with value. Maybe it is inefficient to add rows and columns like this. Anyway, thanks.Startle
You may find it faster to use CountVectorizer than building your own matrices.Baluchistan
Pandas DataFrames (and the underlying NumPy arrays) are only really fast if you assign whole arrays of data at once; set_value requires a call for each cell in your matrix! You can do dt_matrix = pd.DataFrame(cleaned_dict) and you have a DataFrame with one function call (ignoring Pandas' internal calls).

Try instead:

dt_matrix = pd.DataFrame()

for docid, f in enumerate(files):
    # build cleaned_dict = {'word': tf} for f as before
    dt_matrix_file = pd.DataFrame(cleaned_dict, index=[docid])  # one row per document
    dt_matrix_file['docid'] = docid
    dt_matrix = dt_matrix.append(dt_matrix_file)

This should be orders of magnitude faster.

If you need the NaN cells to be zero, you can do dt_matrix = dt_matrix.fillna(0) (again, one call instead of potentially n * m).
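
If the end goal is just a document-term matrix to feed into cosine similarity, @Baluchistan's suggestion of CountVectorizer is also worth following up: scikit-learn builds the sparse matrix in one pass and you never grow a DataFrame at all. A rough sketch (it assumes each cleaned document can be joined back into a single whitespace-separated string):

from sklearn.feature_extraction.text import CountVectorizer

# cleaned_docs: one whitespace-joined string per document, after segmentation and stop-word removal
vectorizer = CountVectorizer()
tf_sparse = vectorizer.fit_transform(cleaned_docs)   # scipy.sparse matrix, n_docs x n_terms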

Ruction answered 24/12, 2015 at 6:11 Comment(2)
Thanks first. I tried DataFrame.append(); it is really faster than set_value, but not that much faster. Inspired by you: can we first get all the column names and then just add new rows to the DataFrame? Maybe append needs a join, so it still takes some time.Startle
It's not just append that makes it faster, but also the creation of the DataFrame and avoiding iterating over the dict.Ruction
