I am handling 100,000 documents (mean document length is about 500 terms). For each document, I want to get the top k (e.g. k = 5) most similar documents by cosine similarity. How can I do this efficiently in Python?
Here is what I did:
- for each document, do text segmentation, remove stop words, and count term frequencies (tf)
- this yields a tf matrix of about 100,000 docs * 600,000 terms
- compute 1 - pairwise_distances(tf_matrix, metric="cosine") (see the sketch after this list)
- for each document, take the top k most similar documents
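For concreteness, steps 3 and 4 look roughly like this sketch (toy-sized; the random matrix is just a stand-in for the real tf matrix):

import numpy as np
from sklearn.metrics import pairwise_distances

# stand-in for the real 100,000 * 600,000 term-frequency matrix
tf_matrix = np.random.randint(0, 3, size=(10, 50))

similarity = 1 - pairwise_distances(tf_matrix, metric="cosine")
np.fill_diagonal(similarity, -1.0)                      # exclude each doc itself
top_k = np.argsort(similarity, axis=1)[:, ::-1][:, :5]  # top 5 per document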
I ran my code on an i5 at 2.5 GHz; 12 hours have passed and it is still running. So I want to know how to optimize my code or procedure.
Here are my thoughts:
- for each document, do feature selection: keep only terms with tf > 1
- do clustering first, then compute cosine similarity only within each cluster
- since I just need the top k similar documents, do I really need to compute all pairwise cosine similarities? (see the sketch after this list)
- Python GPU programming or parallel programming?
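To illustrate the third point: something like the sketch below (my own untested idea, not verified at full scale) would avoid ever holding the full 100,000 * 100,000 similarity matrix. Since cosine similarity of L2-normalized rows is just a dot product, the sparse tf matrix can be multiplied against itself one block of rows at a time, keeping only the k best per row with np.argpartition:

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize

def top_k_cosine(tf_matrix, k=5, block_size=1000):
    """Top-k cosine neighbours per row, computed block by block."""
    X = normalize(sp.csr_matrix(tf_matrix))  # unit-length rows: dot product = cosine
    n = X.shape[0]
    top = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        sims = (X[start:stop] * X.T).toarray()                      # one block of similarities
        sims[np.arange(stop - start), np.arange(start, stop)] = -1  # drop self-similarity
        part = np.argpartition(sims, -k, axis=1)[:, -k:]            # k best per row, unordered
        for i, row in enumerate(part):
            top[start + i] = row[np.argsort(sims[i, row])[::-1]]    # sort those k descending
    return top

Each block needs only block_size * n floats in memory, and argpartition avoids fully sorting every row.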
So, do you have any good ideas?
Many thanks.
I know there is a similar question, but that's not what I want.
UPDATE 1
Thanks to @orange. After profiling, I found that step 2 was the bottleneck! Here is the sample code:
import pandas as pd

def construct_dt_matrix():
    dt_matrix = pd.DataFrame(columns=['docid'])
    docid = 0
    for f in files:
        # text segmentation for f
        # remove stop words
        # word counts stored in cleaned_dict = {'word': tf}
        dt_matrix.loc[docid] = [0] * dt_matrix.shape[1]  # add one row, init all 0
        dt_matrix.set_value(docid, 'docid', docid)       # (set_value is deprecated in newer pandas; .at is the modern equivalent)
        for key, value in cleaned_dict.items():
            if key not in dt_matrix.columns.values:
                dt_matrix[key] = 0                       # add one column, init all 0
            dt_matrix.set_value(docid, key, value)       # bottleneck
        docid += 1
So, the bottleneck is adding new rows and columns to pandas. Any ideas?
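One direction I am considering (a sketch, not my current code): collect each document's cleaned_dict in a list and let scikit-learn's DictVectorizer build the whole sparse matrix in one pass, instead of growing a DataFrame cell by cell:

from sklearn.feature_extraction import DictVectorizer

# hypothetical: one cleaned_dict per document, gathered during preprocessing
word_counts = [{'cat': 2, 'dog': 1}, {'dog': 3, 'fish': 1}]
vectorizer = DictVectorizer()
dt_matrix = vectorizer.fit_transform(word_counts)  # scipy.sparse matrix, docs x terms
terms = vectorizer.get_feature_names()             # column labels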
Comments:

self.dt_matrix.set_value(docid, key, value) looks like a bug. This sets the same value over and over again (to index docid, which gets incremented after cleaned_dict was iterated over, and column key). – Ruction

@Ruction The loop is correct: I first add a new row filled with all 0, then for each key I fill the key column with value. Maybe it is inefficient to add rows and columns like this. Anyway, thanks. – Startle

It is better to use CountVectorizer than building your own matrices. – Baluchistan
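For reference, a minimal sketch of that CountVectorizer suggestion (docs is a hypothetical list with one preprocessed string per document):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog ran", "a dog and a cat"]  # toy stand-in
vectorizer = CountVectorizer()
tf_matrix = vectorizer.fit_transform(docs)  # sparse CSR document-term matrix

This builds the sparse document-term matrix directly from the raw text, with no per-cell DataFrame writes.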