I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
n = 50
df = pd.read_pickle('my_df.pickle')
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t
Out[15]:
<21175x201380 sparse matrix of type '<class 'numpy.float64'>'
with 6055621 stored elements in Compressed Sparse Row format>
I have tried following the example in this post, although my aim is not to display the features but simply to select the top n for each document before training. However, I get a MemoryError, as my data is too large to be converted into a dense matrix.
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
Traceback (most recent call last):
File "<ipython-input-16-e0a74c393ca5>", line 1, in <module>
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 943, in toarray
out = self._process_toarray_args(order, out)
File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 1130, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Is there any way to do what I want without working with a dense representation (i.e. without the toarray() call) and without reducing the feature space much more than I already have (with max_df)?
Note: the max_features parameter is not what I want, as it only considers "the top max_features ordered by term frequency across the corpus" (docs here), whereas what I want is a document-level ranking.
EDIT: I wonder if the best way to address this problem is to set the values of all features except the n-best to zero. I say this because the vocabulary has already been calculated, so feature indices must remain the same, as I will want to use them for other purposes (e.g. to visualise the actual words that correspond to the n-best features).
A colleague wrote some code to retrieve the indices of the n highest-ranked features:
n = 2
tops = np.zeros((df_t.shape[0], n), dtype=int)  # store the top indices in a new array
for ind in range(df_t.shape[0]):
    # for each row (i.e. document), argsort the negated scores (argsort is ascending) and slice the top n
    tops[ind,] = np.argsort(-df_t[ind].toarray())[0, 0:n]
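For the visualisation use case mentioned in the EDIT, the indices in tops can already be mapped back to the actual words. A minimal sketch, assuming scikit-learn 1.0+ (older versions expose get_feature_names() instead of get_feature_names_out()):
feature_names = tfidfvectorizer.get_feature_names_out()  # vocabulary terms, indexed by feature
for ind in range(3):  # inspect the first few documents
    print(ind, [feature_names[i] for i in tops[ind]])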
But from there, I would need to either:
- retrieve the list of remaining (i.e. lowest-ranked) indices and modify the values "in place", or
- loop through the original matrix (df_t) and set all values to 0 except for the n best indices in tops.
There is a post here explaining how to work with a csr_matrix, but I'm not sure how to put this into practice to get what I want.
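For what it's worth, here is a minimal sketch of the second option, under the assumption that df_t stays in CSR format: the stored values of row i live in data[indptr[i]:indptr[i+1]], so everything below the top n can be zeroed within that slice and then dropped with eliminate_zeros(). The keep_top_n name is just illustrative:
def keep_top_n(matrix, n):
    # Zero out all but the n largest values in each row of a CSR matrix,
    # without ever densifying it; feature (column) indices are unchanged.
    result = matrix.copy()
    for i in range(result.shape[0]):
        start, end = result.indptr[i], result.indptr[i + 1]
        row_data = result.data[start:end]  # a view, so assignments modify result.data
        if row_data.size > n:
            # positions (within this row's stored values) of everything below the top n
            losers = np.argpartition(row_data, -n)[:-n]
            row_data[losers] = 0
    result.eliminate_zeros()  # remove the explicit zeros from storage
    return result

df_t_top = keep_top_n(df_t, n)
Since only the stored nonzeros are touched, memory use stays proportional to the number of stored elements, and the surviving feature indices still line up with the vectorizer's vocabulary.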
SelectKBest is not what I am after, as it seems to calculate the k best features across the entire corpus (so documents which don't contain any of the k-best terms are represented with just zeros). What I want is to rank the features for each document in descending order of TFIDF score and then select the top k (like doing a sort then slice with a list).
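To make that concrete, a toy illustration of the per-document sort-then-slice behaviour on a small dense array (small enough that density is not an issue here):
toy = np.array([[0.1, 0.9, 0.3],
                [0.7, 0.0, 0.2]])
top_k = np.argsort(-toy, axis=1)[:, :2]  # per-row ranking, unlike SelectKBest's corpus-wide scoring
# top_k -> [[1, 2], [0, 2]]: each document keeps its own best features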