Select top n TFIDF features for a given document

I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
n = 50

df = pd.read_pickle('my_df.pickle')
df_t = tfidfvectorizer.fit_transform(df['text'])

df_t
Out[15]: 
<21175x201380 sparse matrix of type '<class 'numpy.float64'>'
    with 6055621 stored elements in Compressed Sparse Row format>

I have tried following the example in this post, although my aim is not to display the features, but just to select the top n for each document before training. But I get a memory error as my data is too large to be converted into a dense matrix.

df_t_sorted = np.argsort(df_t.toarray()).flatten()[::-1][n]
Traceback (most recent call last):

  File "<ipython-input-16-e0a74c393ca5>", line 1, in <module>
    df_t_sorted = np.argsort(df_t.toarray()).flatten()[::-1][n]

  File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 943, in toarray
    out = self._process_toarray_args(order, out)

  File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 1130, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)

MemoryError

Is there any way to do what I want without working with a dense representation (i.e. without the toarray() call) and without reducing the feature space much more than I already have (with max_df)?

Note: the max_features parameter is not what I want as it only considers "the top max_features ordered by term frequency across the corpus" (docs here) and what I want is a document-level ranking.

EDIT: I wonder if the best way to address this problem is to set the values of all features except the n-best to zero. I say this because the vocabulary has already been calculated, so feature indices must remain the same, as I will want to use them for other purposes (e.g. to visualise the actual words that correspond to the n-best features).

A colleague wrote some code to retrieve the indices of the n highest-ranked features:

n = 2
tops = np.zeros((df_t.shape[0], n), dtype=int)  # store the top indices in a new array
for ind in range(df_t.shape[0]):
    # for each row (i.e. document), argsort the negated values
    # (argsort is ascending) and keep the first n indices
    tops[ind, :] = np.argsort(-df_t[ind].toarray())[0, 0:n]

But from there, I would need to either:

  1. retrieve the list of remaining (i.e. lowest-ranked) indices and modify the values "in place", or
  2. loop through the original matrix (df_t) and set all values to 0 except for the n best indices in tops.

There is a post here explaining how to work with a csr_matrix, but I'm not sure how to put this into practice to get what I want.
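
For reference, here is a tentative sketch of option 2 (untested at scale; keep_top_n is just a working name) that operates directly on the CSR arrays (data / indptr), so nothing is densified and the column indices stay valid for the vocabulary:

import numpy as np

def keep_top_n(mat, n):
    # zero all but the n largest values in each row of a CSR matrix;
    # column indices are untouched, so the vocabulary mapping stays valid
    for row in range(mat.shape[0]):
        start, end = mat.indptr[row], mat.indptr[row + 1]
        row_data = mat.data[start:end]       # a view into mat.data
        if row_data.size > n:
            cut = np.argsort(row_data)[:-n]  # everything below the top n
            row_data[cut] = 0                # modifies mat.data in place
    mat.eliminate_zeros()  # drop the zeroed entries from the sparse structure
    return mat

df_t_top = keep_top_n(df_t, 50)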

Baelbeer answered 24/10, 2018 at 15:7 Comment(0)
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(tokenizer=word_tokenize, ngram_range=(1, 2),
                       binary=True, max_features=50)
TFIDF = vect.fit_transform(df['processed_cv_data'])

The max_features parameter passed to TfidfVectorizer picks out the top 50 features ordered by term frequency across the corpus, not by their tf-idf score. You can view the selected features with:

print(vect.get_feature_names())
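
As a toy illustration of that ranking (the corpus below is made up for the example; on scikit-learn >= 1.0 use get_feature_names_out() instead), max_features keeps the terms with the highest total frequency corpus-wide, regardless of per-document tf-idf:

from sklearn.feature_extraction.text import TfidfVectorizer

# "cherry" occurs 4 times and "apple" 3 times across the corpus,
# so max_features=2 keeps exactly those two terms
docs = ["apple apple banana", "banana cherry", "apple cherry cherry cherry"]
vect = TfidfVectorizer(max_features=2)
vect.fit(docs)
print(vect.get_feature_names())  # ['apple', 'cherry']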
Chapen answered 25/5, 2019 at 20:31 Comment(0)

As you mention, the max_features parameter of the TfidfVectorizer is one way of selecting features.

If you are looking for an alternative way which takes the relationship to the target variable into account, you can use sklearn's SelectKBest. By setting k=50, this will filter your data for the best features. The metric to use for selection can be specified as the parameter score_func.

Example:

from sklearn.feature_selection import SelectKBest

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)

df_t = tfidfvectorizer.fit_transform(df['text'])
df_t_reduced = SelectKBest(k=50).fit_transform(df_t, df['target'])
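
The default score_func is f_classif; for sparse, non-negative tf-idf features, chi2 is a common alternative, for instance:

from sklearn.feature_selection import SelectKBest, chi2

df_t_reduced = SelectKBest(score_func=chi2, k=50).fit_transform(df_t, df['target'])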

You can also chain it in a pipeline:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("feature_reduction", SelectKBest(k=50)),
                     ("classifier", classifier)])  # classifier: any sklearn estimator
Hayrick answered 24/10, 2018 at 15:28 Comment(2)
Thanks for this, but (unless I'm mistaken) SelectKBest isn't what I'm after, as it calculates the k best features across the entire corpus (so documents that contain none of the k-best terms are represented by all zeros). What I want is to rank the features of each document by descending TFIDF score and then select the top k (like doing a sort and then a slice on a list). – Baelbeer
@ogenz, sorry, I didn't realise that was what you wanted to do. I'll leave my answer in case it helps anyone else. – Hayrick

You could break the sparse matrix into chunks to keep memory usage low, then just concatenate the per-chunk results:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train').data

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(data)

n = 10

df_top = [np.argsort(-df_t[i: i+500, :].toarray(), axis=1)[:, :n]  # negate so the sort is descending
          for i in range(0, df_t.shape[0], 500)]

np.concatenate(df_top, axis=0).shape
>> (11314, 10)
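
If you also need the tf-idf scores at those positions (not covered in the answer above), np.take_along_axis can gather them chunk by chunk without densifying the whole matrix:

top_idx = np.concatenate(df_top, axis=0)  # the (n_docs, n) index array from above
top_scores = np.concatenate(
    [np.take_along_axis(df_t[i: i+500, :].toarray(), top_idx[i: i+500], axis=1)
     for i in range(0, df_t.shape[0], 500)],
    axis=0)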
Deary answered 25/10, 2018 at 12:0 Comment(0)