Scikit Learn TfidfVectorizer : How to get top n terms with highest tf-idf score

Asked 11/12, 2015 at 20:39 Answered 24/4, 2021 at 21:40

Solved python scikit-learn nlp nltk tf-idf

I am working on a keyword extraction problem. Consider the very general case:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they lay looking up among the pleasant leaves, they saw that it was a Plane Tree.

"How useless is the Plane!" said one of them. "It bears no fruit whatever, and only serves to litter the ground with leaves."

"Ungrateful creatures!" said a voice from the Plane Tree. "You lie here in my cooling shade, and yet you say I am useless! Thus ungratefully, O Jupiter, do men receive their blessings!"

Our best blessings are often the least appreciated."""

tfs = tfidf.fit_transform(t.split(" "))
str = 'tree cat travellers fruit jupiter'
response = tfidf.transform([str])
feature_names = tfidf.get_feature_names()

for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])

and this gives me

  (0, 28)   0.443509712811
  (0, 27)   0.517461475101
  (0, 8)    0.517461475101
  (0, 6)    0.517461475101
tree  -  0.443509712811
travellers  -  0.517461475101
jupiter  -  0.517461475101
fruit  -  0.517461475101

which is good. For any new document that comes in, is there a way to get the top n terms with the highest tfidf score?

Ahner answered 11/12, 2015 at 20:39 Comment(1)

You probably shouldn't overwrite the Python datatype str. – Piercing 27/9, 2017 at 2:24

You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:

import numpy as np

feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]

This gives me:

array([u'fruit', u'travellers', u'jupiter'], 
  dtype='<U13')

The argsort call is the useful one here (see the NumPy docs for it). We have to do [::-1] because argsort only sorts from small to large. We call flatten to reduce the result to 1-D so that the sorted indices can be used to index the 1-D feature array. Note that the call to flatten only works if you're testing one document at a time.
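If you need the top terms for several documents at once, one option is to sort along each row instead of flattening. This is only a sketch: top_n_per_row and the demo values are made up for illustration, and it assumes the scores are already a dense 2-D array (for example the result of .toarray()).

```python
import numpy as np

def top_n_per_row(scores, feature_names, n=3):
    """Return the n highest-scoring feature names for each row of a dense 2-D array."""
    scores = np.asarray(scores)
    names = np.asarray(feature_names)
    # argsort sorts ascending within each row; reverse the columns to get
    # descending order, then keep the first n column indices per row
    order = np.argsort(scores, axis=1)[:, ::-1][:, :n]
    return [names[row].tolist() for row in order]

# Hypothetical scores for two documents over four terms (illustrative values only)
demo_names = ["fruit", "jupiter", "travellers", "tree"]
demo_scores = [[0.1, 0.0, 0.7, 0.4],
               [0.0, 0.9, 0.2, 0.3]]
print(top_n_per_row(demo_scores, demo_names, n=2))
```

Keep in mind that a row with fewer than n nonzero scores will still return n names, some of them with a score of zero.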

Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n\n"))? Otherwise, each word in the multiline string is being treated as a "document". Splitting on \n\n instead means that we are actually looking at 4 documents (one for each paragraph), which makes more sense when you think about tfidf.
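To see the difference between the two splits, here is a small check using an abbreviated stand-in for the fable (the text variable below is shortened for illustration, not the original string):

```python
# Abbreviated, illustrative text with four paragraphs separated by blank lines
text = """Two Travellers sought the shade of a tree.

"How useless is the Plane!" said one of them.

"Ungrateful creatures!" said a voice from the Plane Tree.

Our best blessings are often the least appreciated."""

by_space = text.split(" ")         # roughly one "document" per word
by_paragraph = text.split("\n\n")  # one document per paragraph

print(len(by_space), len(by_paragraph))
```

With the space split, document frequencies are computed over single words, so the idf part of the score carries very little information.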

Radtke answered 12/12, 2015 at 3:44 Comment(8)

How would I achieve that by using DictVectorizer + TfidfTransformer? – Ignaciaignacio 1/11, 2016 at 23:59

What if we want to list top n terms for each class not for each document? I asked a question here but no response yet! – Carolynecarolynn 30/6, 2017 at 17:10

What do you mean by "for each class"? Say you have documents labeled "A" and documents labeled "B". You could either: (1) calculate "TF-ICF", which gives you the term frequency-inverse class frequency. Do that just by concatenating all of the documents of a class into a single string and doing the normal tfidf process. (2) Alternatively, you could calculate average TFIDF, by adding the class as a column to the TFIDF, creating a pandas dataframe, and then doing tfidf_dataframe.groupby('class_name').mean(). – Radtke 30/6, 2017 at 18:41
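A rough sketch of the concatenation step in option (1): concat_by_class is a hypothetical helper, and the documents and labels are invented. The resulting strings would then be passed to the vectorizer's fit_transform as if each class were one document.

```python
from collections import defaultdict

def concat_by_class(documents, labels):
    """Join all documents that share a label into one string per class."""
    grouped = defaultdict(list)
    for doc, label in zip(documents, labels):
        grouped[label].append(doc)
    # One long "class document" per label, in first-seen label order
    return {label: " ".join(docs) for label, docs in grouped.items()}

# Invented example data
docs = ["plane tree shade", "stock market news", "leaves and fruit"]
labels = ["A", "B", "A"]
class_docs = concat_by_class(docs, labels)
print(class_docs)
```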

Strangely, the last line gives memory errors, while replacing it with top_n = feature_array[tfidf_sorting[:n]] doesn't. – Avertin 25/11, 2018 at 17:1

@Pedram, I asked the same question (#56703744) about the per-class case you mention in your comment above. Do you have an answer to it? – Libreville 21/6, 2019 at 15:22

By the way, @Radtke this line tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1] gives me a memory error which must be because my tf-idf matrix is too big. So I guess that I could do this in batches? – Libreville 21/6, 2019 at 15:23

I haven't looked into this at all, but casting tfidf.get_feature_names() to a numpy.array uses massively more memory than the default Python list. My 300 MB TFIDF model turns into 4+ GB in RAM when I call numpy.array on get_feature_names(), whereas simply using feature_array = tfidf.get_feature_names() works fine and uses very little RAM. – Blasto 5/8, 2019 at 20:30

@Blasto feature_array = tfidf.get_feature_names() worked for me – Kokoruda 28/4, 2020 at 10:24
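For the memory problems raised in these comments, one possible approach is to rank only the nonzero entries of a single sparse row and keep the feature names as a plain Python list. This is a sketch, not a tested fix for those setups: top_n_from_nonzeros is a hypothetical helper, and indices and data stand for the .indices and .data attributes of a one-row SciPy CSR matrix such as response.

```python
import heapq

def top_n_from_nonzeros(indices, data, feature_names, n=3):
    """Pick the n largest stored values of one sparse row and map them to names.

    indices: column index of each stored (nonzero) value
    data: the stored values, aligned with indices
    feature_names: plain list of names, indexed by column
    """
    # nlargest avoids sorting the whole vocabulary; only stored entries are visited
    best = heapq.nlargest(n, zip(data, indices))
    return [feature_names[col] for _, col in best]

# Invented values standing in for response.indices and response.data
demo_names = ["a", "b", "fruit", "c", "d", "tree", "e", "jupiter"]
print(top_n_from_nonzeros([5, 2, 7], [0.2, 0.9, 0.5], demo_names, n=2))
```

Nothing here is converted to a dense array, so memory use scales with the number of nonzero terms in the document rather than with the vocabulary size.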

A solution using the sparse matrix itself (without .toarray()):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus',
    'frequency of words in a document is called term frequency'
]

X = tfidf.fit_transform(corpus)
feature_names = np.array(tfidf.get_feature_names())


new_doc = ['can key words in this new document be identified?',
           'idf is the inverse document frequency calculated for each of the words']
responses = tfidf.transform(new_doc)


def get_top_tf_idf_words(response, top_n=2):
    sorted_nzs = np.argsort(response.data)[:-(top_n+1):-1]
    return feature_names[response.indices[sorted_nzs]]
  
print([get_top_tf_idf_words(response,2) for response in responses])

# [array(['key', 'words'], dtype='<U9'),
#  array(['frequency', 'words'], dtype='<U9')]

Chthonian answered 22/6, 2019 at 7:57 Comment(2)

It also returns repeated words. When I try to use these top n words as my vocabulary in TfidfVectorizer again, it throws a ValueError because there are duplicate words in the vocabulary. How can I get the top n unique words? – Kyrstin 20/4, 2020 at 10:0

Interesting. I am using get_feature_names() to get the feature_names, hence there should not be any duplicates returned by get_top_tf_idf_words. Can you post a new question, with a reproducible example and tag me? – Chthonian 20/4, 2020 at 10:54
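One possible reading of the duplicate-words comment is that the per-document results were merged into a single list: each call returns unique words, but the same word (here 'words') can be a top term in more than one document. Under that assumption, an order-preserving de-duplication might look like this (unique_top_words is a hypothetical helper):

```python
def unique_top_words(per_document_words):
    """Flatten per-document top words into one list, keeping first occurrences only."""
    flat = [word for words in per_document_words for word in words]
    # dict preserves insertion order, so this drops repeats without reordering
    return list(dict.fromkeys(flat))

# The two result lists printed in the answer above
per_doc = [["key", "words"], ["frequency", "words"]]
print(unique_top_words(per_doc))
```

The resulting list has no repeats, so it could be passed as the vocabulary argument of a new vectorizer.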

Here is a quick snippet for that (documents is a list):

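import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def get_tfidf_top_features(documents, n_top=10):
  # Ignore terms found in over 95% of documents or in fewer than 2 documents
  tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
  tfidf = tfidf_vectorizer.fit_transform(documents)
  # Sum each term's tf-idf over all documents, then rank terms high to low
  importance = np.argsort(np.asarray(tfidf.sum(axis=0)).ravel())[::-1]
  tfidf_feature_names = np.array(tfidf_vectorizer.get_feature_names())
  return tfidf_feature_names[importance[:n_top]]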

Oakleil answered 24/4, 2021 at 21:40 Comment(3)

There is a typo in the second line. The first character "t" is missing. – Wellington 9/9, 2021 at 12:15

no_features is missing variable. – Oe 11/3, 2022 at 5:52

Thanks for the notes. Fixed. – Oakleil 7/11, 2022 at 23:31
