Calculate TF-IDF using sklearn for n-grams in python

Asked 5/10, 2017 at 8:18 Answered 12/12, 2022 at 6:12

I have a vocabulary list that includes n-grams, as follows.

myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']

I want to use these words to calculate TF-IDF values.

I also have a dictionary of corpus as follows (key = recipe number, value = recipe).

corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}

I am currently using the following code.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())

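Now I am printing the tokens or n-grams of recipe 1 in the corpus along with their TF-IDF values, as follows.

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
feature_names = tfidf.get_feature_names_out()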
doc = 0
feature_index = tfs[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfs[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
  print(w, s)

The result I get is chocolates 1.0. However, my code does not detect n-grams (bigrams) such as biscuit pudding when calculating TF-IDF values. Please let me know where my code goes wrong.

I want to get the TF-IDF matrix for the myvocabulary terms using the recipe documents in the corpus. In other words, the rows of the matrix represent myvocabulary and the columns represent the recipe documents of my corpus. Please help me.

Dostie asked 5/10, 2017 at 8:18 Comment(1)

Have a look at tokenizer, token_pattern and ngram_range params in the TfidfVectorizer. – Interdictory 5/10, 2017 at 10:31

Try increasing the ngram_range in TfidfVectorizer:

tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english', ngram_range=(1,2))
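The reason this helps is that the default ngram_range is (1, 1), so only single words are ever compared against the vocabulary and the two-word entries can never match. The sketch below illustrates the idea in plain Python. It is not scikit-learn's actual analyzer: the word_ngrams helper is hypothetical, tokenization is simplified to a whitespace split, and stop-word removal is skipped.

```python
def word_ngrams(tokens, ngram_range):
    """Return all word n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
recipe_1 = ("making chocolates biscuit pudding easy first get your "
            "favourite biscuit chocolates")
tokens = recipe_1.split()  # simplified tokenization (assumption)

# Only unigrams are generated, so 'biscuit pudding' is never a candidate.
unigram_hits = [g for g in word_ngrams(tokens, (1, 1)) if g in myvocabulary]

# Unigrams and bigrams are generated, so the two-word entry can match.
bigram_hits = [g for g in word_ngrams(tokens, (1, 2)) if g in myvocabulary]

print(unigram_hits)  # ['chocolates', 'chocolates']
print(bigram_hits)   # ['chocolates', 'chocolates', 'biscuit pudding']
```

If the vocabulary ever contained three-word entries, the upper bound of ngram_range would need to be raised to 3 in the same way.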

Edit: The output of fit_transform is the TF-IDF matrix in sparse format, with one row per document and one column per vocabulary term (so it is actually the transpose of the format you seek). You can print out its contents e.g. like this:

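# Refit with the new vectorizer so that tfs includes the bigram columns
tfs = tfidf.fit_transform(corpus.values())
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
feature_names = tfidf.get_feature_names_out()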
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])

which should yield

('biscuit pudding', 1) 0.646128915046
('chocolates', 1) 0.763228291628
('chocolates', 2) 0.508542320378
('tim tam', 2) 0.861036995944
('chocolates', 3) 0.508542320378
('fresh milk', 3) 0.861036995944
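As a sanity check, the two values for recipe 1 can be reproduced by hand. This is a minimal sketch that assumes the TfidfVectorizer defaults (smooth_idf=True, norm='l2', no sublinear tf scaling); the term counts and document frequencies are read off the corpus in the question, and the variable names are only illustrative.

```python
import math

n_docs = 3

# Number of recipes containing each term (from the corpus in the question)
doc_freq = {"chocolates": 3, "biscuit pudding": 1}

# Raw counts of each vocabulary term in recipe 1
term_counts = {"chocolates": 2, "biscuit pudding": 1}


def smooth_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1


# tf * idf for each term, before normalization
raw = {t: c * smooth_idf(doc_freq[t], n_docs) for t, c in term_counts.items()}

# L2-normalize the document vector so its Euclidean length is 1
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: v / norm for t, v in raw.items()}

for term, weight in weights.items():
    print(term, round(weight, 6))
# chocolates 0.763228
# biscuit pudding 0.646129
```

The same normalization explains the chocolates 1.0 result in the question: when chocolates is the only matching term in a document, dividing its weight by the vector's length always gives exactly 1.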

If the matrix is not large, it might be easier to examine it in dense form. Pandas makes this very convenient:

import pandas as pd
df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
print(df)

This results in

                        1         2         3
tim tam          0.000000  0.861037  0.000000
jam              0.000000  0.000000  0.000000
fresh milk       0.000000  0.000000  0.861037
chocolates       0.763228  0.508542  0.508542
biscuit pudding  0.646129  0.000000  0.000000

Comely answered 5/10, 2017 at 17:45 Comment(1)

Many thanks. It works :) Is there a way to view my TF-IDF matrix? "I want to get the TF-IDF matrix for the myvocabulary terms using the recipe documents in the corpus. In other words, the rows of the matrix represent myvocabulary and the columns represent the recipe documents of my corpus." – Dostie 5/10, 2017 at 23:3

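@user8566323 To get the documents as rows and the vocabulary terms as columns, try using

df = pd.DataFrame(tfs.todense(), index=corpus_index, columns=feature_names)

instead of

df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)

i.e. skip the transpose (T) of the matrix and swap index and columns so that the labels match its shape; keeping index=feature_names on the untransposed 3 x 5 matrix raises a shape error.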

Harrison answered 12/12, 2022 at 6:12 Comment(0)
