Find the tf-idf score of specific words in documents using sklearn

3

7

I have code that runs a basic TF-IDF vectorizer on a collection of documents, returning a sparse matrix of shape D x F, where D is the number of documents and F is the number of terms. No problem.

But how do I find the TF-IDF score of a specific term in a given document? That is, is there some sort of dictionary between terms (in their textual representation) and their column position in the resulting sparse matrix?

Reorientation answered 22/6, 2015 at 9:13 Comment(1)
check the answer, #34449627 - Subdual
11

Yes. See the vocabulary_ attribute of your fitted TF-IDF vectorizer.

In [1]: from sklearn.datasets import fetch_20newsgroups

In [2]: data = fetch_20newsgroups(categories=['rec.autos'])

In [3]: from sklearn.feature_extraction.text import TfidfVectorizer

In [4]: cv = TfidfVectorizer()

In [5]: X = cv.fit_transform(data.data)

In [6]: cv.vocabulary_

It is a dictionary of the form:

{word : column index in array}
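
To get the score of one term in one document, the column index from vocabulary_ can be combined with the document's row index in the sparse matrix. The following is a minimal sketch, not part of the original answer: the toy corpus and the helper name tfidf_score are assumptions made for illustration, and it relies on the default TfidfVectorizer settings (lowercasing, tokens of two or more characters).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used here only for illustration.
docs = ["the car is fast", "the engine is loud", "the car has a new engine"]

cv = TfidfVectorizer()
X = cv.fit_transform(docs)  # sparse matrix, shape (n_documents, n_terms)


def tfidf_score(term, doc_index):
    """Return the TF-IDF score of `term` in document `doc_index` (0.0 if absent)."""
    col = cv.vocabulary_.get(term)  # column index, or None for unknown terms
    if col is None:
        return 0.0
    return float(X[doc_index, col])


print(tfidf_score("car", 0))      # positive: "car" occurs in document 0
print(tfidf_score("car", 1))      # 0.0: "car" does not occur in document 1
print(tfidf_score("bicycle", 0))  # 0.0: "bicycle" is not in the vocabulary
```

Terms are looked up in lowercase here because the vectorizer lowercases the input by default.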

Mou answered 22/6, 2015 at 10:29 Comment(0)
8

This is another solution, using CountVectorizer and TfidfTransformer, that finds the TF-IDF score for a given word:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# our corpus
data = ['I like dog', 'I love cat', 'I interested in cat']

cv = CountVectorizer()

# convert text data into term-frequency matrix
data = cv.fit_transform(data)

tfidf_transformer = TfidfTransformer()

# convert term-frequency matrix into tf-idf
tfidf_matrix = tfidf_transformer.fit_transform(data)

# build a dictionary mapping each word to its weight from idf_
word2tfidf = dict(zip(cv.get_feature_names_out(), tfidf_transformer.idf_))

for word, score in word2tfidf.items():
    print(word, score)

Output:

(u'love', 1.6931471805599454)
(u'like', 1.6931471805599454)
(u'i', 1.0)
(u'dog', 1.6931471805599454)
(u'cat', 1.2876820724517808)
(u'interested', 1.6931471805599454)
(u'in', 1.6931471805599454)
Galleywest answered 28/6, 2018 at 8:48 Comment(1)
This only provides the IDF of terms, not the TF-IDF of terms (TF-IDF is specific to a term and one document in the corpus). - Riparian
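
The distinction raised in this comment can be illustrated on the same three-sentence corpus. This is a sketch under the default scikit-learn settings, not the answer author's code: idf_ holds a single weight per term, while the matrix returned by fit_transform holds one score per (document, term) pair. The variable names corpus, counts, and transformer are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ['I like dog', 'I love cat', 'I interested in cat']

cv = CountVectorizer()
counts = cv.fit_transform(corpus)                 # term-frequency matrix
transformer = TfidfTransformer()
tfidf_matrix = transformer.fit_transform(counts)  # TF-IDF matrix

col = cv.vocabulary_['cat']  # column index of the term "cat"

# One IDF weight per term, shared by the whole corpus.
print('idf:', transformer.idf_[col])

# One TF-IDF score per (document, term) pair; 0 where the term is absent.
for doc_index, text in enumerate(corpus):
    print(doc_index, repr(text), tfidf_matrix[doc_index, col])
```

The score for "cat" differs between the second and third documents even though the IDF weight is identical, because each row is normalized over that document's own terms.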
0

@kinkajou, no, TF and IDF are not the same, but they belong to the same algorithm: TF-IDF, i.e. term frequency-inverse document frequency.
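
How the two parts combine can be sketched by computing them separately and comparing against TfidfTransformer. This assumes scikit-learn's default settings (smooth_idf=True, norm='l2', no sublinear TF), under which the IDF of a term is ln((1 + n) / (1 + df)) + 1 for n documents and document frequency df; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ['I like dog', 'I love cat', 'I interested in cat']

counts = CountVectorizer().fit_transform(corpus)
tf = counts.toarray().astype(float)  # TF: raw term counts per document

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1  # IDF: one weight per term (smoothed)

# TF-IDF: multiply, then L2-normalize each document row.
# Assumes no row is all zeros, which holds for this corpus.
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, reference))  # the two should agree up to floating-point error
```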

Hindu answered 5/7, 2019 at 8:55 Comment(0)
