scikit-learn TfidfVectorizer meaning?

I was reading about scikit-learn's TfidfVectorizer implementation, but I don't understand the output of its methods. For example:

# tfidf_vectorizer is a TfidfVectorizer that was already fitted on other documents (not shown)
new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print(tfidf_vectorizer.vocabulary_)
print(new_term_freq_matrix.todense())

output:

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612
   0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.
   0.4736296   0.          0.          0.        ]]

What is this (e.g. u'me': 8)?

{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}

Is this a matrix or just a vector? I can't understand what the output is telling me:

[[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.          0.68091856  0.          0.          0.51785612  0.51785612
   0.          0.          0.          0.          0.        ]
 [ 0.62276601  0.          0.          0.62276601  0.          0.          0.
   0.4736296   0.          0.          0.        ]]

Could anybody explain these outputs in more detail?

Thanks!

Isidore answered 17/9, 2014 at 23:50 Comment(0)
S
20

TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator.

vocabulary_ is a dictionary that maps each token (word) to its feature index in the matrix. Each unique token gets its own feature index.

What is this (e.g. u'me': 8)?

It tells you that the token 'me' is represented as feature number 8 in the output matrix.

is this a matrix or just a vector?

Each sentence is a vector, so the three sentences you entered give a matrix with 3 rows. The numbers (weights) in each vector are the tf-idf scores of the features. For example, 'julie': 4 tells you that every sentence in which 'Julie' appears has a non-zero tf-idf weight at index 4. You can see this in the 2nd vector:

[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]

The 5th element (index 4, counting from 0) is 0.51785612, the tf-idf score for 'Julie'. For more info about tf-idf scoring read here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
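
To make the lookup concrete, here is a minimal sketch of reading one token's weight out of the matrix. The training documents are hypothetical (the ones used in the question are not shown), so the weights will differ from the output above; only the indexing pattern is the point.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus: the documents the question's vectorizer
# was fitted on are not shown, so these are stand-ins for illustration.
train_docs = [
    "Julie likes basketball",
    "Jane loves baseball",
    "He likes baseball more than basketball",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)

# transform() returns a sparse matrix with one row per input document.
matrix = vectorizer.transform(["Julie likes to play basketball"])

# vocabulary_ maps a token to its column (feature index).
column = vectorizer.vocabulary_["julie"]
weight = matrix[0, column]  # tf-idf weight of 'julie' in document 0
print(column, weight)

# Invert the mapping to list the tokens that have a non-zero weight in row 0.
index_to_token = {idx: tok for tok, idx in vectorizer.vocabulary_.items()}
nonzero_tokens = sorted(index_to_token[int(j)] for j in matrix[0].nonzero()[1])
print(nonzero_tokens)
```

Words that were not seen during fitting ('to' and 'play' here) have no column, so transform() silently ignores them.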

Spend answered 18/9, 2014 at 10:42 Comment(2)
What is the u prefix in the output? I'm using a fresh download of Anaconda/scikit-learn and it is not showing up. Is it no longer displayed in the output? - Eyeleteer
FYI: it marks a unicode string, a distinction that is only shown in Python versions before 3. - Eyeleteer
S
8

tf-idf builds its own vocabulary from the entire set of documents the vectorizer was fitted on. This is what the first line of the output shows (I have sorted it for readability):

{u'baseball': 0, u'basketball': 1, u'he': 2, u'jane': 3, u'julie': 4, u'likes': 5, u'linda': 6, u'loves': 7, u'me': 8, u'more': 9, u'than': 10}

Now take one document and compute its tf-idf vector. The document:

He watches basketball and baseball

and its output,

[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ]

whose positions correspond to the sorted vocabulary,

[baseball basketball he jane julie likes linda loves me more than]

Of the vocabulary words, our document contains only baseball, basketball and he, so the output vector has non-zero tf-idf values for only these three words, each at its position in the sorted vocabulary. The other words in the document (watches, and) are not in the vocabulary and are ignored.
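
The alignment between a row and the vocabulary can be sketched in plain Python. The dictionary and the row are copied from the question's output; the variable names are only illustrative.

```python
# Values copied from the question's output (first document).
vocabulary = {'me': 8, 'basketball': 1, 'julie': 4, 'baseball': 0, 'likes': 5,
              'loves': 7, 'jane': 3, 'linda': 6, 'more': 9, 'than': 10, 'he': 2}
row = [0.57735027, 0.57735027, 0.57735027, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Order the tokens by feature index so that position i names column i.
tokens_by_index = sorted(vocabulary, key=vocabulary.get)

# Keep only the tokens that actually occur in the document (non-zero weight).
labelled = {token: weight for token, weight in zip(tokens_by_index, row) if weight > 0}
print(labelled)
```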

tf-idf is used for tasks such as document classification and ranking in search engines. tf: term frequency (how often a vocabulary word occurs in the document). idf: inverse document frequency (down-weights words that occur in many documents, so words that are rare across the corpus count for more).

Stheno answered 18/1, 2018 at 15:37 Comment(1)
This one explains it better. Thanks, mate. - Ringster
B
1

The method addresses the fact that not all words should be weighted equally. The weights indicate which words are most distinctive for a document and therefore best suited to characterize it.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
new_docs = ['basketball baseball', 'basketball baseball', 'basketball baseball']
new_term_freq_matrix = vectorizer.fit_transform(new_docs)
print(vectorizer.vocabulary_)
print(new_term_freq_matrix.todense())


{'basketball': 1, 'baseball': 0}
[[ 0.70710678  0.70710678]
 [ 0.70710678  0.70710678]
 [ 0.70710678  0.70710678]]

new_docs = ['basketball baseball', 'basketball basketball', 'basketball basketball']
new_term_freq_matrix = vectorizer.fit_transform(new_docs)
print(vectorizer.vocabulary_)
print(new_term_freq_matrix.todense())

{'basketball': 1, 'baseball': 0}
[[ 0.861037    0.50854232]
 [ 0.          1.        ]
 [ 0.          1.        ]]

new_docs = ['basketball basketball baseball', 'basketball basketball', 'basketball basketball']
new_term_freq_matrix = vectorizer.fit_transform(new_docs)
print(vectorizer.vocabulary_)
print(new_term_freq_matrix.todense())


{'basketball': 1, 'baseball': 0}
[[ 0.64612892  0.76322829]
[ 0.          1.        ]
[ 0.          1.        ]]
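
The weights in the second example can be reproduced by hand. The sketch below assumes scikit-learn's default settings (raw counts as tf, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row) and uses a plain whitespace split instead of the library's tokenizer, which is enough for this toy corpus.

```python
import math
from collections import Counter

docs = ["basketball baseball", "basketball basketball", "basketball basketball"]
tokenized = [doc.split() for doc in docs]  # simplification: whitespace tokenization
vocab = sorted({tok for doc in tokenized for tok in doc})  # ['baseball', 'basketball']
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed idf, assuming the defaults described above.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized tf-idf vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

rows = [tfidf_row(doc) for doc in tokenized]
for row in rows:
    print([round(w, 8) for w in row])
```

'baseball' occurs in only one of the three documents, so its idf is higher than that of 'basketball', which occurs in all of them. That is why 'baseball' gets the larger weight in the first row even though both words appear once there.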
Brannon answered 23/1, 2019 at 21:34 Comment(0)
