How does a tf-idf model handle unseen words in test data?

I have read many blogs but was not satisfied with the answers. Suppose I train a tf-idf model on a few documents, for example:

   " John like horror movie."
   " Ryan watches dramatic movies"
    ------------so on ----------

I use this code:

   from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
   count_vect = CountVectorizer()
   tfidf_transformer = TfidfTransformer()
   # twenty_train.data is assumed to be a list of document strings
   X_train_counts = count_vect.fit_transform(twenty_train.data)
   X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
   print(X_train_counts.todense())
   # Gives the count of each word in each document

But it doesn't tell me which word each column belongs to. How can I get the words as headers in the X_train_counts output, and similarly in X_train_tfidf?

So the X_train_tfidf output will be a matrix of tf-idf scores:

     Horror  watch  movie  drama
doc1  score1  --    -----------
doc2   ------------------------

Is this correct?

What does fit do, and what does transform do? The sklearn documentation mentions:

"fit(..) method to fit our estimator to the data and secondly the transform(..) method to transform our count-matrix to a tf-idf representation."

What does "fit our estimator to the data" mean?

Now suppose a new test document arrives:

    " Ron likes thriller movies"

How do I convert this document to tf-idf? We can't convert it directly, right? How should the word "thriller", which does not appear in the training documents, be handled?

Quita answered 14/10, 2019 at 7:2 Comment(1)
Using BPE (Byte Pair Encoding) would be a way to handle out-of-vocabulary items. You'd be using subwords instead of or as well as words. It's not a new technique but has recently been popularized in LLMs. Highspeed

Taking two texts as input:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["John like horror movie","Ryan watches dramatic movies"]

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
# fit_transform learns the vocabulary and idf weights, then vectorizes the training text
X_train_counts = count_vect.fit_transform(text)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(X_train_tfidf.toarray(), columns = count_vect.get_feature_names_out())

Output:

        dramatic    horror    john    like    movie    movies    ryan    watches
   0    0.0         0.5       0.5     0.5     0.5      0.0       0.0     0.0
   1    0.5         0.0       0.0     0.0     0.0      0.5       0.5     0.5
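
As a rough illustration of what fit and transform each do here, the sketch below separates the two steps. It uses the same two documents; the names `vocab`, `idf`, `df`, and `expected_idf` are introduced only for this example, and the idf formula in the comment assumes the default `smooth_idf=True` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["John like horror movie", "Ryan watches dramatic movies"]

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# fit: learn the vocabulary (word -> column index) from the training text
count_vect.fit(text)
vocab = count_vect.vocabulary_

# fit: learn one idf weight per vocabulary column from the training counts
X_train_counts = count_vect.transform(text)
tfidf_transformer.fit(X_train_counts)
idf = tfidf_transformer.idf_

# transform: apply the learned state; nothing is re-learned here
X_train_tfidf = tfidf_transformer.transform(X_train_counts)

# With the default smooth_idf=True, idf = ln((1 + n_docs) / (1 + df)) + 1,
# where df is the number of training documents containing the word
n_docs = X_train_counts.shape[0]
df = (X_train_counts.toarray() > 0).sum(axis=0)
expected_idf = np.log((1 + n_docs) / (1 + df)) + 1
```

So "fitting the estimator to the data" means storing `vocabulary_` and `idf_`; transform only looks those values up, which is why it can be applied to documents that were not part of training.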

Now, to test it on a new comment, we need to use the transform function. Words that are out of vocabulary are ignored during vectorization.

new_comment = ["ron don't like dramatic movie"]

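# transform (not fit_transform) reuses the vocabulary and idf weights learned from the training text
pd.DataFrame(tfidf_transformer.transform(count_vect.transform(new_comment)).toarray(), columns = count_vect.get_feature_names_out())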


    dramatic    horror  john    like    movie   movies  ryan    watches
0   0.57735      0.0    0.0    0.57735  0.57735   0.0   0.0      0.0
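
To see which words are dropped, one option is to run the vectorizer's own analyzer on the new comment and compare the tokens against the learned vocabulary. This is only a sketch; `tokens`, `known`, and `unseen` are names made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["John like horror movie", "Ryan watches dramatic movies"]
new_comment = ["ron don't like dramatic movie"]

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(count_vect.fit_transform(text))

# Tokenize the new comment the same way the vectorizer does
analyzer = count_vect.build_analyzer()
tokens = analyzer(new_comment[0])

# Split the tokens into those seen during fit and those that were not
known = [t for t in tokens if t in count_vect.vocabulary_]
unseen = [t for t in tokens if t not in count_vect.vocabulary_]

# transform() keeps the training columns; unseen tokens add no column
X_new = tfidf_transformer.transform(count_vect.transform(new_comment))
print("known:", known, "unseen:", unseen, "shape:", X_new.shape)
```

The result still has one column per training word, so a model trained on the training matrix can consume it unchanged.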

If you want to use a specific vocabulary, prepare a list of the words you want, keep appending new words to it, and pass the list to CountVectorizer:

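 vocabulary = ['dramatic', 'movie', 'horror']
 # use lowercase: CountVectorizer lowercases its input by default, so 'Thriller' would never match
 vocabulary.append('thriller')
 count_vect = CountVectorizer(vocabulary = vocabulary)
 count_vect.fit_transform(text)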
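
A minimal sketch of how a fixed vocabulary behaves with the question's test document, assuming the word is added in lowercase. The names `test_doc`, `thriller_col`, and `train_thriller` are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["John like horror movie", "Ryan watches dramatic movies"]
test_doc = ["Ron likes thriller movies"]

# Entries must be lowercase because CountVectorizer lowercases its input
vocabulary = ["dramatic", "movie", "horror", "thriller"]
count_vect = CountVectorizer(vocabulary=vocabulary)

# With a fixed vocabulary, fit_transform only counts; it learns no new words
X_train_counts = count_vect.fit_transform(text)
tfidf_transformer = TfidfTransformer().fit(X_train_counts)

# Column order follows the list that was passed in
thriller_col = vocabulary.index("thriller")
train_thriller = X_train_counts.toarray()[:, thriller_col]

# The test document now gets a non-zero entry in the "thriller" column
X_test = tfidf_transformer.transform(count_vect.transform(test_doc)).toarray()
print(train_thriller, X_test)
```

Because no training document contains "thriller", its idf weight here comes only from the default smoothing. If the weight should reflect real usage, the vectorizer and transformer would need to be refit on a corpus that actually contains the word.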
Osteotomy answered 14/10, 2019 at 13:1 Comment(2)
But as you can see, it will never include a new word like "Thriller". Quita
If we want to include that word too, then how can it be done? Quita