Python TfidfVectorizer throwing "empty vocabulary; perhaps the documents only contain stop words"

4

21

I'm trying to use scikit-learn's TfidfVectorizer to transform a corpus of text. However, when I call fit_transform on it, I get ValueError: empty vocabulary; perhaps the documents only contain stop words.

In [69]: TfidfVectorizer().fit_transform(smallcorp)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-ac16344f3129> in <module>()
----> 1 TfidfVectorizer().fit_transform(smallcorp)

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
   1217         vectors : array, [n_samples, n_features]
   1218         """
-> 1219         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1220         self._tfidf.fit(X)
   1221         # X is already a transformed view of raw_documents so

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    778         max_features = self.max_features
    779 
--> 780         vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
    781         X = X.tocsc()
    782 

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    725             vocabulary = dict(vocabulary)
    726             if not vocabulary:
--> 727                 raise ValueError("empty vocabulary; perhaps the documents only"
    728                                  " contain stop words")
    729 

ValueError: empty vocabulary; perhaps the documents only contain stop words

I read through the SO question "Problems using a custom vocabulary for TfidfVectorizer scikit-learn" and tried ogrisel's suggestion of using TfidfVectorizer(**params).build_analyzer()(dataset2) to check the results of the text analysis step. That seems to be working as expected; snippet below:

In [68]: TfidfVectorizer().build_analyzer()(smallcorp)
Out[68]: 
[u'due',
 u'to',
 u'lack',
 u'of',
 u'personal',
 u'biggest',
 u'education',
 u'and',
 u'husband',
 u'to',

Is there something else that I am doing wrong? The corpus I am feeding it is just one giant long string punctuated by newlines.

Thanks!

Januarius answered 5/1, 2014 at 1:0 Comment(1)
I had the same problem and downgraded from v0.19 to 0.18. - Asarum
22

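I guess it's because you just have one string. TfidfVectorizer expects an iterable of documents, so a bare string is iterated character by character, and with the default token pattern (which only keeps tokens of two or more characters) every one-character "document" yields nothing. The build_analyzer() check in the question works because the analyzer takes a single document. Try splitting the string into a list of strings, e.g.: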

In [51]: smallcorp
Out[51]: 'Ah! Now I have done Philosophy,\nI have finished Law and Medicine,\nAnd sadly even Theology:\nTaken fierce pains, from end to end.\nNow here I am, a fool for sure!\nNo wiser than I was before:'

In [52]: tf = TfidfVectorizer()

In [53]: tf.fit_transform(smallcorp.split('\n'))
Out[53]: 
<6x28 sparse matrix of type '<type 'numpy.float64'>'
    with 31 stored elements in Compressed Sparse Row format>
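
To see why the single string fails, here is a rough stdlib-only sketch (Python 3) that mimics the default tokenization. It assumes the default settings (lowercasing plus the token pattern `(?u)\b\w\w+\b`), and the helper name `build_vocabulary` is made up for illustration; it is not scikit-learn's actual code.

```python
import re

# Default token pattern of TfidfVectorizer: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(raw_documents):
    """Hypothetical helper: collect the unique tokens over an iterable of documents."""
    vocabulary = set()
    for doc in raw_documents:
        vocabulary.update(TOKEN_PATTERN.findall(doc.lower()))
    return vocabulary


smallcorp = (
    "Ah! Now I have done Philosophy,\n"
    "I have finished Law and Medicine,\n"
    "And sadly even Theology:\n"
    "Taken fierce pains, from end to end.\n"
    "Now here I am, a fool for sure!\n"
    "No wiser than I was before:"
)

# Iterating over a string yields single characters, so every "document"
# is one character long and none can match a two-character token.
vocab_from_string = build_vocabulary(smallcorp)

# Splitting on newlines yields six real documents.
vocab_from_lines = build_vocabulary(smallcorp.split("\n"))

print(len(vocab_from_string))  # 0
print(len(vocab_from_lines))   # 28
```

The 28 matches the number of columns in the 6x28 matrix above. As far as I know, recent scikit-learn releases reject a bare string with a different ValueError that says an iterable of documents was expected, but the fix is the same.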
Paley answered 5/1, 2014 at 13:6 Comment(2)
This should be the correct answer. Any link to the documentation about this? I can't find it anywhere. - Ferula
There is an example in this document: scikit-learn.org/stable/modules/… (in section 4.2.3.3, a corpus variable). - Halette
4

In version 0.12, we set the minimum document frequency to 2, which means that only words that appear in at least two documents are considered. For your example to work, you need to set min_df=1. Since 0.13, this is the default setting. So I guess you are using 0.12, right?
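
A small sketch of the min_df effect, assuming Python 3 and a current scikit-learn; the two-document corpus is made up for illustration. The exact error message may differ between versions, so the code only checks that a ValueError is raised.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up): no term occurs in more than one document.
docs = ["apple banana", "cherry date"]

# min_df=1 keeps every term.
X = TfidfVectorizer(min_df=1).fit_transform(docs)
print(X.shape)  # (2, 4)

# min_df=2 requires a term to occur in at least two documents,
# so nothing survives and fitting raises a ValueError.
try:
    TfidfVectorizer(min_df=2).fit_transform(docs)
    pruned_everything = False
except ValueError as exc:
    pruned_everything = True
    print(exc)
```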

Paryavi answered 6/1, 2014 at 9:20 Comment(0)
0

Alternatively, if you insist on having only one string, you can wrap it in a tuple. Instead of:

smallcorp = "your text"

put it in a one-element tuple (note the trailing comma):

In [22]: smallcorp = ("your text",)
In [23]: tf.fit_transform(smallcorp)
Out[23]: 
<1x2 sparse matrix of type '<type 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>
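
If the input may be either one string or a collection of strings, a small guard can do the wrapping for you. This is only a sketch in Python 3; `as_corpus` is a hypothetical helper name, not part of scikit-learn.

```python
def as_corpus(text_or_docs):
    """Hypothetical helper: always return a list of documents."""
    if isinstance(text_or_docs, str):
        # A bare string is one document, not an iterable of documents.
        return [text_or_docs]
    return list(text_or_docs)


print(as_corpus("your text"))          # ['your text']
print(as_corpus(("first", "second")))  # ['first', 'second']
```

The result can then be passed on, e.g. tf.fit_transform(as_corpus(smallcorp)).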
Feminacy answered 28/4, 2016 at 23:38 Comment(0)
0

I had the same problem. Converting the list of ints to a list of strings with str(nums) didn't help, but I then converted it to:

['d' + str(num) for num in numbers]  # 'd' is an arbitrary letter that marks each value as a string

This helped.
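
One possible explanation, assuming the numbers were single digits (the answer doesn't say): the default token pattern drops one-character tokens, so "5" disappears while "d5" survives. If that is the cause, relaxing token_pattern is another option. A sketch with made-up data, Python 3:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [str(n) for n in [1, 2, 3]]  # made-up data: single-digit numbers as strings

# The default pattern needs 2+ characters per token, so "1", "2", "3" all vanish.
try:
    TfidfVectorizer().fit_transform(docs)
    default_failed = False
except ValueError:
    default_failed = True

# A relaxed pattern keeps single-character tokens.
X = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)

print(default_failed)  # True
print(X.shape)         # (3, 3)
```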

Laban answered 3/11, 2020 at 11:25 Comment(0)
