adding words to stop_words list in TfidfVectorizer in sklearn

I want to add a few more words to the stop_words of TfidfVectorizer. I followed the solution in Adding words to scikit-learn's CountVectorizer's stop list. My stop word list now contains both the 'english' stop words and the stop words I specified. But TfidfVectorizer still does not accept my list of stop words, and I can still see those words in my feature list. Below is my code:

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# my_words holds the extra words I want excluded
my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)

vectorizer = TfidfVectorizer(analyzer='word', max_df=0.95, lowercase=True,
                             stop_words=set(my_stop_words), max_features=15000)
X = vectorizer.fit_transform(documents)  # documents is my list of text documents

I have also tried setting stop_words in TfidfVectorizer as stop_words=my_stop_words, but it still does not work. Please help.

Monohydric answered 9/11, 2014 at 7:24 Comment(10)
I used your code and ran it as is, and I got the expected result. Can you provide more details?Bemused
I am classifying tweets which contain URLs. The features I extract using SelectKBest contain those URLs in pieces, so I thought of adding those URL pieces to my stop word list so that they get removed from my feature set. I added them as shown above.Monohydric
Here is how my stop word list looks: frozenset(['', 'wA4qNj2o0b', 'all', 'fai5w3nBgo', 'Ikq7p9ElUW', '9W6GbM0MjL', 'four', 'WkOI43bsVj', 'x88VDFBzkO', 'whose', 'YqoLBzajjo', 'NVXydiHKSC', 'HdjXav51vI', 'q0YoiC0QCD', 'to', 'cTIYpRLarr', 'nABIG7dAlr', 'under', '6JF33FZIYU', 'very', 'AVFWjAWsbF'])Monohydric
And here is how my feature set looks: [u'bcvjby2owk', u'cases bcvjby2owk', u'cases dgvsrqaw7p', u'dgvsrqaw7p', u'8dsto3yxi2', u'guardianafrica', u'guardianafrica guardian\xe2', u'guardianafrica guardian\xe2 nickswicks']Monohydric
I can see that none of the stop words appear in the feature list, so the reported behaviour is expected. The method used here to filter these hashes is wrong: if you pass random strings to the vectorizer as stop words, it won't intelligently filter similar strings. Stop words are exact, hard-coded strings to be filtered. Alternatively, you can use a regex (before passing the text to the vectorizer) to filter out all the URLs that are not required. This may solve your problem with URLs.Bemused
I think my example was a bit confusing, sorry about that. I have hardcoded each and every string in the my_stop_words list, yet these strings still pop up in the feature list, just in lowercase, as I have set lowercase=True in TfidfVectorizer.Monohydric
I think I found the problem: it's the lowercase=True parameter. All the strings in the feature list are converted to lowercase, but the strings in my stop word list are still mixed case, so they were not removed from the feature list even though the same strings were present there (a minimal sketch of this fix appears after these comments). Thanks for your help though.Monohydric
@Monohydric It didn't work for me. What version of sklearn are you using?Lailalain
Hey... this was a course project I did in November last year. I even uninstalled sklearn. I don't know how else I can check that version. Sorry.Monohydric
Possible duplicate of Adding words to scikit-learn's CountVectorizer's stop listAntoniettaantonin
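A minimal sketch of the fix discussed in the comments above: with lowercase=True every token is lowercased before the stop word filter runs, so the custom entries have to be lowercased as well (the my_words and tweets values below are placeholder data):

import re
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# placeholder data: URL fragments to drop and a couple of raw tweets
my_words = ['wA4qNj2o0b', 'YqoLBzajjo']
tweets = ['breaking news http://t.co/wA4qNj2o0b', 'more at http://t.co/YqoLBzajjo today']

# lowercase the custom words so they match the lowercased tokens
my_stop_words = text.ENGLISH_STOP_WORDS.union(w.lower() for w in my_words)

vectorizer = TfidfVectorizer(analyzer='word', lowercase=True, stop_words=list(my_stop_words))
X = vectorizer.fit_transform(tweets)
print(vectorizer.vocabulary_)  # the URL fragments no longer show up as features

# alternative suggested in the comments: strip URLs before vectorizing
cleaned = [re.sub(r'http\S+', '', t) for t in tweets]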

This is how you can do it:

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

my_stop_words = text.ENGLISH_STOP_WORDS.union(["book"])

vectorizer = TfidfVectorizer(ngram_range=(1,1), stop_words=my_stop_words)

X = vectorizer.fit_transform(["This is a green apple.", "This is a machine learning book."])

# map each feature to its idf weight (in scikit-learn >= 1.0 the method is get_feature_names_out())
idf_values = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))

# printing the tfidf vectors
print(X)

# printing the vocabulary
print(vectorizer.vocabulary_)

In this example, I created the tfidf vectors for two sample documents:

"This is a green apple."
"This is a machine learning book."

By default, this, is, a, and an are all in the ENGLISH_STOP_WORDS list, and I also added book to the stop word list. This is the output:

(0, 1)  0.707106781187
(0, 0)  0.707106781187
(1, 3)  0.707106781187
(1, 2)  0.707106781187
{'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}

As we can see, the word book is also removed from the list of features because we listed it as a stop word. In other words, TfidfVectorizer did accept the manually added word as a stop word and ignored it when creating the vectors.
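To check this programmatically, the fitted vocabulary can be inspected directly (a small sketch reusing the vectorizer fitted above):

# 'book' never makes it into the vocabulary, while the non-stop words do
print('book' in vectorizer.vocabulary_)    # False
print('apple' in vectorizer.vocabulary_)   # True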

Bounce answered 14/7, 2017 at 6:30 Comment(2)
Is there a way to remove stop words from ENGLISH_STOP_WORDS instead of adding them, e.g. remove 'not'?Corollary
@StamatisTiniakos There should be. ENGLISH_STOP_WORDS is of type: <class 'frozenset'>, so just as an example, you can use this set to create a new list and add or remove words from the list and then pass it to your vectorizer.Bounce
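A small sketch of what that comment describes, removing 'not' from the default list with a set difference (the sample sentence is just for illustration):

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# build a stop word collection without 'not'
my_stop_words = text.ENGLISH_STOP_WORDS.difference({'not'})

vectorizer = TfidfVectorizer(stop_words=list(my_stop_words))
X = vectorizer.fit_transform(['this is not an apple'])
print(vectorizer.vocabulary_)  # {'apple': 0, 'not': 1} -- 'not' is kept as a feature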

This is answered here: https://mcmap.net/q/415419/-adding-words-to-scikit-learn-39-s-countvectorizer-39-s-stop-list

Even though sklearn.feature_extraction.text.ENGLISH_STOP_WORDS is a frozenset, you can make a copy of it, add your own words, and then pass that variable to the stop_words argument as a list.
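A minimal sketch of that suggestion (the extra words are placeholders):

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# copy the frozenset into a plain list and extend it with custom words
my_stop_words = list(text.ENGLISH_STOP_WORDS) + ['myword1', 'myword2']

vectorizer = TfidfVectorizer(stop_words=my_stop_words)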

Storekeeper answered 1/2, 2017 at 14:6 Comment(0)

For use with scikit-learn you can always use a list as well:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# start from NLTK's English stop words and extend with custom ones
stop = list(stopwords.words('english'))
stop.extend('myword1 myword2 myword3'.split())

# corpus is the list of documents to vectorize
vectorizer = TfidfVectorizer(analyzer='word', stop_words=set(stop))
vectors = vectorizer.fit_transform(corpus)
...

The only downside of this method compared to using a set is that the list may end up containing duplicates, which is why I convert it back to a set when passing it as the stop_words argument to TfidfVectorizer.
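If you would rather deduplicate once up front and keep a list, one option is to rely on dict preserving insertion order (Python 3.7+):

# drop duplicates while preserving order, then pass the list directly
stop = list(dict.fromkeys(stop))
vectorizer = TfidfVectorizer(analyzer='word', stop_words=stop)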

Frae answered 9/3, 2020 at 23:54 Comment(0)
