Adding words to scikit-learn's CountVectorizer's stop list

1

37

Scikit-learn's CountVectorizer class lets you pass the string 'english' as the stop_words argument. I want to add some words of my own to this predefined list. Can anyone tell me how to do this?

Tapley asked 24/6, 2014 at 12:19 Comment(2)
Do you mean you want the default 'english' stop_words plus some extras of your own? - Jeromyjerreed
This post has been a life saver. - Hellenism
67

According to the source code for sklearn.feature_extraction.text, the full set of English stop words, ENGLISH_STOP_WORDS (actually a frozenset, imported from stop_words), is exposed through __all__. So if you want to use that set plus some more items, you could do something like:

from sklearn.feature_extraction import text 

stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

(where my_additional_stop_words is any iterable of strings, such as a list; a bare string would be split into its individual characters) and use the result as the stop_words argument. This input to CountVectorizer.__init__ is parsed by _check_stop_list, which will pass the new frozenset straight through.
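For completeness, here is a minimal end-to-end sketch of the approach above. The extra words and the two sample documents are made up for illustration. The combined set is converted to a list before being passed in, on the assumption that newer scikit-learn releases validate stop_words as 'english', a list, or None; on older releases the frozenset itself should also be accepted.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical extra stop words, chosen only for this illustration.
my_additional_stop_words = ["john", "tyler"]

# union() returns a new frozenset; ENGLISH_STOP_WORDS itself is unchanged.
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

# Made-up sample documents.
docs = ["John I hope you like it", "Tyler place is near by"]

# Pass the combined collection itself, not the string 'english'.
# list(...) is an assumption about what newer releases accept (see above).
vectorizer = CountVectorizer(stop_words=list(stop_words))
vectorizer.fit(docs)

# The fitted vocabulary should no longer contain the added words.
print(sorted(vectorizer.vocabulary_))
```

The key point is that the variable holding the combined set is what goes into stop_words; passing 'english' would silently ignore the additions.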

Jeromyjerreed answered 24/6, 2014 at 12:33 Comment(7)
It's interesting to note there are only 318 stop words in the set. Maybe these pre-supplied stop words need to be expanded by the person using them. - Guddle
Works very well with CountVectorizer(stop_words = text.ENGLISH_STOP_WORDS.union(array_example)) - Radiocarbon
I tried to use this code but it did not work for me. Here is a reproducible example:
    my_text = ['John I hope you like it', "Tyler place is near by"]
    stop_words = text.ENGLISH_STOP_WORDS.union("john")
    count_vectorizer = CountVectorizer(stop_words = 'english')
    vec = count_vectorizer.fit(my_text)
    bag_of_words = vec.transform(my_text)
    sum_words = bag_of_words.sum(axis=0)  # sum_words is a 1 x n_words matrix without labels
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    sublst = sorted(words_freq, key = lambda x: x[1], reverse=True)
sublst still has john. - Bessie
@Bessie you're not actually using your custom set of stop words... - Jeromyjerreed
Being new to Python, I am not able to figure out how I can use it. I thought CountVectorizer(stop_words = 'english') means using the stop_words that I have already augmented with my own list. Thanks in advance if you can show in code how to actually use it. - Bessie
@Bessie note that the original stop words are a frozenset, which is immutable. This creates a new set; it doesn't change the old one. - Jeromyjerreed
The updated link to stop_words is: github.com/scikit-learn/scikit-learn/blob/main/sklearn/… (had to comment because the edit queue was overflowing) - Floppy
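The immutability point raised in the comments can be checked with plain Python. The sketch below uses a tiny stand-in frozenset instead of text.ENGLISH_STOP_WORDS (an assumption, so that it runs without scikit-learn installed); a frozenset behaves the same way regardless of its contents. It also shows why union("john") does not add the word "john": a bare string is iterated character by character.

```python
# Small stand-in for text.ENGLISH_STOP_WORDS (an assumption made so the
# sketch runs without scikit-learn installed).
base_stop_words = frozenset(["the", "is", "it"])

# union() builds a new frozenset and leaves the original untouched.
extended = base_stop_words.union(["john"])
print("john" in base_stop_words)  # False
print("john" in extended)         # True

# A bare string is an iterable of characters, so this adds 'j', 'o', 'h', 'n'
# rather than the word 'john'.
by_characters = base_stop_words.union("john")
print("john" in by_characters)    # False
print("j" in by_characters)       # True
```

So the additions need to be wrapped in a list (or another iterable of whole words), and the result of union() has to be stored and passed on explicitly.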
