Scikit-learn's CountVectorizer class lets you pass the string 'english' as the stop_words argument. I want to add some words of my own to this predefined list. Can anyone tell me how to do this?
Adding words to scikit-learn's CountVectorizer's stop list
According to the source code for sklearn.feature_extraction.text, the full list (actually a frozenset, from stop_words) of ENGLISH_STOP_WORDS is exposed through __all__. Therefore, if you want to use that list plus some more items, you could do something like:
from sklearn.feature_extraction import text
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)
(where my_additional_stop_words is any sequence of strings) and use the result as the stop_words argument. This input to CountVectorizer.__init__ is parsed by _check_stop_list, which will pass the new frozenset straight through.
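A minimal end-to-end sketch of the approach described above. The extra words and the two sample documents are made up for illustration. One assumption worth flagging: newer scikit-learn releases validate the type of stop_words and, as far as I can tell, may reject a frozenset, so the combined set is converted to a list here, which should also work on older versions.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical extra stop words, chosen only for this illustration.
my_additional_stop_words = ["john", "tyler"]

# union() returns a new frozenset; ENGLISH_STOP_WORDS itself is left unchanged.
stop_words = ENGLISH_STOP_WORDS.union(my_additional_stop_words)

docs = ["John I hope you like it", "Tyler place is near by"]

# Pass the combined collection itself, not the string 'english'.
# list(...) is an assumption about stricter parameter validation in newer releases.
vectorizer = CountVectorizer(stop_words=list(stop_words))
vectorizer.fit(docs)

# The learned vocabulary should no longer contain the added words.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Because CountVectorizer lowercases input by default, the extra stop words should be given in lowercase as well.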
It's interesting to note that there are only 318 stop words in the set. Maybe these pre-supplied stop words need to be expanded by the person using them. –
Guddle
Works very well with CountVectorizer(stop_words = text.ENGLISH_STOP_WORDS.union(array_example)) –
Radiocarbon
I tried to use this code but it did not work for me. Here is a reproducible example:
my_text = ['John I hope you like it', "Tyler place is near by"]
stop_words = text.ENGLISH_STOP_WORDS.union("john")
count_vectorizer = CountVectorizer(stop_words = 'english')
vec = count_vectorizer.fit(my_text)
bag_of_words = vec.transform(my_text)
sum_words = bag_of_words.sum(axis=0)  # sum_words is a 1xn_words matrix without labels
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
sublst = sorted(words_freq, key = lambda x: x[1], reverse=True)
sublst still has john –
Bessie
@Bessie you're not actually using your custom set of stop words... –
Jeromyjerreed
Being new to python, I am not able to figure out how I can use it. I thought CountVectorizer(stop_words = 'english') means using the stop_words that I already have augmented with my own list. Thx in advance if you can show in code how to actually use it. –
Bessie
@Bessie note that the original stop words are a frozenset, which is immutable. This creates a new set, it doesn't change the old one. –
Jeromyjerreed
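A small sketch of the point made in the comments above, plus one further pitfall in the reproducible example: union() iterates over its argument, so a bare string such as "john" is added one character at a time rather than as a word. The variable names by_chars and by_word are only illustrative.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# A bare string is iterated character by character: adds 'j', 'o', 'h', 'n'.
by_chars = ENGLISH_STOP_WORDS.union("john")

# Wrapping the word in a list adds it as a single entry.
by_word = ENGLISH_STOP_WORDS.union(["john"])

print("john" in by_chars)            # False
print("john" in by_word)             # True
print("john" in ENGLISH_STOP_WORDS)  # False: the original frozenset is unchanged
```

The new set only takes effect if it (rather than the string 'english') is what gets passed as stop_words when constructing the vectorizer.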
the updated link to stop_words is: github.com/scikit-learn/scikit-learn/blob/main/sklearn/… (had to comment because edit queue was overflowing) –
Floppy
'english' stop_words plus some extras of your own? – Jeromyjerreed