Adding words to scikit-learn's CountVectorizer's stop list

Asked 24/6, 2014 at 12:19 Answered 24/6, 2014 at 12:33

Scikit-learn's CountVectorizer class lets you pass a string 'english' to the argument stop_words. I want to add some things to this predefined list. Can anyone tell me how to do this?

Tapley answered 24/6, 2014 at 12:19 Comment(2)

Do you mean you want the default 'english' stop_words plus some extras of your own? – Jeromyjerreed 24/6, 2014 at 12:24

this post has been a life saver. – Hellenism 14/3, 2017 at 17:23

According to the source code for sklearn.feature_extraction.text, the full list of English stop words, ENGLISH_STOP_WORDS (actually a frozenset, defined in stop_words), is exposed through __all__. So if you want to use that list plus some more items, you could do something like:

from sklearn.feature_extraction import text 

stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

(where my_additional_stop_words is any iterable of strings, such as a list; note that a single bare string would be split into its individual characters) and use the result as the stop_words argument. This input to CountVectorizer.__init__ is parsed by _check_stop_list, which will pass the new frozenset straight through.
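To make that concrete, here is a minimal end-to-end sketch. The extra words and the two sample documents are made up for illustration. The list(...) conversion is my own precaution: as far as I know, recent scikit-learn releases validate stop_words more strictly and expect a list rather than a frozenset, and a list works on older versions too.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical extra stop words, chosen only for this illustration.
my_additional_stop_words = ["john", "tyler"]

# union() returns a new frozenset; ENGLISH_STOP_WORDS itself is not modified.
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)

# Made-up sample documents.
docs = ["John I hope you like it", "Tyler place is near by"]

# Assumption: converting to a list keeps this working on releases that
# reject a frozenset for stop_words.
vectorizer = CountVectorizer(stop_words=list(stop_words))
vectorizer.fit(docs)

# The remaining vocabulary no longer contains the added words.
print(sorted(vectorizer.vocabulary_))
```

The key point is that the combined collection is passed to CountVectorizer itself; passing the string 'english' would fall back to the built-in list only.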

Jeromyjerreed answered 24/6, 2014 at 12:33 Comment(7)

it's interesting to note there are only 318 stopwords in the set. Maybe these pre-supplied stopwords need to be expanded by the person using it. – Guddle 18/1, 2016 at 8:39

Works very well with CountVectorizer(stop_words = text.ENGLISH_STOP_WORDS.union(array_example)) – Radiocarbon 2/6, 2020 at 22:10

I tried to use this code but it did not work for me. Here is a reproducible example:

my_text = ['John I hope you like it', "Tyler place is near by"]
stop_words = text.ENGLISH_STOP_WORDS.union("john")
count_vectorizer = CountVectorizer(stop_words='english')
vec = count_vectorizer.fit(my_text)
bag_of_words = vec.transform(my_text)
sum_words = bag_of_words.sum(axis=0)  # sum_words is a 1 x n_words matrix without labels
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
sublst = sorted(words_freq, key=lambda x: x[1], reverse=True)

sublst still has john – Bessie 25/3, 2022 at 18:14

@Bessie you're not actually using your custom set of stop words... – Jeromyjerreed 25/3, 2022 at 18:16

Being new to python, I am not able to figure out how I can use it. I thought CountVectorizer(stop_words = 'english') means using the stop_words that I already have augmented with my own list. Thx in advance if you can show in code how to actually use it. – Bessie 25/3, 2022 at 19:7

@Bessie note that the original stop words are a frozenset, which is immutable. Calling union creates a new set; it doesn't change the old one, so you have to pass the new set to CountVectorizer yourself. – Jeromyjerreed 26/3, 2022 at 9:40
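For anyone hitting the same problem, here is a small sketch of what goes wrong in the example from the comments and one way to fix it. It reuses the sample sentences from that comment; the variable names by_chars and custom_stop_words are my own, and wrapping the result in list(...) is an assumption to stay compatible with releases that only accept a list for stop_words.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

my_text = ["John I hope you like it", "Tyler place is near by"]

# A bare string is iterated character by character, so this adds
# 'j', 'o', 'h', 'n' rather than the word "john".
by_chars = text.ENGLISH_STOP_WORDS.union("john")
print("john" in by_chars)  # False

# Wrapping the word in a list adds it as a whole.
custom_stop_words = text.ENGLISH_STOP_WORDS.union(["john"])
print("john" in custom_stop_words)  # True

# The built-in frozenset is immutable and therefore unchanged.
print("john" in text.ENGLISH_STOP_WORDS)  # False

# Pass the new collection itself, not the string 'english'.
# Assumption: list(...) is used because some releases only accept a list.
count_vectorizer = CountVectorizer(stop_words=list(custom_stop_words))
count_vectorizer.fit(my_text)
print("john" in count_vectorizer.vocabulary_)  # False
```

So there are two separate issues in the original snippet: the word has to be wrapped in a list before the union, and the resulting variable has to be handed to CountVectorizer in place of 'english'.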

the updated link to stop_words is: github.com/scikit-learn/scikit-learn/blob/main/sklearn/… (had to comment because edit queue was overflowing) – Floppy 18/2, 2023 at 15:14
