What is the best way to add/remove stop words with spaCy? I am using the token.is_stop
attribute and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding stop words. Thanks!
You can edit them before processing your text like this (see this post):
>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True
Note: This seems to work <=v1.8. For newer versions, see other answers.
TypeError: an integer is required – Fairchild
Using spaCy 2.0.11, you can update its stop word set in one of the following ways:
To add a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")
To add several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1", "my_new_stopword2"}
To remove a single stopword:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")
To remove several stopwords at once:
import spacy
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}
Note: To see the current set of stopwords, use:
print(nlp.Defaults.stop_words)
Update: It was noted in the comments that this fix only affects the current execution. To persist the change with the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (further described at https://spacy.io/usage/saving-loading).
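A minimal sketch of that save/restore round trip, using a blank English pipeline in place of a loaded model (an assumption to avoid a model download; note that Defaults.stop_words is shared per language class, so within a single process the customization is visible to new pipelines either way):

```python
import tempfile

from spacy.lang.en import English  # blank pipeline; substitute your loaded model

nlp = English()
nlp.Defaults.stop_words.add("my_new_stopword")
nlp.vocab["my_new_stopword"].is_stop = True

# Serialize the whole pipeline, including the vocab, to a directory...
with tempfile.TemporaryDirectory() as path:
    nlp.to_disk(path)
    # ...and restore it later instead of re-applying the customizations
    nlp2 = English().from_disk(path)
    print("my_new_stopword" in nlp2.Defaults.stop_words)
```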
a |= b is equivalent to a = a.union(b). Similarly, the -= operator performs an in-place set difference. The curly-bracket syntax is a simple way to create sets: a = {1, 2, 3} is equivalent to a = set([1, 2, 3]). – Conferva
Note that adding to the defaults alone does not update the is_stop attribute: after nlp.Defaults.stop_words.add('foo'), nlp.vocab['foo'].is_stop still returns False. – Apostatize
Short answer for version 2.0 and above (just tested with 3.4+):
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS) # <- set of Spacy's default stop words
STOP_WORDS.add("your_additional_stop_word_here")
- This loads all stop words as a set.
- You can add your own stop words to STOP_WORDS or use your own list in the first place.
To check whether the is_stop attribute for the stop words is set to True, use this:
# nlp must be a loaded pipeline, e.g. nlp = spacy.load("en_core_web_sm")
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    print(lexeme.text, lexeme.is_stop)
In the unlikely case that the stop words for some reason aren't set to is_stop = True, do this:
for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True
Detailed explanation step by step with links to documentation.
First we import spacy:
import spacy
To instantiate class Language as nlp from scratch, we need to import Vocab and Language. Documentation and example here.
from spacy.vocab import Vocab
from spacy.language import Language
# create new Language object from scratch
nlp = Language(Vocab())
stop_words is a default attribute of class Language and can be set to customize the default language data. Documentation here. You can find spaCy's GitHub repo folder with defaults for various languages here.
For our instance of nlp we get 0 stop words, which is reasonable since we haven't set any language defaults:
print(f"Language instance 'nlp' has {len(nlp.Defaults.stop_words)} default stopwords.")
>>> Language instance 'nlp' has 0 default stopwords.
Let's import English language defaults.
from spacy.lang.en import English
Now we have 326 default stop words.
print(f"The language default English has {len(spacy.lang.en.STOP_WORDS)} stopwords.")
print(sorted(list(spacy.lang.en.STOP_WORDS))[:10])
>>> The language default English has 326 stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']
Let's create a new instance of Language, now with defaults for English. We get the same result.
nlp = English()
print(f"Language instance 'nlp' now has {len(nlp.Defaults.stop_words)} default stopwords.")
print(sorted(list(nlp.Defaults.stop_words))[:10])
>>> Language instance 'nlp' now has 326 default stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']
To check if all words are set to is_stop = True, we iterate over the stop words, retrieve the lexeme from the vocab, and print out the is_stop attribute.
[nlp.vocab[word].is_stop for word in nlp.Defaults.stop_words][:10]
>>> [True, True, True, True, True, True, True, True, True, True]
We can add stopwords to the English language defaults.
spacy.lang.en.STOP_WORDS.add("aaaahhh-new-stopword")
print(len(spacy.lang.en.STOP_WORDS))
# these propagate to our instance 'nlp' too!
print(len(nlp.Defaults.stop_words))
>>> 327
>>> 327
Or we can add new stopwords to the instance nlp. However, these propagate to our language defaults too!
nlp.Defaults.stop_words.add("_another-new-stop-word")
print(len(spacy.lang.en.STOP_WORDS))
print(len(nlp.Defaults.stop_words))
>>> 328
>>> 328
The new stop words are set to is_stop = True.
print(nlp.vocab["aaaahhh-new-stopword"].is_stop)
print(nlp.vocab["_another-new-stop-word"].is_stop)
>>> True
>>> True
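The propagation shown above follows from Defaults being a class-level attribute: all pipelines built from the same language class share one stop word set. A quick check (the word "sharedstopword" is just an illustrative example):

```python
from spacy.lang.en import English

nlp_a = English()
nlp_b = English()

# Defaults lives on the class, so both instances see the same set object
nlp_a.Defaults.stop_words.add("sharedstopword")
print("sharedstopword" in nlp_b.Defaults.stop_words)  # True
```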
For 2.0 use the following:
for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True
lex.is_stop should already be True in the bug-free future. – Klaraklarika
This collects the stop words too :)
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
In the latest version, the following removes the word from the set:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')
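Because lexemes cache the is_stop flag, it is safest to clear the flag on the vocab entry as well as removing the word from the set. A sketch with a blank English pipeline (an assumption to avoid a model download; a loaded model works the same way):

```python
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

nlp = English()  # blank pipeline standing in for a loaded model

STOP_WORDS.discard("not")          # discard() avoids a KeyError if already removed
nlp.vocab["not"].is_stop = False   # also clear the flag on the cached lexeme

doc = nlp("do not go")
print(doc[1].is_stop)  # False
```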
For version 2.3.0: If you want to replace the entire list instead of adding or removing a few stop words, you can do this:
custom_stop_words = set(['the','and','a'])
# First override the stop words set for the language
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words
# Now load your model
nlp = spacy.load('en_core_web_md')
The trick is to assign the stop word set for the language before loading the model. It also ensures that upper/lower-case variations of the stop words are considered stop words.
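The case handling can be checked without downloading en_core_web_md by instantiating a blank pipeline from the same language class (a stand-in assumption; the override works identically before spacy.load):

```python
import spacy

custom_stop_words = set(['the', 'and', 'a'])

# Override the stop word set before creating/loading the pipeline
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words

nlp = cls()  # blank pipeline standing in for spacy.load('en_core_web_md')
doc = nlp("The cat and a dog")
# 'The' counts as a stop word despite the capital T
print([(t.text, t.is_stop) for t in doc])
```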
See the piece of code below:
# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)
len(nlp.Defaults.stop_words)
# Make a list of the words you want to add to the stop words
# (avoid naming it "list", which shadows the built-in)
words = ['apple', 'ball', 'cat']
# Iterate over it in a loop
for item in words:
    # Add the word to the set of stop words. Use lowercase!
    nlp.Defaults.stop_words.add(item)
    # Set the stop_word flag on the lexeme
    nlp.vocab[item].is_stop = True
Hope this helps. You can print length before and after to confirm.
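The before/after length check that the answer suggests can look like this, sketched with a blank English pipeline instead of en_core_web_sm (an assumption to keep it self-contained; the example words are illustrative):

```python
from spacy.lang.en import English

nlp = English()  # stands in for spacy.load('en_core_web_sm')

words = ['apple', 'ball', 'cat']
before = len(nlp.Defaults.stop_words)
for item in words:
    nlp.Defaults.stop_words.add(item)
    nlp.vocab[item].is_stop = True
after = len(nlp.Defaults.stop_words)

# The difference is 3, assuming none of the words were already stop words
print(before, after)
```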
from spacy.en.word_sets import STOP_WORDS – Bordiuk