Add/remove custom stop words with spacy

What is the best way to add/remove stop words with spaCy? I am using the token.is_stop attribute and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding stop words. Thanks!

Fairchild answered 15/12, 2016 at 18:11 Comment(1)
The complete list: from spacy.en.word_sets import STOP_WORDS – Bordiuk

You can edit them before processing your text like this (see this post):

>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True

Note: This seems to work for spaCy <= v1.8. For newer versions, see the other answers.

Nano answered 15/12, 2016 at 19:52 Comment(3)
Ah nice. Thank you! – Fairchild
This solution does not seem to be working anymore with version 1.9.0? I am getting TypeError: an integer is required – Fairchild
@Fairchild The error occurs because the vocab key should be a unicode string (use u"the" instead of "the") – Gardia
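
Per that comment, a minimal sketch of the same edit with unicode keys (the spaCy 1.x-style "en" model is assumed):

import spacy

nlp = spacy.load("en")
# In spaCy 1.9 (notably on Python 2) the vocab lookup expects unicode keys
nlp.vocab[u"the"].is_stop = False
nlp.vocab[u"definitelynotastopword"].is_stop = True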

Using spaCy 2.0.11, you can update its stop-word set in one of the following ways:

To add a single stopword:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")

To add several stopwords at once:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1", "my_new_stopword2"}

To remove a single stopword:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")

To remove several stopwords at once:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}

Note: To see the current set of stopwords, use:

print(nlp.Defaults.stop_words)

Update: It was noted in the comments that this fix only affects the current execution. To update the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (further described at https://spacy.io/usage/saving-loading).
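
A minimal sketch of the save/load round trip the update refers to (the model name and path here are assumptions, not from the answer):

import spacy

nlp = spacy.load("en_core_web_sm")   # assumed model name
nlp.Defaults.stop_words.add("my_new_stopword")

# Persist the whole pipeline to a directory...
nlp.to_disk("./my_custom_pipeline")

# ...and load it back later with spacy.load, which reads the saved pipeline
nlp_reloaded = spacy.load("./my_custom_pipeline")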

Conferva answered 1/8, 2018 at 6:49 Comment(5)
@AustinT It is syntactic sugar for the union of two sets: a |= b is equivalent to a = a.union(b). Similarly, the -= operator performs a set difference. The curly-bracket syntax creates sets concisely: a = {1, 2, 3} is equivalent to a = set([1, 2, 3]). – Conferva
This doesn't actually affect the model. – Viable
I mean that it actually doesn't seem to affect the current execution either. (Maybe I'm running something out of order.) The other method seems foolproof. – Viable
I concur with @fny. While this adds the stop words to nlp.Defaults.stop_words, if you check that word with token.is_stop, you still get False. – Costa
Like others, I've found that this approach does not update is_stop, e.g. nlp.Defaults.stop_words.add('foo'); nlp.vocab['foo'].is_stop returns False – Apostatize
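
A workaround these comments point toward is to flag the lexemes as well after updating the set; a minimal sketch (model name assumed):

import spacy

nlp = spacy.load("en_core_web_sm")   # assumed model name

new_stops = {"my_new_stopword1", "my_new_stopword2"}
nlp.Defaults.stop_words |= new_stops

# Also set the flag on each lexeme so token.is_stop picks up the change
for word in new_stops:
    nlp.vocab[word].is_stop = True

print(nlp("my_new_stopword1 here")[0].is_stop)   # expected: True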

Short answer for version 2.0 and above (just tested with 3.4+):

from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS) # <- set of spaCy's default stop words

STOP_WORDS.add("your_additional_stop_word_here")
  • This loads all stop words as a set.
  • You can add your stop words to STOP_WORDS or use your own list in the first place.

To check whether the is_stop attribute for the stop words is set to True, use this:

for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    print(lexeme.text, lexeme.is_stop)

In the unlikely case that the stop words for some reason aren't set to is_stop = True, do this:

for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True 

Detailed step-by-step explanation with links to the documentation:

First we import spacy:

import spacy

To instantiate the class Language as nlp from scratch, we need to import Vocab and Language. Documentation and example here.

from spacy.vocab import Vocab
from spacy.language import Language

# create new Language object from scratch
nlp = Language(Vocab())

stop_words is a default attribute of the class Language and can be set to customize the default language data. Documentation here. You can find spaCy's GitHub repo folder with defaults for various languages here.

For our instance of nlp we get 0 stop words, which is reasonable since we haven't set any language defaults.

print(f"Language instance 'nlp' has {len(nlp.Defaults.stop_words)} default stopwords.")
>>> Language instance 'nlp' has 0 default stopwords.

Let's import the English language defaults.

from spacy.lang.en import English

Now we have 326 default stop words.

print(f"The language default English has {len(spacy.lang.en.STOP_WORDS)} stopwords.")
print(sorted(list(spacy.lang.en.STOP_WORDS))[:10])
>>> The language default English has 326 stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']

Let's create a new instance of Language, now with defaults for English. We get the same result.

nlp = English()
print(f"Language instance 'nlp' now has {len(nlp.Defaults.stop_words)} default stopwords.")
print(sorted(list(nlp.Defaults.stop_words))[:10])
>>> Language instance 'nlp' now has 326 default stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']

To check whether all words are set to is_stop = True, we iterate over the stop words, retrieve the lexeme from the vocab, and look at the is_stop attribute.

[nlp.vocab[word].is_stop for word in nlp.Defaults.stop_words][:10]
>>> [True, True, True, True, True, True, True, True, True, True]

We can add stopwords to the English language defaults.

spacy.lang.en.STOP_WORDS.add("aaaahhh-new-stopword")
print(len(spacy.lang.en.STOP_WORDS))
# these propagate to our instance 'nlp' too! 
print(len(nlp.Defaults.stop_words))
>>> 327
>>> 327

Or we can add new stopwords to instance nlp. However, these propagate to our language defaults too!

nlp.Defaults.stop_words.add("_another-new-stop-word")
print(len(spacy.lang.en.STOP_WORDS))
print(len(nlp.Defaults.stop_words))
>>> 328
>>> 328

The new stop words are set to is_stop = True.

print(nlp.vocab["aaaahhh-new-stopword"].is_stop)
print(nlp.vocab["_another-new-stop-word"].is_stop)
>>> True
>>> True
Decahedron answered 23/9, 2017 at 13:52 Comment(2)
Did that with version 2.0 and got "ImportError: No module named en.stop_words"... suggestions? – Micamicaela
@Micamicaela Unfortunately I cannot replicate your error. My code still works fine (now even using spaCy 3.4.x). – Decahedron

For 2.0, use the following:

for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True
Upkeep answered 25/3, 2018 at 9:55 Comment(1)
You are showing how to fix a broken model as per this bug/workaround. Whilst it is easy to adapt this for the OP's needs, you could have expanded on why you are writing the code this way: it is currently required because of the bug, but it's an otherwise redundant step, as lex.is_stop should already be True in the bug-free future. – Klaraklarika

This collects the stop words too :)

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
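
For illustration, a small sketch (model name assumed) that uses this set to filter tokens:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")   # assumed model name
doc = nlp("This is just a small example sentence")

# Keep tokens whose lowercased text is not in the stop-word set
content_tokens = [t.text for t in doc if t.text.lower() not in STOP_WORDS]
print(content_tokens)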

Jannette answered 23/8, 2019 at 12:10 Comment(0)

In the latest version, the following removes a word from the set:

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')
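
As the comments on the earlier answer suggest, mutating the set may not reset the flag on lexemes that already exist in a loaded vocab, so a defensive sketch (model name assumed) also clears the lexeme flag:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")   # assumed model name

# Remove the word from the shared set...
STOP_WORDS.discard("not")
nlp.Defaults.stop_words.discard("not")

# ...and clear the flag on the lexeme in case it was already created
nlp.vocab["not"].is_stop = False

print(nlp("not")[0].is_stop)   # expected: False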
Dinse answered 20/9, 2019 at 11:46 Comment(0)

For version 2.3.0: if you want to replace the entire list instead of adding or removing a few stop words, you can do this:

custom_stop_words = set(['the','and','a'])

# First override the stop words set for the language
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words

# Now load your model
nlp = spacy.load('en_core_web_md')

The trick is to assign the stop-word set for the language before loading the model. It also ensures that any upper/lowercase variations of the stop words are treated as stop words.
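
A quick check (assuming the en_core_web_md model from the answer is installed) that case variations are picked up:

import spacy

custom_stop_words = set(['the', 'and', 'a'])

# Override the language defaults before loading the model
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words

nlp = spacy.load('en_core_web_md')

doc = nlp("The cat and a dog")
# Both "The" and "a" should come back flagged as stop words
print([(t.text, t.is_stop) for t in doc])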

Freehand answered 4/3, 2021 at 21:32 Comment(0)

See the piece of code below:

# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)

len(nlp.Defaults.stop_words)

# Make a list of the words you want to add to the stop words
custom_stops = ['apple', 'ball', 'cat']

# Iterate over them in a loop
for item in custom_stops:
    # Add the word to the set of stop words. Use lowercase!
    nlp.Defaults.stop_words.add(item)

    # Set the stop_word flag on the lexeme
    nlp.vocab[item].is_stop = True

Hope this helps. You can print the length before and after to confirm.

Cantankerous answered 3/1, 2023 at 5:19 Comment(0)
