Adding words to nltk stoplist

10

23

I have some code that removes stop words from my data set. The default stop list doesn't remove many of the words I would like it to, so I'm looking to add words to it so that they are removed in this case. The code I'm using to remove stop words is:

word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]

I'm unsure of the correct syntax for adding words and can't seem to find the correct one anywhere. Any help is appreciated. Thanks.

Clepsydra answered 1/4, 2011 at 9:49 Comment(0)
T
29

You can simply use the append method to add words to it:

stopwords = nltk.corpus.stopwords.words('english')
stopwords.append('newWord')

or use extend to add a list of words, as suggested by Charlie in the comments.

stopwords = nltk.corpus.stopwords.words('english')
newStopWords = ['stopWord1','stopWord2']
stopwords.extend(newStopWords)
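To tie this back to the filter in the question, here is a minimal sketch. It assumes a small stand-in list in place of nltk.corpus.stopwords.words('english') so it runs without the corpus downloaded, and the contents of word_list are made up for illustration. It also builds the stop list once instead of calling stopwords.words('english') for every word inside the comprehension, as the question's one-liner does.

```python
# Stand-in for nltk.corpus.stopwords.words('english') (an assumption for
# illustration); swap in the real call once nltk.download('stopwords') has run.
stopwords = ['the', 'is', 'a', 'of']

# Add custom words, as in the answer above.
stopwords.extend(['stopWord1', 'stopWord2'])

# Hypothetical input data, with stray whitespace like the question implies.
word_list = [' the ', 'cat', 'stopWord1 ', 'is', 'here']

# Strip each word once and keep it only if it is not a stop word.
stripped = (raw.strip() for raw in word_list)
word_list2 = [w for w in stripped if w not in stopwords]

print(word_list2)
```

With the stand-in data this keeps only 'cat' and 'here'.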
Takahashi answered 12/9, 2017 at 16:42 Comment(1)
I used your code but then used extend() to add my own list to it: CustomListofWordstoExclude = ['cat','dog'] followed by stopwords.extend(CustomListofWordstoExclude) - Aquarius
T
7
import nltk
stopwords = nltk.corpus.stopwords.words('english')
new_words=('re','name', 'user', 'ct')
for i in new_words:
    stopwords.append(i)
print(stopwords)
Tieck answered 12/2, 2019 at 12:0 Comment(0)
V
3

On my Ubuntu machine, I searched (Ctrl+F) for "stopwords" starting from the root directory. That turned up a folder containing several files. I opened the one named "english", which had barely 128 words, added my words to it, and saved.

Varhol answered 21/3, 2015 at 8:40 Comment(0)
S
2

The English stop word list is a file at nltk/corpus/stopwords/english.txt (I guess it would be there; I don't have nltk on this machine, so the best thing would be to search for 'english.txt' within the nltk repo).

You can just add your new stop words in this file.

Also, try looking at Bloom filters if your stop word list grows to a few hundred words.

Soma answered 1/4, 2011 at 11:11 Comment(3)
Any good English stopword list out there? The nltk one seems pretty poor. - Herbie
@fabrizioM fs1.position2.com/bm/txt/stopwords.txt It was the list I used at my last company. - Soma
@Soma This is a way better list than NLTK's! Thanks! - Bede
F
2

I always do stopset = set(nltk.corpus.stopwords.words('english')) at the top of any module that needs it. Then it's easy to add more words to the set, plus membership checks are faster.
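A rough sketch of the set-based approach, assuming a short stand-in list instead of nltk.corpus.stopwords.words('english') so it runs without the corpus; the tokens are hypothetical. A set offers add() for one word and update() for several, and `in` checks use hashing rather than scanning the whole list.

```python
# Stand-in for nltk.corpus.stopwords.words('english') (assumption for illustration).
base_stopwords = ['the', 'is', 'a', 'of']

# Build the set once, at the top of the module.
stopset = set(base_stopwords)

# add() takes a single word; update() takes any iterable of words.
stopset.add('newWord')
stopset.update(['stopWord1', 'stopWord2'])

# Hypothetical tokens to filter.
tokens = ['the', 'newWord', 'parser', 'is', 'fast']
kept = [t for t in tokens if t not in stopset]

print(kept)
```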

Floats answered 1/4, 2011 at 16:1 Comment(0)
R
2

I was also looking for a solution to this. After some trial and error I managed to add words to the stop list. Hope this helps.

from nltk.corpus import stopwords

def removeStopWords(text):
    # select English stopwords
    cachedStopWords = set(stopwords.words("english"))
    # add custom words
    cachedStopWords.update(('and','I','A','And','So','arnt','This','When','It','many','Many','so','cant','Yes','yes','No','no','These','these'))
    # remove stop words
    new_str = ' '.join([word for word in text.split() if word not in cachedStopWords])
    return new_str
Revengeful answered 8/1, 2015 at 13:40 Comment(0)
F
2
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# add new words to the list
new_stopwords = ["new", "custom", "words", "add", "to", "list"]
stopwrd = nltk.corpus.stopwords.words('english')
stopwrd.extend(new_stopwords)
Femmine answered 12/12, 2017 at 6:27 Comment(0)
S
1

I use this code to add new stop words to the NLTK stop word list in Python:

from nltk.corpus import stopwords
#...#
stop_words = set(stopwords.words("english"))

#add words that aren't in the NLTK stopwords list
new_stopwords = ['apple','mango','banana']
new_stopwords_list = stop_words.union(new_stopwords)

print(new_stopwords_list)
Shush answered 20/1, 2019 at 8:58 Comment(0)
C
0

I've found (Python 3.7, Jupyter notebook on Windows 10, corporate firewall) that creating a list of new words and passing it to 'append' results in the entire list being appended as a single element of the original list.

This makes 'stopwords' into a list of lists.

Snijesh's answer works well, as does Jayantha's answer.
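A small sketch of the difference described here, using plain Python lists as a stand-in for the NLTK stop list (the words are made up for illustration): append() adds its argument as one element, so passing a list nests it, while extend() adds each item of the list separately.

```python
# Stand-in for the NLTK stop list (assumption for illustration).
stopwords = ['the', 'is']
new_words = ['cat', 'dog']

# append() adds the whole list as a single nested element.
nested = list(stopwords)
nested.append(new_words)

# extend() adds each word individually.
flat = list(stopwords)
flat.extend(new_words)

print(nested)  # ['the', 'is', ['cat', 'dog']]
print(flat)    # ['the', 'is', 'cat', 'dog']
print('cat' in nested, 'cat' in flat)  # False True
```

This is why a membership test against the nested version silently fails to match the new words.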

Concession answered 23/1, 2020 at 17:31 Comment(0)
V
0

STOP_WORDS.add("Lol")  # Add new stopword into corpus as you wish

Vesicatory answered 7/6, 2021 at 5:17 Comment(0)
