How to remove stop words using NLTK or Python

I have a dataset from which I would like to remove stop words.

I used NLTK to get a list of stop words:

from nltk.corpus import stopwords

stopwords.words('english')

Exactly how do I compare the data to the list of stop words, and thus remove the stop words from the data?

Haldas asked 30/3, 2011 at 12:36 Comment(5)
Where did you get the stopwords from? Is this from NLTK? – Epigoni
@MattO'Brien from nltk.corpus import stopwords for future googlers – Sabella
It is also necessary to run nltk.download("stopwords") in order to make the stopword corpus available. – Teredo
See also https://mcmap.net/q/160894/-stopword-removal-with-nltk – Oxy
Pay attention that a word like "not" is also considered a stopword in NLTK. If you are doing something like sentiment analysis or spam filtering, a negation may change the entire meaning of the sentence, and if you remove it in the preprocessing phase you might not get accurate results. – Ruching
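A minimal sketch pulling the comments above together (the one-time nltk.download call plus the import from the question):

import nltk
nltk.download("stopwords")  # one-time download of the stopword corpus

from nltk.corpus import stopwords
print(stopwords.words("english")[:5])  # ['i', 'me', 'my', 'myself', 'we']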
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
Comic answered 30/3, 2011 at 12:53 Comment(4)
Thanks to both answers; they both work, although it seems I have a flaw in my code preventing the stop list from working correctly. Should this be a new question post? Not sure how things work around here just yet! – Haldas
To improve performance, consider stops = set(stopwords.words("english")) instead. – Araarab
>>> import nltk >>> nltk.download() (Source) – Putdown
stopwords.words('english') are lower case, so make sure to use only lower-case words in the list, e.g. [w.lower() for w in word_list] – Haldas
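Combining the two suggestions above, a sketch that builds the stopword set once and lowercases each word before the membership test:

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))  # set membership is O(1); a list would be O(n)
filtered_words = [word for word in word_list if word.lower() not in stops]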

You could also do a set diff, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
Vail answered 26/3, 2012 at 22:25 Comment(2)
Note: this converts the sentence to a SET, which removes all duplicate words, and therefore you will not be able to use frequency counting on the result – Chandra
Converting to a set might remove viable information from the sentence by dropping multiple occurrences of an important word. – Triple
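If order and duplicates matter, as the comments warn, a sketch that uses the set only for membership tests and keeps the token sequence intact (the whitespace pattern is an assumed example, since pattern is left unspecified above):

import nltk
from nltk.corpus import stopwords

sentence = "the cat saw the other cat"
tokens = nltk.regexp_tokenize(sentence, pattern=r'\s+', gaps=True)  # split on whitespace
stops = set(stopwords.words('english'))
print([t for t in tokens if t not in stops])  # ['cat', 'saw', 'cat'] -- duplicates survive
print(list(set(tokens) - stops))              # duplicates and order are lost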

To exclude all types of stop words, including the NLTK stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))       # About 900 stopwords
nltk_words = list(stopwords.words('english')) # About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
Smarm answered 27/10, 2017 at 14:31 Comment(2)
I'm getting len(get_stop_words('en')) == 174 vs len(stopwords.words('english')) == 179 – Cavalcade
Membership testing against a list is not efficient. – Sixtasixteen
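As the last comment notes, membership tests against a list are linear; a sketch of the same combination using a set:

from stop_words import get_stop_words
from nltk.corpus import stopwords

combined = set(get_stop_words('en')) | set(stopwords.words('english'))
output = [w for w in word_list if w not in combined]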

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of word_list
stops = stopwords.words('english')  # fetch the list once, not on every iteration
for word in word_list:  # iterate over word_list
    if word in stops:
        filtered_word_list.remove(word)  # remove the word if it is a stopword
Lecythus answered 30/3, 2011 at 12:51 Comment(1)
this will be a whole lot slower than Daren Thomas's list comprehension... – Confine

There's a very simple, light-weight Python package, stop-words, just for this purpose.

First install the package using: pip install stop-words

Then you can remove your words in one line using list comprehension:

from stop_words import get_stop_words

stop_words = get_stop_words('english')  # fetch the list once rather than on every iteration
filtered_words = [word for word in dataset if word not in stop_words]

This package is very light-weight to download (unlike nltk), works for both Python 2 and Python 3, and it has stop words for many other languages (a usage sketch follows the list):

    Arabic
    Bulgarian
    Catalan
    Czech
    Danish
    Dutch
    English
    Finnish
    French
    German
    Hungarian
    Indonesian
    Italian
    Norwegian
    Polish
    Portuguese
    Romanian
    Russian
    Spanish
    Swedish
    Turkish
    Ukrainian
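For instance, a sketch for one of the listed languages (the package accepts full names like 'german' as well as ISO codes like 'de'):

from stop_words import get_stop_words

german_stops = set(get_stop_words('german'))  # get_stop_words('de') is equivalent
filtered = [word for word in dataset if word not in german_stops]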
Disabuse answered 22/9, 2019 at 12:13 Comment(0)

Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # remove stopwords from text
Shrum answered 8/2, 2020 at 21:1 Comment(1)
Don't use this approach for French, or contractions like l' will not be captured. – Lucilius

Use the textcleaner library to remove stopwords from your data.

Follow this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to do so with this library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the above code to remove the stop words.

Calpe answered 12/2, 2019 at 12:30 Comment(0)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# keep only the tokens that are not stopwords
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)
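The output should look like this (note that 'This' survives, because the stopword list is lower case and the comparison is case sensitive):

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']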
Plourde answered 5/7, 2020 at 8:27 Comment(0)

Although the question is a bit old, here is a new library worth mentioning that can do extra tasks.

In some cases, you don't want only to remove stop words; rather, you want to find the stopwords in the text data and store them in a list, so that you can locate the noise in the data and make it more interactive.

The library is called 'textfeatures'. You can use it as follows:

! pip install textfeatures
import textfeatures as tf
import pandas as pd

For example, suppose you have the following set of strings:

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"]

df = pd.DataFrame(texts) # Convert to a dataframe
df.columns = ['text'] # give a name to the column
df

Now, call the stopwords() function and pass the parameters you want:

tf.stopwords(df,"text","stopwords") # extract stop words into a new "stopwords" column
df[["text","stopwords"]].head() # show the two columns

The result is going to be:

    text                                 stopwords
0   blue car and blue window             [and]
1   black crow in the window             [in, the]
2   i see my reflection in the window    [i, my, in, the]

As you can see, the last column has the stop words included in that document (record).

Trunnion answered 24/2, 2021 at 12:55 Comment(1)
probably should not use the alias tf, as this makes it look like a new TensorFlow feature to many of us :-) – Hyponasty

You can use this function; note that you need to lowercase all the words:

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    stops = set(stopwords.words("english"))  # build the set once instead of per word
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lower cased
        if word not in stops:
            processed_word_list.append(word)
    return processed_word_list
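For example, with a made-up word_list:

>>> remove_stopwords(["This", "is", "a", "Test"])
['test']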
Campania answered 13/6, 2017 at 15:48 Comment(0)

Using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
Superphosphate answered 2/10, 2017 at 2:55 Comment(1)
if word_list is large this code is very slow. It is better to convert the stopwords list to a set before using it: .. in set(stopwords.words('english')). – Trek
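Applying that comment's suggestion, a sketch with the set built once, outside the lambda:

from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
filtered_words = list(filter(lambda word: word not in stops, word_list))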

Here is an example. First, I extract the text data from the data frame (twitter_df) for further processing:

     tweetText = twitter_df['text']

Then, to tokenize it, I use the following method:

     from nltk.tokenize import word_tokenize
     tweetText = tweetText.apply(word_tokenize)

Then, to remove stop words,

     import nltk
     from nltk.corpus import stopwords
     nltk.download('stopwords')

     stop_words = set(stopwords.words('english'))
     tweetText = tweetText.apply(lambda x:[word for word in x if word not in stop_words])
     tweetText.head()

I think this will help you.

Atahualpa answered 13/10, 2020 at 5:28 Comment(0)

In case your data are stored as a Pandas DataFrame, you can use remove_stopwords from texthero, which uses the NLTK stopwords list by default.

import pandas as pd
import texthero as hero
df['text_without_stopwords'] = hero.remove_stopwords(df['text'])
Loaf answered 2/6, 2020 at 6:58 Comment(0)

Try this:

sentence = " ".join([word for word in sentence.split() if word not in stopwords.words('english')])
Fallfish answered 28/12, 2023 at 18:59 Comment(0)
