Get rid of stopwords and punctuation
M

3

10

I'm struggling with NLTK stopwords.

Here's my bit of code. Could someone tell me what's wrong?

from nltk.corpus import stopwords

def removeStopwords( palabras ):
     return [ word for word in palabras if word not in stopwords.words('spanish') ]

palabras = ''' my text is here '''
Marathon answered 4/4, 2011 at 16:53 Comment(3)
Are you just missing the call to the function? Try adding print removeStopwords(palabras) after your last line. - Machuca
Right!!! I missed it! - Marathon
I don't know if you ran into this problem: stopwords.words('spanish') returns a list in which not every word is a Unicode string. So using the 'in' operator to check whether a Unicode word (u'word') is present in that list can give a wrong comparison. I got this message: "UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal". Any clue? I guess nltk.corpus.stopwords should return Unicode lists. Thanks! - Skeleton
R
28

Your problem is that iterating over a string returns each character, not each word.

For example:

>>> palabras = "Buenos dias"
>>> [c for c in palabras]
['B', 'u', 'e', 'n', 'o', 's', ' ', 'd', 'i', 'a', 's']

You need to iterate over the words and check each one. Fortunately, split already exists in the Python standard library, as a method on strings (str.split). However, since you are dealing with natural language that includes punctuation, you should look here for a more robust answer that uses the re module.
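To illustrate the difference, here is a minimal sketch (written for Python 3; the sample text is made up for the example) comparing str.split with re.findall:

```python
import re

# Made-up sample text, only for illustration
text = "Hola, mundo. Buenos dias!"

# str.split only cuts on whitespace, so punctuation stays glued to the words
split_words = text.split()
print(split_words)   # ['Hola,', 'mundo.', 'Buenos', 'dias!']

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
regex_words = re.findall(r"\w+", text)
print(regex_words)   # ['Hola', 'mundo', 'Buenos', 'dias']
```

With split, a token like 'Hola,' would never match a stoplist entry, which is why the regex approach is the more robust of the two here.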

Once you have a list of words, you should lowercase them all before the comparison, and then compare them in the manner you have already shown.
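As a small sketch of why the lowercasing matters (Python 3; the short stoplist below is a hypothetical stand-in for the NLTK list, whose entries are, as far as I know, all lowercase):

```python
# Hypothetical stand-in for stopwords.words('spanish') (an assumption for this example)
stoplist = ["el", "de", "que"]

words = ["El", "problema", "de", "Que"]

# Without lowercasing, capitalized stopwords slip through the filter
case_sensitive = [w for w in words if w not in stoplist]
print(case_sensitive)     # ['El', 'problema', 'Que']

# Lowercasing each word first makes the comparison case-insensitive
case_insensitive = [w for w in words if w.lower() not in stoplist]
print(case_insensitive)   # ['problema']
```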

Buena suerte.

EDIT 1

Okay, try this code; it should work for you. It shows two ways to do it. They are essentially identical, but the first is a bit clearer while the second is more pythonic.

import re
from nltk.corpus import stopwords

sentence = 'El problema del matrimonio es que se acaba todas las noches despues de hacer el amor, y hay que volver a reconstruirlo todas las mananas antes del desayuno.'

# We only want to work with lowercase for the comparisons
sentence = sentence.lower()

# Remove punctuation and split into separate words
# (re.LOCALE is dropped: Python 3 rejects it for str patterns)
words = re.findall(r'\w+', sentence, flags=re.UNICODE)

# This is the simple way to remove stop words
important_words = []
for word in words:
    if word not in stopwords.words('spanish'):
        important_words.append(word)

print(important_words)

# This is the more pythonic way (list() keeps the result printable on Python 3)
important_words = list(filter(lambda x: x not in stopwords.words('spanish'), words))

print(important_words)

I hope this helps you.
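A possible variation, sketched for Python 3: build the stopword collection once as a set and look words up in it, rather than calling stopwords.words('spanish') for every word. The set below is a tiny hand-written stand-in so the sketch runs without the NLTK corpus, and filter_stopwords is a hypothetical helper name, not part of NLTK.

```python
import re

# Tiny hand-written stand-in so the sketch runs without NLTK data (an assumption);
# with NLTK and its corpus installed you would use: set(stopwords.words('spanish'))
SPANISH_STOPWORDS = {"el", "del", "es", "que", "se", "las", "de", "y", "a"}

def filter_stopwords(sentence, stopword_set=SPANISH_STOPWORDS):
    """Lowercase, tokenize with a regex, and drop words found in stopword_set."""
    words = re.findall(r"\w+", sentence.lower())
    # Membership tests on a set do not scan the whole collection like a list does
    return [word for word in words if word not in stopword_set]

result = filter_stopwords("El problema del matrimonio es que se acaba.")
print(result)  # ['problema', 'matrimonio', 'acaba']
```

Passing the set in as a parameter also makes it easy to swap in another language's stoplist.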

Roseline answered 4/4, 2011 at 17:5 Comment(3)
We use "Buena suerte". The re module helped me with the punctuation, but I'm still trying to combine it with the stopword function. - Marathon
Turn the stopwords into a set; it will be a lot faster. - Rebellion
If they are already using NLTK, why not just use a tokenizer and solve it in three lines? - Oldster
T
4

If you run a tokenizer first, you compare a list of tokens (symbols) against the stoplist, so you don't need the re module. I added an extra argument to switch between languages.

import nltk
from nltk.corpus import stopwords

def remove_stopwords(sentence, language):
    return [token for token in nltk.word_tokenize(sentence) if token.lower() not in stopwords.words(language)]

Let me know if this was useful ;)

Tjaden answered 8/1, 2015 at 16:32 Comment(0)
M
2

Another option, using more modern modules (2020):

from nltk.corpus import stopwords
from textblob import TextBlob

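def removeStopwords(texto):
    # TextBlob(...).words tokenizes the text and leaves out punctuation
    blob = TextBlob(texto).words
    # Compare in lowercase so capitalized stopwords are removed too
    outputlist = [word for word in blob if word.lower() not in stopwords.words('spanish')]
    return ' '.join(outputlist)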
Mite answered 9/3, 2020 at 12:1 Comment(0)
