Removing stop words and tokenizing a document using NLTK
I’m having difficulty removing stop words from and tokenizing a .txt file using NLTK. I keep getting the following error: AttributeError: 'list' object has no attribute 'lower'.

I can’t figure out what I’m doing wrong, though this is my first time doing something like this. My code is below; I’d appreciate any suggestions, thanks.

    import nltk
    from nltk.corpus import stopwords
    s = open("C:\zircon\sinbo1.txt").read()
    tokens = nltk.word_tokenize(s)
    def cleanupDoc(s):
            stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        cleanup = [token.lower()for token in tokens.lower() not in stopset and  len(token)>2]
        return cleanup
    cleanupDoc(s)
Bulgar answered 30/6, 2013 at 12:24
You can use the stopwords lists from NLTK, see How to remove stop words using nltk or python.

Most probably you will also want to strip off punctuation; for that you can use string.punctuation, see http://docs.python.org/2/library/string.html:

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> import string
>>> sent = "this is a foo bar, bar black sheep."
>>> stop = set(stopwords.words('english') + list(string.punctuation))
>>> [i for i in word_tokenize(sent.lower()) if i not in stop]
['foo', 'bar', 'bar', 'black', 'sheep']
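
If this is needed in more than one place, the same idea can be wrapped in a small helper (a minimal sketch; remove_stopwords is a hypothetical name, and it assumes the punkt and stopwords NLTK data packages have been downloaded):

    import string
    from nltk import word_tokenize
    from nltk.corpus import stopwords

    def remove_stopwords(text):
        # Stop words plus the individual punctuation characters, as above
        stop = set(stopwords.words('english') + list(string.punctuation))
        return [tok for tok in word_tokenize(text.lower()) if tok not in stop]

    print(remove_stopwords("this is a foo bar, bar black sheep."))
    # ['foo', 'bar', 'bar', 'black', 'sheep']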
Preceptor answered 11/3, 2014 at 11:31
From the error message, it seems you're trying to lowercase a list, not a string. tokens = nltk.word_tokenize(s) returns a list of tokens, not the single string your code seems to expect.

It would be helpful to know what format your sinbo1.txt file is in.

A few syntax issues:

  1. Import should be in lowercase: import nltk

  2. The line s = open("C:\zircon\sinbo1.txt").read() reads the whole file in at once, not a single line at a time. This may be problematic because word_tokenize works on a single sentence, not an arbitrary sequence of tokens. As written, it assumes that your sinbo1.txt file contains a single sentence. If it doesn't, you may want to either (a) use a for loop over the file instead of read(), or (b) split the text into sentences first with NLTK's Punkt sentence tokenizer (nltk.sent_tokenize), as in the sketch after the corrected function below.

  3. The first line of your cleanupDoc function is not properly indented. Your function should look like this (even if the functions within it change):

    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        tokens = nltk.word_tokenize(s)
        cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
        return cleanup
    
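If sinbo1.txt holds more than one sentence, one way to apply option (b) from point 2 is to split the text with nltk.sent_tokenize and clean each sentence in turn (a sketch reusing cleanupDoc from above; the path is the one from the question, written as a raw string so the backslashes stay literal):

    import nltk

    with open(r"C:\zircon\sinbo1.txt") as f:
        text = f.read()

    # Split the raw text into sentences, then clean each one separately
    for sentence in nltk.sent_tokenize(text):
        print(cleanupDoc(sentence))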
Focalize answered 1/7, 2013 at 21:54
    import nltk
    from nltk.corpus import stopwords

    def cleanupDoc(s):
        stopset = set(stopwords.words('english'))
        # Drop every whitespace-separated word that is a stop word, then rejoin
        cleanup = " ".join(filter(lambda word: word not in stopset, s.split()))
        return cleanup

    s = "I am going to disco and bar tonight"
    x = cleanupDoc(s)
    print(x)  # I going disco bar tonight

This code can help in solving the above problem.
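
Note that this version splits on whitespace instead of using word_tokenize, and it compares words case-sensitively, so a capitalized stop word such as "I" in the sample sentence survives the filter. A case-insensitive variant (a sketch keeping the same structure) would be:

    cleanup = " ".join(word for word in s.split() if word.lower() not in stopset)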

Festoonery answered 10/3, 2014 at 12:55
In your particular case the error is in cleanup = [token.lower()for token in tokens.lower() not in stopset and len(token)>2]

tokens is a list, so you cannot call .lower() on it. A corrected way of writing the above line would be:

cleanup = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
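
For example, with an illustrative token list (and stopset defined as in the question):

>>> tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat']
>>> [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
['cat', 'sat', 'mat']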

I hope this helps.

Jordanson answered 25/9, 2019 at 17:4
