Print the 10 most frequently occurring words of a text, both including and excluding stopwords

I took the question from here and made my own changes. I have the following code:

import nltk
from nltk.corpus import stopwords
def content_text(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() in stopwords]
    return content

How can I print the 10 most frequently occurring words of a text, 1) including and 2) excluding stopwords?

Whit answered 8/2, 2015 at 10:22 Comment(1)
Possible duplicate of How can I count the occurrences of a list item in Python? – Foretell

There is a FreqDist class in nltk for this:

import nltk

# tokenize the text and build a frequency distribution over lowercased tokens
allWords = nltk.tokenize.word_tokenize(text)
allWordDist = nltk.FreqDist(w.lower() for w in allWords)

# the same distribution, but with English stopwords filtered out
stopwords = nltk.corpus.stopwords.words('english')
allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)

To extract the 10 most common words (most_common returns a list of (word, count) pairs):

mostCommon = [word for word, count in allWordDist.most_common(10)]
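
For completeness, a minimal sketch of printing both top-10 lists, assuming the allWordDist and allWordExceptStopDist objects built above:

# print the ten most common words, first including and then excluding stopwords
for word, count in allWordDist.most_common(10):
    print(word, count)

for word, count in allWordExceptStopDist.most_common(10):
    print(word, count)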
Satinwood answered 8/2, 2015 at 11:15 Comment(3)
I get this error: AttributeError: 'FreqDist' object has no attribute 'most_common' – Whit
Can you please provide the full listing? – Satinwood
You should compare against the stopword list in lowercase, i.e. use w.lower() not in stopwords rather than w not in stopwords. – Leger

I'm not sure about the in stopwords check in your function (I imagine it needs to be in), but you can use a collections.Counter with most_common(10) to get the 10 most frequent words:

from collections import Counter
from string import punctuation
import nltk


def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))  # O(1) lookups
    with_stp = Counter()
    without_stp = Counter()
    with open(text) as f:
        for line in f:
            spl = line.split()
            # update the count of all words in the line that are in stopwords
            with_stp.update(w.lower().rstrip(punctuation) for w in spl if w.lower() in stopwords)
            # update the count of all words in the line that are not in stopwords
            without_stp.update(w.lower().rstrip(punctuation) for w in spl if w.lower() not in stopwords)
    # return the ten most common words from each counter as (word, count) pairs
    return with_stp.most_common(10), without_stp.most_common(10)

wth_stop, wthout_stop = content_text(...)

If you are passing in an nltk file object, just iterate over it:

def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    with_stp = Counter()
    without_stp = Counter()
    for word in text:
        word = word.lower()
        if word in stopwords:
            # update the count of words that are in stopwords
            with_stp.update([word])
        else:
            # update the count of words that are not in stopwords
            without_stp.update([word])
    # return the ten most common words from each counter
    return [w for w, _ in with_stp.most_common(10)], [w for w, _ in without_stp.most_common(10)]

print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))

The nltk method includes punctuation, so that may not be what you want.
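
If punctuation is unwanted, a minimal sketch (using the same inaugural corpus as above) is to keep only alphabetic tokens before counting:

from collections import Counter
import nltk

# keep only alphabetic tokens, lowercased, so punctuation marks are not counted
words = [w.lower() for w in nltk.corpus.inaugural.words('2009-Obama.txt') if w.isalpha()]
print(Counter(words).most_common(10))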

Circinus answered 8/2, 2015 at 10:32 Comment(12)
When I write wth_stop, wthout_stop = content_text(nltk.corpus.inaugural.words('2009-Obama.txt')) I get an error. – Whit
@user2064809, I tested it and it works fine for me, what error are you getting? – Circinus
TypeError: coercing to Unicode: need string or buffer, StreamBackedCorpusView found – Whit
What exactly should I put inside the content_text() function? – Whit
Just put '2009-Obama.txt'. – Circinus
@user2064809: You might have to make some changes to the code if you are using Python 2. Also, if you need help understanding an error message, you need to provide all of it. It's much easier to understand when we know exactly where in the script the exception was raised. – Gee
I was just guessing based on the error message. A lot of functions that used to return ASCII strings or simple structures such as lists in Python 2 will return Unicode and more complicated but efficient iterators such as views in Python 3. Could it perhaps be caused by a different version of nltk? – Gee
@HåkenLid, the error was because the OP was passing an nltk object to the first function instead of just a file name. – Circinus
That's all good then. Are you sure the last line in your snippet will run properly, though? There are two closing parentheses missing, and you'll probably have to import print_function with Python 2. – Gee
@HåkenLid, that was just a typo from copy/pasting. There is no need to import print_function. – Circinus
It works! Thank you. In the first code I should put the file path: wth_stop, wthout_stop = content_text('C:\\Documents and Settings\\Application Data\\nltk_data\\corpora\\inaugural\\2009-Obama.txt') instead of nltk.corpus.inaugural.words('2009-Obama.txt'). But in the second code, print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt'))) works!! – Whit
@user2064809, I was not sure what exactly you were passing as text, so I just added a way to use a normal file and an nltk file object. The first example will work for any file; just pass the path to the file. – Circinus

You can try this:

# print each of the ten most common words with its frequency, separated by ';'
for word, frequency in allWordDist.most_common(10):
    print('%s;%d' % (word, frequency))
Gamaliel answered 28/4, 2016 at 6:51 Comment(0)
