Print the 10 most frequently occurring words of a text, both including and excluding stopwords

I took the question from here and made my own changes. I have the following code:

import nltk
from nltk.corpus import stopwords
def content_text(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() in stopwords]
    return content

How can I print the 10 most frequently occurring words of a text, 1) including and 2) excluding stopwords?

Whit answered 8/2, 2015 at 10:22 Comment(1)
Possible duplicate of How can I count the occurrences of a list item in Python? – Foretell

There is a FreqDist class in nltk for this:

import nltk

# tokenize the text and build a frequency distribution over lowercased tokens
allWords = nltk.tokenize.word_tokenize(text)
allWordDist = nltk.FreqDist(w.lower() for w in allWords)

# the same distribution, but with English stopwords filtered out
stopwords = nltk.corpus.stopwords.words('english')
allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)

To extract the 10 most common words (most_common returns a list of (word, count) pairs):

mostCommon = [word for word, count in allWordDist.most_common(10)]
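
For completeness, a minimal sketch of printing both top-10 lists, assuming the allWordDist and allWordExceptStopDist objects built above:

# print the ten most common words, first including and then excluding stopwords
for word, count in allWordDist.most_common(10):
    print(word, count)

for word, count in allWordExceptStopDist.most_common(10):
    print(word, count)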
Satinwood answered 8/2, 2015 at 11:15 Comment(3)
I get this error: AttributeError: 'FreqDist' object has no attribute 'most_common' – Whit
Can you please provide the full listing? – Satinwood
You should compare against the stopword list in lowercase, i.e. use w.lower() not in stopwords rather than w not in stopwords. – Leger

I'm not sure about the in stopwords check in your function (I imagine it needs to be in), but you can use a collections.Counter with most_common(10) to get the 10 most frequent words:

from collections import Counter
from string import punctuation
import nltk


def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))  # O(1) lookups
    with_stp = Counter()
    without_stp = Counter()
    with open(text) as f:
        for line in f:
            spl = line.split()
            # update the count of all words in the line that are in stopwords
            with_stp.update(w.lower().rstrip(punctuation) for w in spl if w.lower() in stopwords)
            # update the count of all words in the line that are not in stopwords
            without_stp.update(w.lower().rstrip(punctuation) for w in spl if w.lower() not in stopwords)
    # return the ten most common words from each counter as (word, count) pairs
    return with_stp.most_common(10), without_stp.most_common(10)

wth_stop, wthout_stop = content_text(...)

If you are passing in an nltk file object, just iterate over it:

def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    with_stp = Counter()
    without_stp = Counter()
    for word in text:
        word = word.lower()
        if word in stopwords:
            # update the count of words that are in stopwords
            with_stp.update([word])
        else:
            # update the count of words that are not in stopwords
            without_stp.update([word])
    # return the ten most common words from each counter
    return [w for w, _ in with_stp.most_common(10)], [w for w, _ in without_stp.most_common(10)]

print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))

The nltk method includes punctuation, so that may not be what you want.
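
If punctuation is unwanted, a minimal sketch (using the same inaugural corpus as above) is to keep only alphabetic tokens before counting:

from collections import Counter
import nltk

# keep only alphabetic tokens, lowercased, so punctuation marks are not counted
words = [w.lower() for w in nltk.corpus.inaugural.words('2009-Obama.txt') if w.isalpha()]
print(Counter(words).most_common(10))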

Circinus answered 8/2, 2015 at 10:32 Comment(12)
When I write wth_stop, wthout_stop = content_text(nltk.corpus.inaugural.words('2009-Obama.txt')) I get an error. – Whit
@user2064809, I tested it and it works fine for me, what error are you getting? – Circinus
TypeError: coercing to Unicode: need string or buffer, StreamBackedCorpusView found – Whit
What exactly should I put inside the content_text() function? – Whit
Just put '2009-Obama.txt'. – Circinus
@user2064809: You might have to make some changes to the code if you are using Python 2. Also, if you need help understanding an error message, you need to provide all of it. It's much easier to understand when we know exactly where in the script the exception was raised. – Gee
I was just guessing based on the error message. A lot of functions that used to return ASCII strings or simple structures such as lists in Python 2 will return Unicode and more complicated but efficient iterators such as views in Python 3. Could it perhaps be caused by a different version of nltk? – Gee
@HåkenLid, the error was because the OP was passing an nltk object to the first function instead of just a file name. – Circinus
That's all good then. Are you sure the last line in your snippet will run properly, though? There are two closing parentheses missing, and you'll probably have to import print_function with Python 2. – Gee
@HåkenLid, that was just a typo from copy/pasting. There is no need to import print_function. – Circinus
It works! Thank you. In the first code I should put the file path: wth_stop, wthout_stop = content_text('C:\\Documents and Settings\\Application Data\\nltk_data\\corpora\\inaugural\\2009-Obama.txt') instead of nltk.corpus.inaugural.words('2009-Obama.txt'). But in the second code, print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt'))) works!! – Whit
@user2064809, I was not sure what exactly you were passing as text, so I just added a way to use a normal file and an nltk file object. The first example will work for any file; just pass the path to the file. – Circinus

You can try this:

# print each of the ten most common words with its frequency, separated by ';'
for word, frequency in allWordDist.most_common(10):
    print('%s;%d' % (word, frequency))
Gamaliel answered 28/4, 2016 at 6:51 Comment(0)
