Efficiently calculate word frequency in a string

I am parsing a long string of text and counting the number of times each word occurs in Python. I have a function that works, but I am looking for advice on whether there are ways I can make it more efficient (in terms of speed), and whether there are Python library functions that could do this for me so I'm not reinventing the wheel.

Can you suggest a more efficient way to calculate the most common words that occur in a long string (usually over 1000 words in the string)?

Also, what's the best way to sort the dictionary into a list where the 1st element is the most common word, the 2nd element is the 2nd most common word, and so on?

test = """abc def-ghi jkl abc
abc"""

def calculate_word_frequency(s):
    # Post: return a list of words ordered from the most
    # frequent to the least frequent

    words = s.split()
    freq  = {}
    for word in words:
        if freq.has_key(word):
            freq[word] += 1
        else:
            freq[word] = 1
    return sort(freq)

def sort(d):
    # Post: sort dictionary d into list of words ordered
    # from highest freq to lowest freq
    # eg: For {"the": 3, "a": 9, "abc": 2} should be
    # sorted into the following list ["a","the","abc"]

    # I have never used lambdas, so I'm not sure this is correct
    return d.sort(cmp = lambda x,y: cmp(d[x],d[y]))

print calculate_word_frequency(test)
Krill answered 29/3, 2012 at 5:32
has_key is deprecated; use key in d instead. Also, your sort function is pretty wrong: return sorted(d, key=d.__getitem__, reverse=True) would do the descending sort by frequency and return the keys. – Fredelia
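For illustration, here is a sketch (not the asker's final code) of the two functions collapsed into one, with the comment's fixes applied: a `key in d` membership test instead of the removed `dict.has_key`, and `sorted()` with a key function instead of the Python 2 `cmp`-based sort:

```python
def calculate_word_frequency(s):
    # Return a list of words ordered from most to least frequent.
    freq = {}
    for word in s.split():
        if word in freq:  # replaces the removed dict.has_key(word)
            freq[word] += 1
        else:
            freq[word] = 1
    # sorted() with a key function replaces the Python 2 cmp-based sort;
    # reverse=True puts the highest counts first
    return sorted(freq, key=freq.get, reverse=True)

print(calculate_word_frequency("abc def-ghi jkl abc\nabc"))
# ['abc', 'def-ghi', 'jkl']
```

Since `sorted()` is stable, words with equal counts keep their first-seen order.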

Use collections.Counter:

>>> from collections import Counter
>>> test = 'abc def abc def zzz zzz'
>>> Counter(test.split()).most_common()
[('abc', 2), ('def', 2), ('zzz', 2)]
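The question also asks for a plain list of words ordered by frequency rather than (word, count) pairs; one way to get that from most_common() is a small comprehension (a sketch, not part of the original answer):

```python
from collections import Counter

test = 'abc def abc def zzz zzz'
# keep only the words, dropping the counts from the (word, count) pairs
ordered_words = [word for word, count in Counter(test.split()).most_common()]
print(ordered_words)
```

On Python 3.7+, words with equal counts come out in the order first encountered.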
Cognoscenti answered 29/3, 2012 at 5:39
Thanks for sharing. It also works when selecting a single cell from a pandas DataFrame: word_freq = Counter(df.iloc[0,2].split()).most_common() – Tswana
>>> test = """abc def-ghi jkl abc
... abc"""
>>> from collections import Counter
>>> words = Counter()
>>> words.update(test.split())  # update the counter with the words
>>> words.most_common()         # pairs ordered from most to least common
[('abc', 3), ('def-ghi', 1), ('jkl', 1)]
Jacksmelt answered 29/3, 2012 at 5:38

You can also use NLTK (Natural Language Toolkit). It provides very nice libraries for processing and studying text. For this example you can use:

from nltk import FreqDist

text = "aa bb cc aa bb"
# FreqDist over a raw string would count characters,
# so split the text into words first
fdist1 = FreqDist(text.split())

# show the 10 most frequent words in the text
print(fdist1.most_common(10))

the result will be:

[('aa', 2), ('bb', 2), ('cc', 1)]
Rubricate answered 6/10, 2014 at 9:11
nltk might be overkill here. #just thinking. – Virilism

If you want to display the frequent words and their counts rather than a list, here is my code:

from collections import Counter

text = 'abc def ghi def abc abc'

arr = Counter(text.split()).most_common()

for word, count in arr:
    print(word, count)

Output:

abc 3
def 2
ghi 1
Facsimile answered 24/3, 2021 at 12:31
This is the same as the accepted answer. No value added. – Hilten

You can create a function to do the job:

def frequency(text):
    result = {}
    for word in text.split():
        if word not in result:
            result[word] = 0
        result[word] += 1
    return result


txt = "Fear leads to anger anger leads to hatred hatred leads to conflict conflict leads to suffering"
print(frequency(txt))

Output:

{'Fear': 1, 'leads': 4, 'to': 4, 'anger': 2, 'hatred': 2, 'conflict': 2, 'suffering': 1}
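The function above returns an unsorted dictionary; if the ordered list the original question asks for is wanted, the dictionary can be sorted by its counts. A minimal sketch:

```python
freq = {'Fear': 1, 'leads': 4, 'to': 4, 'anger': 2,
        'hatred': 2, 'conflict': 2, 'suffering': 1}
# sort the keys by their counts, highest first; sorted() is stable,
# so words with equal counts keep their original dict order
ordered = sorted(freq, key=freq.get, reverse=True)
print(ordered)
# ['leads', 'to', 'anger', 'hatred', 'conflict', 'Fear', 'suffering']
```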
Supportive answered 2/5, 2022 at 11:40
