counting duplicate words in python the fastest way [duplicate]
Asked Answered
W

5

5

I was trying to count duplicate words over a list of 230 thousand words.I used python dictionary to do so. The code is given below:

for words in word_list:
    if words in word_dict.keys():
       word_dict[words] += 1
    else:
       word_dict[words] = 1

The above code took 3 minutes!. I ran the same code over 1.5 million words and it was running for more than 25 minutes and I lost my patience and terminated. Then I found that I can use the following code from here (also shown below). The result was so surprising, it completed within seconds!. So my question is what is the faster way to do this operation?. I guess the dictionary creation process must be taking O(N) time. How was the Counter method able to complete this process in seconds and create an exact dictionary of word as key and frequency as it's value?

from collections import Counter
word_dict = Counter(word_list)
Whitecollar answered 17/1, 2013 at 8:5 Comment(3)
It might also be because word_dict.keys() gets all the keys as a list, and checking membership in a list is a O(n) operation, while checking for membership in a hashmap is much faster.Eboat
Code for collections.Counter is available hg.python.org/cpython/file/2.7/Lib/collections.pyEboat
related: Python - Is a dictionary slow to find frequency of each character?Caesarean
P
7

It's because of this:

if words in word_dict.keys():

.keys() returns a list of all the keys. Lists take linear time to scan, so your program was running in quadratic time!

Try this instead:

if words in word_dict:

Also, if you're interested, you can see the Counter implementation for yourself. It's written in regular Python.

Paschasia answered 17/1, 2013 at 8:10 Comment(1)
On Python 3 Counter can use _count_elements() helper written in CCaesarean
W
5

your dictionary counting method is not well constructed.

you could have used a defaultdict in the following way:

d = defaultdict(int)

for word in word_list:
    d[word] += 1

but the counter method from itertools is still faster even though it is doing almost the same thing, because it is written in a more efficient implementation. however, with the counter method, you need to pass it a list to count, whereas using a defaultdict, you can put sources from different locations and have a more complicated loop.

ultimately it is your preference. if counting a list, counter is the way to go, if iterating from multiple sources, or you simply want a counter in your program and dont want the extra lookup to check if an item is already being counted or not. then defaultdict is your choice.

Wert answered 17/1, 2013 at 8:14 Comment(1)
It's from library collections instead of itertoolsIthyphallic
E
1

You can actually look at the Counter code, here is the update method that is called on init:

(Notice it uses the performance trick of defining a local definition of self.get)

def update(self, iterable=None, **kwds):
    '''Like dict.update() but add counts instead of replacing them.

    Source can be an iterable, a dictionary, or another Counter instance.

    >>> c = Counter('which')
    >>> c.update('witch')           # add elements from another iterable
    >>> d = Counter('watch')
    >>> c.update(d)                 # add elements from another counter
    >>> c['h']                      # four 'h' in which, witch, and watch
    4

    '''
    # The regular dict.update() operation makes no sense here because the
    # replace behavior results in the some of original untouched counts
    # being mixed-in with all of the other counts for a mismash that
    # doesn't have a straight-forward interpretation in most counting
    # contexts.  Instead, we implement straight-addition.  Both the inputs
    # and outputs are allowed to contain zero and negative counts.

    if iterable is not None:
        if isinstance(iterable, Mapping):
            if self:
                self_get = self.get
                for elem, count in iterable.iteritems():
                    self[elem] = self_get(elem, 0) + count
            else:
                super(Counter, self).update(iterable) # fast path when counter is empty
        else:
            self_get = self.get
            for elem in iterable:
                self[elem] = self_get(elem, 0) + 1
    if kwds:
        self.update(kwds)
Euton answered 17/1, 2013 at 8:10 Comment(0)
H
0

You could also try to use defaultdict as a more competitive choice. Try:

from collections import defaultdict

word_dict = defaultdict(lambda: 0)
for word in word_list:
    word_dict[word] +=1

print word_dict
Hagi answered 17/1, 2013 at 8:14 Comment(0)
T
0

Similar to what monkut mentioned, one of the best ways to do this is to utilize the .get() function. Credit for this goes to Charles Severance and the Python For Everybody Course

For testing:

# Pretend line is as follow. 
# It can and does contain \n (newline) but not \t (tab).
line = """Your battle is my battle . We fight together . One team . One team . 
Shining sun always comes with the rays of hope . The hope is there . 
Our best days yet to come . Let the hope light the road .""".lower()

His code (with my notes):

# make an empty dictionary
# split `line` into a list. default is to split on a space character
# etc., etc.
# iterate over the LIST of words (made from splitting the string)
counts = dict()
words = line.split() 
for word in words: 
    counts[word] = counts.get(word,0) + 1

Your code:

for words in word_list:
if words in word_dict.keys():
   word_dict[words] += 1
else:
   word_dict[words] = 1

.get() does this:

  • Return the VALUE in the dictionary associated with word.
  • Otherwise (, if the word is not a key in the dictionary,) return 0.

No matter what is returned, we add 1 to it. Thus it handles the base case of seeing the word for the first time. We cannot use a dictionary comprehension, since the variable the comprehension is assigned to won't exist as we are creating that variable. Meaning

this: counts = { word:counts.get(word,0) + 1 for word in words} is not possible, since counts (is being created and assigned to at the same time. Alternatively, since) counts the variable hasn't been fully defined when we reference it (again) to .get() from it.

Output

>> counts
{'.': 8,
'always': 1,
'battle': 2,
'best': 1,
'come': 1,
'comes': 1,
'days': 1,
'fight': 1,
'hope': 3,
'is': 2,
'let': 1,
'light': 1,
'my': 1,
'of': 1,
'one': 2,
'our': 1,
'rays': 1,
'road': 1,
'shining': 1,
'sun': 1,
'team': 2,
'the': 4,
'there': 1,
'to': 1,
'together': 1,
'we': 1,
'with': 1,
'yet': 1,
'your': 1}

As an aside here is a "loaded" use of .get() that I wrote as a way to solve the classic FizzBuzz question. I'm currently writing code for a similar situation in which I will use modulus and a dictionary, but for a split string as input.

Traherne answered 30/6, 2021 at 15:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.