You asked which was more efficient. Assuming that you are talking about execution speed: If your data is small, it doesn't matter. If it is large and typical, the "already exists" case will happen much more often than the "not in dict" case. This observation explains some of the results.
Below is some code which can be used with the timeit
module to explore speed without file-reading overhead. I have taken the liberty of adding a 5th method, which is not uncompetetive and will run on any Python from at least 1.5.2 [tested] onwards.
from collections import defaultdict, Counter
def tally0(iterable):
# DOESN'T WORK -- common base case for timing
d = {}
for item in iterable:
d[item] = 1
return d
def tally1(iterable):
d = {}
for item in iterable:
if item in d:
d[item] += 1
else:
d[item] = 1
return d
def tally2(iterable):
d = {}
for item in iterable:
try:
d[item] += 1
except KeyError:
d[item] = 1
return d
def tally3(iterable):
d = defaultdict(int)
for item in iterable:
d[item] += 1
def tally4(iterable):
d = Counter()
for item in iterable:
d[item] += 1
def tally5(iterable):
d = {}
dg = d.get
for item in iterable:
d[item] = dg(item, 0) + 1
return d
Typical run (in Windows XP "Command Prompt" window):
prompt>\python27\python -mtimeit -s"t=1000*'now is the winter of our discontent made glorious summer by this son of york';import tally_bench as tb" "tb.tally1(t)"
10 loops, best of 3: 29.5 msec per loop
Here are the results (msec per loop):
0 base case 13.6
1 if k in d 29.5
2 try/except 26.1
3 defaultdict 23.4
4 Counter 79.4
5 d.get(k, 0) 29.2
Another timing trial:
prompt>\python27\python -mtimeit -s"from collections import defaultdict;d=defaultdict(int)" "d[1]+=1"
1000000 loops, best of 3: 0.309 usec per loop
prompt>\python27\python -mtimeit -s"from collections import Counter;d=Counter()" "d[1]+=1"
1000000 loops, best of 3: 1.02 usec per loop
The speed of Counter
is possibly due to it being implemented partly in Python code whereas defaultdict
is entirely in C (in 2.7, at least).
Note that Counter()
is NOT just "syntactic sugar" for defaultdict(int)
-- it implements a full bag
aka multiset
object -- see the docs for details; they may save you from reinventing the wheel if you need some fancy post-processing. If all you want to do is count things, use defaultdict
.
Update in response to a question from @Steven Rumbalski: """ I'm curious, what happens if you move the iterable into the Counter constructor: d = Counter(iterable)? (I have python 2.6 and cannot test it.) """
tally6: just does d = Count(iterable); return d
, takes 60.0 msecs
You could look at the source (collections.py in the SVN repository) ... here's what my Python27\Lib\collections.py
does when iterable
is not a Mapping instance:
self_get = self.get
for elem in iterable:
self[elem] = self_get(elem, 0) + 1
Seen that code anywhere before? There's a whole lot of carry-on just to call code that's runnable in Python 1.5.2 :-O
Counter
is slower however it has much more functionality thandefaultdict(int)
-- see my answer. – Constanta