Find the item with maximum occurrences in a list [duplicate]
Asked Answered
J

14

79

In Python, I have a list:

L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]  

I want to identify the item that occurred the highest number of times. I am able to solve it, but I need the fastest way to do so. I know there is a nice Pythonic answer to this.

Jacks answered 8/8, 2011 at 19:10 Comment(0)
A
18

Here is a defaultdict solution that will work with Python versions 2.5 and above:

from collections import defaultdict

L = [1,2,45,55,5,4,4,4,4,4,4,5456,56,6,7,67]
d = defaultdict(int)
for i in L:
    d[i] += 1
result = max(d.iteritems(), key=lambda x: x[1])  # on Python 3, use d.items()
print result  # on Python 3: print(result)
# (4, 6)
# The number 4 occurs 6 times

Note if L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 7, 7, 7, 7, 7, 56, 6, 7, 67] then there are six 4s and six 7s. However, the result will be (4, 6) i.e. six 4s.
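If you need every tied item rather than an arbitrary one, the same counting dict supports that with one extra pass (a sketch in Python 3 syntax, using the tied list above):

```python
from collections import defaultdict

L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 7, 7, 7, 7, 7, 56, 6, 7, 67]

d = defaultdict(int)
for i in L:
    d[i] += 1

top = max(d.values())                            # highest frequency seen
winners = [k for k, v in d.items() if v == top]  # all items at that frequency
# winners == [4, 7]; both occur 6 times
```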

Agist answered 8/8, 2011 at 19:20 Comment(1)
Pretty minor, but itemgetter(1) may be better than the lambda x: x[1] construct in terms of both simplicity and speed; e.g. see docs.python.org/howto/sorting.html#operator-module-functionsClamant
H
185

I am surprised no one has mentioned the simplest solution: max() with the key lst.count:

max(lst,key=lst.count)

Example:

>>> lst = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
>>> max(lst,key=lst.count)
4

This works in Python 3 or 2, but note that it only returns the most frequent item and not also the frequency. Also, in the case of a draw (i.e. joint most frequent item) only a single item is returned.

Although the time complexity of using max() is worse than using Counter.most_common(1), as PM 2Ring comments, the approach benefits from a fast C implementation, and I find it is fastest for short lists but slower for larger ones (Python 3.6 timings shown in IPython 5.3):

In [1]: from collections import Counter
   ...: 
   ...: def f1(lst):
   ...:     return max(lst, key = lst.count)
   ...: 
   ...: def f2(lst):
   ...:     return Counter(lst).most_common(1)
   ...: 
   ...: lst0 = [1,2,3,4,3]
   ...: lst1 = lst0[:] * 100
   ...: 

In [2]: %timeit -n 10 f1(lst0)
10 loops, best of 3: 3.32 us per loop

In [3]: %timeit -n 10 f2(lst0)
10 loops, best of 3: 26 us per loop

In [4]: %timeit -n 10 f1(lst1)
10 loops, best of 3: 4.04 ms per loop

In [5]: %timeit -n 10 f2(lst1)
10 loops, best of 3: 75.6 us per loop
Hyehyena answered 24/11, 2016 at 11:41 Comment(14)
I would like an explanation how max works together with key=Colostomy
That's a little inefficient, since .count has to scan the entire list for each item, making it O(n²).Ultimatum
Counter is convenient, but it's not known for speed. And when n is relatively small O(n²) can be good enough when you're using a function / method that runs at C speed. But when n is large enough, things can get ugly, as I discuss here.Ultimatum
This is a great answer, exactly what I needed and bonus points for the timings! I was trying to quickly find the outlier class in an output from tensorflow.contrib.factorization.KMeansClustering(). The output for the list(kmeans.predict_cluster_index(input_fn)) is an array with no help functions to access the cluster with the highest occurrences.Equivocation
lst = [[1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67],[1, 2, 45, 55, 5, 5, 5]] max(lst,key=lst.count) in the above example its returning me first list, however I want result like [[4],[5]]Wavelength
@Chris_Rands: great answer ! I found several approaches to this problem on this website. Approach 2 is almost identical to yours, but they first apply the set() operator to the list. I am wondering why this would work: I mean, I am removing all the duplicates from the list and THEN I am using key=list.count.. it doesn't make sense to me. Do you understand this?Itinerancy
this isn't a "little" inefficient. It's O(n²), i.e. very inefficient. If your array has only unique elements (e.g. lst = list(range(1000))), it iterates over the entire array for every element in the array.Dagoba
You could easily improve the speed by changing it to max(lst, key=set(lst).count).Millennial
lst = [1, 1, 1, 1, 4, 3, 3, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 4, 4, 4, 5, 4, 1, 4, 4, 4] as count of 4 and 10 is equal so max(lst,key=lst.count) is giving 4 but i wanted 10, how should modify this? @HyehyenaKosse
@Kosse in the case of a tie you want the highest number? max(lst, key=lambda x: (lst.count(x), x))Hyehyena
@PM2Ring: You may want to recheck Counter. If you're able to write Counter(largish_input) or mycounter.update(largish_input) as opposed to c = Counter() then looping and incrementing, where defaultdict(int) would be equivalent), it has a C accelerator for counting iterables (since 3.2) that dramatically improves performance (and it had a mild inefficiency until 3.5 that double-hashed inputs). I suspect your opinion on Counter's speed dates to before the C accelerator (or involves manual loop-and-inc code).Nabokov
A note: Counter loses more on small inputs because of the most_common call; it has a lot of wrapping that adds fixed overhead to that call (e.g. reimporting heapq lazily to avoid importing it when it's not used, using heapq at all when heapq itself uses no heap-related functionality for the n==1 case). If you inline what .most_common(1)[0][0] ultimately does, replacing return Counter(lst).most_common(1)[0][0] with return max(Counter(lst).items(), key=itemgetter(1))[0], you shave a third off the runtime (on my 3.10 install, f1 is 480 ns, f2 is 1.8 µs, inlined is 1.2 µs).Nabokov
f1 is still faster (for a five element input), but that's largely irrelevant; unless you're counting tiny inputs millions of times, the difference between 0.48 µs and 1.2 (or 1.8) µs is never going to matter; if you're worried about speed, the inputs are likely large, and replacing O(n²) work with O(n) work makes a huge difference there. The original f1(lst1) on my 3.10 install is 1.37 ms, vs. 13 µs for f2 (12.4 µs for the inlined version of most_common(1), still saving the same fixed overhead of ~600 ns). 3x savings when it doesn't matter isn't worth a 100x loss when it does.Nabokov
Thanks for that info, @Shadow. Although it can be very convenient, I've always been a bit suspicious of Counter.most_common() ;)Ultimatum
S
132
from collections import Counter
most_common, num_most_common = Counter(L).most_common(1)[0]  # (4, 6): 4 occurs 6 times

For older Python versions (< 2.7), you can use this recipe to create the Counter class.
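The recipe itself isn't reproduced here; as a rough stand-in for very old interpreters (an assumption, not the linked recipe), a plain dict yields the same (item, count) pair:

```python
def most_common_item(seq):
    # Manual counting with a plain dict; works on any Python version.
    counts = {}
    for item in seq:
        counts[item] = counts.get(item, 0) + 1
    # Pick the (item, count) pair with the largest count.
    return max(counts.items(), key=lambda kv: kv[1])

L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
# most_common_item(L) returns (4, 6)
```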

Schweinfurt answered 8/8, 2011 at 19:16 Comment(2)
See the Counter docs for details.Adams
This will raise an IndexError if your list is empty.Dagoba
E
33

In your question, you asked for the fastest way to do it. As has been demonstrated repeatedly, particularly with Python, intuition is not a reliable guide: you need to measure.

Here's a simple test of several different implementations:

import sys
from collections import Counter, defaultdict
from itertools import groupby
from operator import itemgetter
from timeit import timeit

L = [1,2,45,55,5,4,4,4,4,4,4,5456,56,6,7,67]

def max_occurrences_1a(seq=L):
    "dict iteritems"
    c = dict()
    for item in seq:
        c[item] = c.get(item, 0) + 1
    return max(c.iteritems(), key=itemgetter(1))

def max_occurrences_1b(seq=L):
    "dict items"
    c = dict()
    for item in seq:
        c[item] = c.get(item, 0) + 1
    return max(c.items(), key=itemgetter(1))

def max_occurrences_2(seq=L):
    "defaultdict iteritems"
    c = defaultdict(int)
    for item in seq:
        c[item] += 1
    return max(c.iteritems(), key=itemgetter(1))

def max_occurrences_3a(seq=L):
    "sort groupby generator expression"
    return max(((k, sum(1 for i in g)) for k, g in groupby(sorted(seq))), key=itemgetter(1))

def max_occurrences_3b(seq=L):
    "sort groupby list comprehension"
    return max([(k, sum(1 for i in g)) for k, g in groupby(sorted(seq))], key=itemgetter(1))

def max_occurrences_4(seq=L):
    "counter"
    return Counter(seq).most_common(1)[0]

versions = [max_occurrences_1a, max_occurrences_1b, max_occurrences_2, max_occurrences_3a, max_occurrences_3b, max_occurrences_4]

print sys.version, "\n"

for vers in versions:
    print vers.__doc__, vers(), timeit(vers, number=20000)

The results on my machine:

2.7.2 (v2.7.2:8527427914a2, Jun 11 2011, 15:22:34) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] 

dict iteritems (4, 6) 0.202214956284
dict items (4, 6) 0.208412885666
defaultdict iteritems (4, 6) 0.221301078796
sort groupby generator expression (4, 6) 0.383440971375
sort groupby list comprehension (4, 6) 0.402786016464
counter (4, 6) 0.564319133759

So it appears that the Counter solution is not the fastest. And, in this case at least, groupby is faster. defaultdict is good but you pay a little bit for its convenience; it's slightly faster to use a regular dict with a get.

What happens if the list is much bigger? Adding L *= 10000 to the test above and reducing the repeat count to 200:

dict iteritems (4, 60000) 10.3451900482
dict items (4, 60000) 10.2988479137
defaultdict iteritems (4, 60000) 5.52838587761
sort groupby generator expression (4, 60000) 11.9538850784
sort groupby list comprehension (4, 60000) 12.1327362061
counter (4, 60000) 14.7495789528

Now defaultdict is the clear winner. So perhaps the cost of the 'get' method and the loss of the in-place add add up (an examination of the generated code is left as an exercise).

But with the modified test data, the number of unique item values did not change so presumably dict and defaultdict have an advantage there over the other implementations. So what happens if we use the bigger list but substantially increase the number of unique items? Replacing the initialization of L with:

LL = [1,2,45,55,5,4,4,4,4,4,4,5456,56,6,7,67]
L = []
for i in xrange(1,10001):
    L.extend(l * i for l in LL)

dict iteritems (2520, 13) 17.9935798645
dict items (2520, 13) 21.8974409103
defaultdict iteritems (2520, 13) 16.8289561272
sort groupby generator expression (2520, 13) 33.853593111
sort groupby list comprehension (2520, 13) 36.1303369999
counter (2520, 13) 22.626899004

So now Counter is clearly faster than the groupby solutions but still slower than the iteritems versions of dict and defaultdict.

The point of these examples isn't to produce an optimal solution. The point is that there often isn't one optimal general solution. Plus there are other performance criteria. The memory requirements will differ substantially among the solutions and, as the size of the input goes up, memory requirements may become the overriding factor in algorithm selection.
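As one illustration of the memory point, Counter can consume a lazy generator directly, so peak memory tracks the number of distinct values rather than the input length (a sketch with synthetic data):

```python
from collections import Counter

def stream():
    # A generator: values are produced one at a time and never
    # stored as a full list, so the input can be arbitrarily long.
    for i in range(1000000):
        yield i % 10  # only 10 distinct values

# The Counter holds at most 10 entries here, regardless of input length.
top_item, top_count = Counter(stream()).most_common(1)[0]
# top_count is 100000; top_item is one of the 10 values
```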

Bottom line: it all depends and you need to measure.

Endomorph answered 8/8, 2011 at 21:41 Comment(3)
Interestingly, rerunning this today on Python 3.6, it turns out that the counter seems to outperform the other approaches for long lists.Prosciutto
@moooeeeep: That's because they added a C accelerator for counting iterables in 3.2 (and further optimized it a bit in 3.5 to avoid double-hashing, which is harmless for many built-in types like small ints and str, but costly for other types); before that it was pure Python. Counter was always the simplest, and with the accelerator it's the fastest.Nabokov
@NedDeily: If you get a chance, you might want to rerun these timings on modern Python; for all but the smallest inputs (where the speed rarely matters) Counter will outperform all of these (and it works on iterators without eagerly realizing the entire input in memory, which sorted requires; peak memory ends up proportional to number of unique items, not total). For the small input, #4 loses slightly to #1/2 (beating the rest), but inlining most_common as return max(Counter(L).items(), key=itemgetter(1))[0] ties it up; for the larger input, it beats the competition by a factor of 2+.Nabokov
W
10

If you're using Python 3.8 or above, you can use either statistics.mode() to return the first mode encountered or statistics.multimode() to return all the modes.

>>> import statistics
>>> data = [1, 2, 2, 3, 3, 4] 
>>> statistics.mode(data)
2
>>> statistics.multimode(data)
[2, 3]

If the list is empty, statistics.mode() throws a statistics.StatisticsError and statistics.multimode() returns an empty list.

Note that before Python 3.8, statistics.mode() (introduced in 3.4) would also raise a statistics.StatisticsError if there was not exactly one most common value.
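A defensive wrapper for possibly-empty data could exploit that difference (a sketch; `safe_mode` is a made-up helper name):

```python
import statistics

def safe_mode(data):
    # multimode() returns [] for empty input instead of raising,
    # so it doubles as an emptiness check.
    modes = statistics.multimode(data)
    return modes[0] if modes else None
```

Here safe_mode([1, 2, 2, 3]) returns 2, and safe_mode([]) returns None rather than raising.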

Wardle answered 10/10, 2019 at 0:27 Comment(0)
H
4

A simple way without any libraries or sets

def mcount(l):
    n = []                      # to store the count of each element
    for x in l:
        count = 0
        for i in range(len(l)):
            if x == l[i]:
                count += 1
        n.append(count)
    a = max(n)                  # largest value in the counts list
    for i in range(len(n)):
        if n[i] == a:
            return (l[i], a)    # (element, frequency)
Heirship answered 28/7, 2018 at 6:45 Comment(0)
E
2

Perhaps the most_common() method
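Presumably this means collections.Counter.most_common(); spelled out as a sketch on the question's list:

```python
from collections import Counter

L = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
item, count = Counter(L).most_common(1)[0]  # most frequent (item, count) pair
# item == 4, count == 6
```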

Eulaheulalee answered 8/8, 2011 at 19:20 Comment(0)
D
1

I obtained the best results with groupby from itertools module with this function using Python 3.5.2:

from itertools import groupby

a = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]

def occurrence():
    occurrence, num_times = 0, 0
    # groupby only groups *consecutive* equal elements, so sort first
    for key, values in groupby(sorted(a)):
        val = len(list(values))
        if val >= num_times:   # compare against the running best count
            occurrence, num_times = key, val
    return occurrence, num_times

occurrence, num_times = occurrence()
print("%d occurred %d times which is the highest number of times" % (occurrence, num_times))

Output:

4 occurred 6 times which is the highest number of times

Tested with timeit from the timeit module.

I used this script for my test with number=20000:

from itertools import groupby

def occurrence():
    a = [1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
    occurrence, num_times = 0, 0
    # groupby only groups *consecutive* equal elements, so sort first
    for key, values in groupby(sorted(a)):
        val = len(list(values))
        if val >= num_times:   # compare against the running best count
            occurrence, num_times = key, val
    return occurrence, num_times

if __name__ == '__main__':
    from timeit import timeit
    print(timeit("occurrence()", setup = "from __main__ import occurrence",  number = 20000))

Output (The best one):

0.1893607140000313
Degree answered 25/11, 2016 at 21:26 Comment(0)
C
1

I want to throw in another solution that looks nice and is fast for short lists.

def mc(seq=L):
    "max/count"
    max_element = max(seq, key=seq.count)
    return (max_element, seq.count(max_element))

You can benchmark this with the code provided by Ned Deily which will give you these results for the smallest test case:

3.5.2 (default, Nov  7 2016, 11:31:36) 
[GCC 6.2.1 20160830] 

dict iteritems (4, 6) 0.2069783889998289
dict items (4, 6) 0.20462976200065896
defaultdict iteritems (4, 6) 0.2095775119996688
sort groupby generator expression (4, 6) 0.4473949929997616
sort groupby list comprehension (4, 6) 0.4367636879997008
counter (4, 6) 0.3618192010007988
max/count (4, 6) 0.20328268999946886

But beware, it is inefficient and thus gets really slow for large lists!

Chloromycetin answered 6/12, 2016 at 22:52 Comment(0)
S
1

Simple (if not the fastest) code:

def max_occ(lst,x):
    count=0
    for i in lst:
        if (i==x):
            count=count+1
    return count

lst=[1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
x=max(lst,key=lst.count)
print(x,"occurs ",max_occ(lst,x),"times")

Output: 4 occurs 6 times

Stepfather answered 9/10, 2018 at 9:9 Comment(0)
A
1

My (simple) code (three months studying Python):

def more_frequent_item(lst):
    new_lst = []
    for item in lst:
        count_num = lst.count(item)
        new_lst.append(count_num)
    times = max(new_lst)           # highest count seen
    key = max(lst, key=lst.count)  # the item with that count
    print("In the list: ")
    print(lst)
    print("The most frequent item is " + str(key) + ". Appears " + str(times) + " times in this list.")


more_frequent_item([1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67])

The output will be:

In the list: 
[1, 2, 45, 55, 5, 4, 4, 4, 4, 4, 4, 5456, 56, 6, 7, 67]
The most frequent item is 4. Appears 6 times in this list.
Amalea answered 16/6, 2019 at 22:41 Comment(0)
O
1

If you are using NumPy in your solution, bincount gives a fast answer (note it requires non-negative integers):

import numpy as np

x = np.array([2, 5, 77, 77, 77, 77, 77, 77, 77, 9, 0, 3, 3, 3, 3, 3])
y = np.bincount(x)  # counts per non-negative integer value; length is max(x) + 1
y = np.argmax(y)    # the index (i.e. the value) with the highest count
print(y)            # outputs 77
Ostosis answered 4/1, 2021 at 14:26 Comment(0)
A
0

Following is the solution I came up with for the case where multiple characters in the string share the highest frequency.

mystr = input("enter string: ")
#define dictionary to store characters and their frequencies
mydict = {}
#get the unique characters
unique_chars = sorted(set(mystr),key = mystr.index)
#store the characters and their respective frequencies in the dictionary
for c in unique_chars:
    ctr = 0
    for d in mystr:
        if d != " " and d == c:
            ctr = ctr + 1
    mydict[c] = ctr
print(mydict)
#store the maximum frequency
max_freq = max(mydict.values())
print("the highest frequency of occurence: ",max_freq)
#print all characters with highest frequency
print("the characters are:")
for k,v in mydict.items():
    if v == max_freq:
        print(k)

Input: "hello people"

Output:

{'o': 2, 'p': 2, 'h': 1, ' ': 0, 'e': 3, 'l': 3}

the highest frequency of occurence: 3

the characters are:

e

l
Autostrada answered 2/7, 2017 at 23:58 Comment(0)
C
-3

Maybe something like this:

testList = [1, 2, 3, 4, 2, 2, 1, 4, 4]
print(max(set(testList), key=testList.count))

Curule answered 13/6, 2018 at 10:50 Comment(1)
a Set data structure will remove duplicates rendering your count irrelevant.Pembroke

© 2022 - 2024 — McMap. All rights reserved.