Fastest way to compute entropy in Python

Asked 16/3, 2013 at 14:1 Answered 24/10, 2021 at 18:49

In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0
    return - np.sum(probs * np.log(probs)) / np.log(n_classes)

Is there a faster way?

Arista answered 16/3, 2013 at 14:1 Comment(6)

What is a typical length of labels? – Cropper 16/3, 2013 at 14:9

The length is not fixed.. – Arista 16/3, 2013 at 14:12

It would help with benchmarking to know typical values of labels. If labels is too short, a pure python implementation could actually be faster than using NumPy. – Cropper 16/3, 2013 at 14:13

just to confirm, this question is for entropy of a discrete (binary) random variable? and not differential entropy of a continuous r.v.? – Emigrate 4/8, 2020 at 8:37

I don't know, how fast does your code run? This is opinion-based in its current format. – Lingenfelter 3/11, 2021 at 19:26

I think computing the entropy of a sequence is more involved than just that. For example if the sequence is 01010101010101010101..., then your function will say that entropy is 1 (base 2) but it is actually 0. – Collar 22/4, 2022 at 8:37

@Sanjeet Gupta answer is good but could be condensed. This question is specifically asking about the "Fastest" way but I only see times on one answer so I'll post a comparison of using scipy and numpy to the original poster's entropy2 answer with slight alterations.

Four different approaches: (1) scipy/numpy, (2) numpy/math, (3) pandas/numpy, (4) numpy

import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd

import timeit

def entropy1(labels, base=None):
  value,counts = np.unique(labels, return_counts=True)
  return entropy(counts, base=base)

def entropy2(labels, base=None):
  """ Computes entropy of label distribution. """

  n_labels = len(labels)

  if n_labels <= 1:
    return 0

  value,counts = np.unique(labels, return_counts=True)
  probs = counts / n_labels
  n_classes = np.count_nonzero(probs)

  if n_classes <= 1:
    return 0

  ent = 0.

  # Compute entropy
  base = e if base is None else base
  for i in probs:
    ent -= i * log(i, base)

  return ent

def entropy3(labels, base=None):
  vc = pd.Series(labels).value_counts(normalize=True, sort=False)
  base = e if base is None else base
  return -(vc * np.log(vc)/np.log(base)).sum()

def entropy4(labels, base=None):
  value,counts = np.unique(labels, return_counts=True)
  norm_counts = counts / counts.sum()
  base = e if base is None else base
  return -(norm_counts * np.log(norm_counts)/np.log(base)).sum()

Timeit operations:

repeat_number = 1000000

a = timeit.repeat(stmt='''entropy1(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
                  repeat=3, number=repeat_number)

b = timeit.repeat(stmt='''entropy2(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
                  repeat=3, number=repeat_number)

c = timeit.repeat(stmt='''entropy3(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
                  repeat=3, number=repeat_number)

d = timeit.repeat(stmt='''entropy4(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
                  repeat=3, number=repeat_number)

Timeit results:

# for loop to print out results of timeit
for approach,timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'], [a,b,c,d]):
  print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))

Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938

Winner: numpy/math (entropy2)

It's also worth noting that the entropy2 function above can handle numeric AND text data. ex: entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use-case for binning ints but it won't work for text.

Behoof answered 13/7, 2017 at 22:37 Comment(4)

You're using such a small array that your tests are basically useless. You're really just measuring call overhead for the various interfaces. – Teflon 8/7, 2018 at 4:3

Using this code I just got the timing for my answer ("An answer that doesn't rely on numpy, either...") as well -- and it's Method: eta, Avg.: 10.461799. As someone suggested, I wonder if you're actually testing call overhead here. – Fluker 17/9, 2018 at 11:40

It's better to take the minimum of the timeit results, instead of the mean. See the "note" under the repeat function of the timeit module. – Cynthla 19/12, 2019 at 0:6

vectorizer operations may be preferred for large inputs. The 'for' loop in option 2 may not be a good option in such cases. In my view option 1 may be preferable. – Nesta 20/9, 2021 at 14:15

With the data as a pd.Series and scipy.stats, calculating the entropy of a given quantity is pretty straightforward:

import pandas as pd
import scipy.stats

def ent(data):
    """Calculates entropy of the passed `pd.Series`
    """
    p_data = data.value_counts()           # counts occurrence of each value
    entropy = scipy.stats.entropy(p_data)  # get entropy from counts
    return entropy

Note: scipy.stats will normalize the provided data, so this doesn't need to be done explicitly, i.e. passing an array of counts works fine.

Hosea answered 29/8, 2016 at 16:18 Comment(1)

The value_counts() method just counts. The probabily p_data wouldn't be p_data = data.value_counts()/sum(data.value_counts().values) So you will have the count for each class divided by the total amount of data, then you have the probability of each class. – Worley 28/3, 2022 at 14:3

An answer that doesn't rely on numpy, either:

import math
from collections import Counter

def eta(data, unit='natural'):
    base = {
        'shannon' : 2.,
        'natural' : math.exp(1),
        'hartley' : 10.
    }

    if len(data) <= 1:
        return 0

    counts = Counter()

    for d in data:
        counts[d] += 1

    ent = 0

    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])

    return ent

This will accept any datatype you could throw at it:

>>> eta(['mary', 'had', 'a', 'little', 'lamb'])
1.6094379124341005

>>> eta([c for c in "mary had a little lamb"])
2.311097886212714

The answer provided by @Jarad suggested timings as well. To that end:

repeat_number = 1000000
e = timeit.repeat(
    stmt='''eta(labels)''', 
    setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import eta''', 
    repeat=3, 
    number=repeat_number)

Timeit results: (I believe this is ~4x faster than the best numpy approach)

print('Method: {}, Avg.: {:.6f}'.format("eta", np.array(e).mean()))

Method: eta, Avg.: 10.461799

Fluker answered 17/6, 2016 at 21:44 Comment(4)

why do you need probs = [p for p in probs if p > 0.]? – Fritillary 20/8, 2018 at 5:41

Since I'm doing that test five lines later I suspect I don't need it at all :) Edited. – Fluker 17/9, 2018 at 11:28

plus one for no new dependencies – Flighty 9/4, 2020 at 15:36

Can you do counts = Counter(data) instead of looping over the characters of data? – Stepdaughter 23/10, 2020 at 4:22

Following the suggestion from unutbu I create a pure python implementation.

def entropy2(labels):
 """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute standard entropy.
    for i in probs:
        ent -= i * log(i, base=n_classes)

    return ent

The point I was missing was that labels is a large array, however probs is 3 or 4 elements long. Using pure python my application now is twice as fast.

Arista answered 18/3, 2013 at 12:33 Comment(1)

Should 'base' be set to the number of classes? I thought there the natural log was the standard (and what you used in your original question.) – Fluker 17/6, 2016 at 21:22

Uniformly distributed data (high entropy):

s=range(0,256)

Shannon entropy calculation step by step:

import collections
import math

# calculate probability for each byte as number of occurrences / array length
probabilities = [n_x/len(s) for x,n_x in collections.Counter(s).items()]
# [0.00390625, 0.00390625, 0.00390625, ...]

# calculate per-character entropy fractions
e_x = [-p_x*math.log(p_x,2) for p_x in probabilities]
# [0.03125, 0.03125, 0.03125, ...]

# sum fractions to obtain Shannon entropy
entropy = sum(e_x)
>>> entropy 
8.0

One-liner (assuming import collections):

def H(s): return sum([-p_x*math.log(p_x,2) for p_x in [n_x/len(s) for x,n_x in collections.Counter(s).items()]])

A proper function:

import collections
import math

def H(s):
    probabilities = [n_x/len(s) for x,n_x in collections.Counter(s).items()]
    e_x = [-p_x*math.log(p_x,2) for p_x in probabilities]    
    return sum(e_x)

Test cases - English text taken from CyberChef entropy estimator:

>>> H(range(0,256))
8.0
>>> H(range(0,64))
6.0
>>> H(range(0,128))
7.0
>>> H([0,1])
1.0
>>> H('Standard English text usually falls somewhere between 3.5 and 5')
4.228788210509104

Putumayo answered 17/11, 2017 at 10:24 Comment(1)

This makes it very clear regarding ability to calculate entropy over a specified range of values. I need to apply this method to the 8-connected area around a pixel and their grayscale values. Wondering if I could do with a built-in method as well. – Porras 12/11, 2019 at 8:9

Here is my approach:

labels = [0, 0, 1, 1]

from collections import Counter
from scipy import stats

stats.entropy(list(Counter(labels).values()), base=2)

Nostalgia answered 10/6, 2018 at 11:13 Comment(1)

This seems to work for my image slices but I actually need the probability of pixel values in the slice from 0 to 255. – Porras 12/11, 2019 at 8:52

My favorite function for entropy is the following:

def entropy(labels):
    prob_dict = {x:labels.count(x)/len(labels) for x in labels}
    probs = np.array(list(prob_dict.values()))

    return - probs.dot(np.log2(probs))

I am still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. Will comment again if I found it.

Cavite answered 17/2, 2017 at 10:23 Comment(2)

nice, use collections.Counter would be better. – Habituate 20/5, 2017 at 16:22

In python2, labels.count(x)/len(labels) should be labels.count(x)/float(len(labels)) – Rizzo 10/2, 2019 at 10:26

This method extends the other solutions by allowing for binning. For example, bin=None (default) won't bin x and will compute an empirical probability for each element of x, while bin=256 chunks x into 256 bins before computing the empirical probabilities.

import numpy as np

def entropy(x, bins=None):
    N   = x.shape[0]
    if bins is None:
        counts = np.bincount(x)
    else:
        counts = np.histogram(x, bins=bins)[0] # 0th idx is counts
    p   = counts[np.nonzero(counts)]/N # avoids log(0)
    H   = -np.dot( p, np.log2(p) )
    return H

Chambless answered 8/1, 2020 at 21:2 Comment(0)

This is the fastest Python implementation I've found so far:

import numpy as np

def entropy(labels):
    ps = np.bincount(labels) / len(labels)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])

Achorn answered 22/9, 2021 at 14:9 Comment(1)

Would have been better in addition to the exemple to test it out and compare it, wouldn't think so? could you perhaps do it? – Mckamey 23/2 at 1:41

from collections import Counter
from scipy import stats

labels = [0.9, 0.09, 0.1]
stats.entropy(list(Counter(labels).keys()), base=2)

Discriminating answered 28/6, 2019 at 15:47 Comment(1)

While this may answer the question, code only answers are generally regarded as lo-quality. Providing some more description and context on why will improve the quality of this answer. Thanks. – Teniafuge 28/6, 2019 at 23:42

BiEntropy wont be the fastest way of computing entropy, but it is rigorous and builds upon Shannon Entropy in a well defined way. It has been tested in various fields including image related applications. It is implemented in Python on Github.

Marquardt answered 17/1, 2020 at 22:27 Comment(0)

Bit late for the party, but I stumbled at this and all answers seems to rely on Kullback–Leibler divergence, which has no upper bound, and hence, doesn't fit my needs.

Here I have an approximation (the TODO!could be improved) of an entropy function that goes from [0,1].

It calculates the biass of a single column.

class Pandas_Dataframe_helper:
    #some other methods here...
    @staticmethod
    def column_biass(df_column):
        df_column_as_list           =   list(df_column)
        N                           =   len(df_column_as_list)
        values,counts               =   np.unique(df_column_as_list, return_counts=True)
        #generate synth list (TODO! what if not even number? Minimum Comun Multiple of(num_different_labels,[x for x in counts]))
        num_different_labels        =   len(values)
        num_items_per_label         =   N // num_different_labels
        synthetic_list              =   []
        for current_value in values:
            synthetic_list.extend([current_value] * num_items_per_label)
        #TODO! aproximacion
        if(len(synthetic_list) != len(df_column_as_list)):
            synthetic_list.extend([current_value] * (len(df_column_as_list) - len(synthetic_list)))
        #now, extrapolate differences between sorted-input-list and synsthetic_list
        df_column_as_list_sorted    =   sorted(df_column_as_list)
        counter_unmatches           =   0
        for i in range(0,N):
            if(df_column_as_list_sorted[i] != synthetic_list[i]):
                counter_unmatches   +=  1
        #upper_bound = g(N,num_different_labels)
        #((K-1)M)-1 K==num_different_labels , M==num theorically perfect distribution's items per label 
        upper_bound                 =   ((num_different_labels-1)*num_items_per_label)-1
        return counter_unmatches/upper_bound
    #---------------------------------------------------------------------------------------------------------------------

Complete code at https://github.com/glezo1/pcommonlibs/blob/master/com/glezo/pandas_dataframe_helper/Pandas_Dataframe_Helper.py

Saturnian answered 24/10, 2021 at 18:49 Comment(0)

The above answer is good, but if you need a version that can operate along different axes, here's a working implementation.

def entropy(A, axis=None):
    """Computes the Shannon entropy of the elements of A. Assumes A is 
    an array-like of nonnegative ints whose max value is approximately 
    the number of unique values present.

    >>> a = [0, 1]
    >>> entropy(a)
    1.0
    >>> A = np.c_[a, a]
    >>> entropy(A)
    1.0
    >>> A                   # doctest: +NORMALIZE_WHITESPACE
    array([[0, 0], [1, 1]])
    >>> entropy(A, axis=0)  # doctest: +NORMALIZE_WHITESPACE
    array([ 1., 1.])
    >>> entropy(A, axis=1)  # doctest: +NORMALIZE_WHITESPACE
    array([[ 0.], [ 0.]])
    >>> entropy([0, 0, 0])
    0.0
    >>> entropy([])
    0.0
    >>> entropy([5])
    0.0
    """
    if A is None or len(A) < 2:
        return 0.

    A = np.asarray(A)

    if axis is None:
        A = A.flatten()
        counts = np.bincount(A) # needs small, non-negative ints
        counts = counts[counts > 0]
        if len(counts) == 1:
            return 0. # avoid returning -0.0 to prevent weird doctests
        probs = counts / float(A.size)
        return -np.sum(probs * np.log2(probs))
    elif axis == 0:
        entropies = map(lambda col: entropy(col), A.T)
        return np.array(entropies)
    elif axis == 1:
        entropies = map(lambda row: entropy(row), A)
        return np.array(entropies).reshape((-1, 1))
    else:
        raise ValueError("unsupported axis: {}".format(axis))

Demigod answered 27/8, 2015 at 23:45 Comment(0)

-1

def entropy(base, prob_a, prob_b ):
  import math
  base=2
  x=prob_a
  y=prob_b
  expression =-((x*math.log(x,base)+(y*math.log(y,base))))    
  return [expression]

Struma answered 28/12, 2019 at 23:42 Comment(2)

When you answer with code, you should write some explanation. – Malvasia 29/12, 2019 at 0:12

In the question the argument is a label list – Vinylidene 5/4, 2021 at 23:31

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags