Fastest way to compute entropy in Python

74

In my project I need to compute the entropy of 0-1 vectors many times. Here's my code:

import numpy as np

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0
    return - np.sum(probs * np.log(probs)) / np.log(n_classes)

Is there a faster way?
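For the 0-1 case specifically, the distribution is fully determined by the fraction of ones, so a minimal sketch like the following (assuming labels is a list or NumPy array containing only 0s and 1s; it should match the log(n_classes)-normalized value computed above) avoids bincount entirely:

import numpy as np

def binary_entropy(labels):
    """ Sketch: entropy of a 0-1 vector from the fraction of ones. """
    p = np.mean(labels)              # fraction of ones
    if p == 0 or p == 1:
        return 0                     # only one class present
    q = 1.0 - p
    # base-2 entropy of a Bernoulli(p), equal to the normalized entropy above
    return -(p * np.log2(p) + q * np.log2(q))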

Arista answered 16/3, 2013 at 14:1 Comment(6)
What is a typical length of labels?Cropper
The length is not fixed.Arista
It would help with benchmarking to know typical values of labels. If labels is too short, a pure python implementation could actually be faster than using NumPy.Cropper
just to confirm, this question is for entropy of a discrete (binary) random variable? and not differential entropy of a continuous r.v.?Emigrate
I don't know, how fast does your code run? This is opinion-based in its current format.Lingenfelter
I think computing the entropy of a sequence is more involved than just that. For example if the sequence is 01010101010101010101..., then your function will say that entropy is 1 (base 2) but it is actually 0.Collar
59

@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see times in one answer, so I'll post a comparison of scipy and numpy against the original poster's entropy2 answer, with slight alterations.

Four different approaches: (1) scipy/numpy, (2) numpy/math, (3) pandas/numpy, (4) numpy

import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd

import timeit

def entropy1(labels, base=None):
  value,counts = np.unique(labels, return_counts=True)
  return entropy(counts, base=base)

def entropy2(labels, base=None):
  """ Computes entropy of label distribution. """

  n_labels = len(labels)

  if n_labels <= 1:
    return 0

  value,counts = np.unique(labels, return_counts=True)
  probs = counts / n_labels
  n_classes = np.count_nonzero(probs)

  if n_classes <= 1:
    return 0

  ent = 0.

  # Compute entropy
  base = e if base is None else base
  for i in probs:
    ent -= i * log(i, base)

  return ent

def entropy3(labels, base=None):
  vc = pd.Series(labels).value_counts(normalize=True, sort=False)
  base = e if base is None else base
  return -(vc * np.log(vc)/np.log(base)).sum()

def entropy4(labels, base=None):
  value,counts = np.unique(labels, return_counts=True)
  norm_counts = counts / counts.sum()
  base = e if base is None else base
  return -(norm_counts * np.log(norm_counts)/np.log(base)).sum()
    

Timeit operations:

repeat_number = 1000000

a = timeit.repeat(stmt='''entropy1(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
                  repeat=3, number=repeat_number)

b = timeit.repeat(stmt='''entropy2(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
                  repeat=3, number=repeat_number)

c = timeit.repeat(stmt='''entropy3(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
                  repeat=3, number=repeat_number)

d = timeit.repeat(stmt='''entropy4(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
                  repeat=3, number=repeat_number)

Timeit results:

# for loop to print out results of timeit
for approach,timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'], [a,b,c,d]):
  print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))

Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938

Winner: numpy/math (entropy2)

It's also worth noting that the entropy2 function above can handle numeric AND text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use case for binning ints, but it won't work for text.
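A quick sketch of that difference (the inputs are only illustrative and assume the functions defined above):

print(entropy2([1, 3, 5, 2, 3, 5, 3, 2, 1, 3, 4, 5]))  # numeric labels
print(entropy2(list('abcdefabacdebcab')))               # text labels work too
# the question's bincount-based entropy() fails on the text input,
# because np.bincount only accepts non-negative integers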

Behoof answered 13/7, 2017 at 22:37 Comment(4)
You're using such a small array that your tests are basically useless. You're really just measuring call overhead for the various interfaces.Teflon
Using this code I just got the timing for my answer ("An answer that doesn't rely on numpy, either...") as well -- and it's Method: eta, Avg.: 10.461799. As someone suggested, I wonder if you're actually testing call overhead here.Fluker
It's better to take the minimum of the timeit results, instead of the mean. See the "note" under the repeat function of the timeit module.Cynthla
Vectorized operations may be preferred for large inputs. The 'for' loop in option 2 may not be a good choice in such cases. In my view, option 1 may be preferable.Nesta
39

With the data as a pd.Series and scipy.stats, calculating the entropy of a given quantity is pretty straightforward:

import pandas as pd
import scipy.stats

def ent(data):
    """Calculates entropy of the passed `pd.Series`
    """
    p_data = data.value_counts()           # counts occurrence of each value
    entropy = scipy.stats.entropy(p_data)  # get entropy from counts
    return entropy

Note: scipy.stats.entropy will normalize the provided values, so this doesn't need to be done explicitly, i.e. passing an array of counts works fine.
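A small sketch of that point (assuming a short integer-labelled Series), showing that raw counts and explicitly normalized probabilities give the same value:

import pandas as pd
import scipy.stats

data = pd.Series([1, 3, 5, 2, 3, 5, 3, 2, 1, 3, 4, 5])
counts = data.value_counts()        # raw counts per label
probs = counts / counts.sum()       # explicit normalization
print(scipy.stats.entropy(counts))  # scipy normalizes the counts internally
print(scipy.stats.entropy(probs))   # same value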

Hosea answered 29/8, 2016 at 16:18 Comment(1)
The value_counts() method just counts. The probability would be p_data = data.value_counts()/sum(data.value_counts().values), i.e. the count for each class divided by the total amount of data, which gives the probability of each class.Worley
21

An answer that doesn't rely on numpy, either:

import math
from collections import Counter

def eta(data, unit='natural'):
    base = {
        'shannon' : 2.,
        'natural' : math.exp(1),
        'hartley' : 10.
    }

    if len(data) <= 1:
        return 0

    counts = Counter()

    for d in data:
        counts[d] += 1

    ent = 0

    probs = [float(c) / len(data) for c in counts.values()]
    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])

    return ent

This will accept any datatype you could throw at it:

>>> eta(['mary', 'had', 'a', 'little', 'lamb'])
1.6094379124341005

>>> eta([c for c in "mary had a little lamb"])
2.311097886212714

The answer provided by @Jarad suggested timings as well. To that end:

repeat_number = 1000000
e = timeit.repeat(
    stmt='''eta(labels)''', 
    setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import eta''', 
    repeat=3, 
    number=repeat_number)

Timeit results: (I believe this is ~4x faster than the best numpy approach)

print('Method: {}, Avg.: {:.6f}'.format("eta", np.array(e).mean()))

Method: eta, Avg.: 10.461799
Fluker answered 17/6, 2016 at 21:44 Comment(4)
why do you need probs = [p for p in probs if p > 0.]?Fritillary
Since I'm doing that test five lines later I suspect I don't need it at all :) Edited.Fluker
plus one for no new dependenciesFlighty
Can you do counts = Counter(data) instead of looping over the characters of data?Stepdaughter
12

Following the suggestion from unutbu, I created a pure Python implementation.

import numpy as np
from math import log

def entropy2(labels):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute entropy with log base n_classes (skip zero-count bins to avoid log(0)).
    for i in probs:
        if i > 0:
            ent -= i * log(i, n_classes)

    return ent

The point I was missing was that labels is a large array, whereas probs is only 3 or 4 elements long. Using pure Python for the final loop, my application is now twice as fast.
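A sketch of that observation (the sizes are made up; the expensive counting stays in NumPy while the Python loop only touches a handful of probabilities):

import numpy as np

labels = np.random.randint(0, 2, size=1_000_000)  # large 0-1 label vector
counts = np.bincount(labels)                      # heavy lifting done by NumPy
probs = counts / len(labels)                      # only 2 entries for 0-1 data
print(probs.size)                                 # 2, so the pure-Python loop is cheap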

Arista answered 18/3, 2013 at 12:33 Comment(1)
Should 'base' be set to the number of classes? I thought the natural log was the standard (and what you used in your original question).Fluker
10

Uniformly distributed data (high entropy):

s=range(0,256)

Shannon entropy calculation step by step:

import collections
import math

# calculate probability for each byte as number of occurrences / array length
probabilities = [n_x/len(s) for x,n_x in collections.Counter(s).items()]
# [0.00390625, 0.00390625, 0.00390625, ...]

# calculate per-character entropy fractions
e_x = [-p_x*math.log(p_x,2) for p_x in probabilities]
# [0.03125, 0.03125, 0.03125, ...]

# sum fractions to obtain Shannon entropy
entropy = sum(e_x)
>>> entropy 
8.0

One-liner (assuming import collections and import math):

def H(s): return sum([-p_x*math.log(p_x,2) for p_x in [n_x/len(s) for x,n_x in collections.Counter(s).items()]])

A proper function:

import collections
import math

def H(s):
    probabilities = [n_x/len(s) for x,n_x in collections.Counter(s).items()]
    e_x = [-p_x*math.log(p_x,2) for p_x in probabilities]    
    return sum(e_x)

Test cases (the expected range for English text is taken from CyberChef's entropy estimator):

>>> H(range(0,256))
8.0
>>> H(range(0,64))
6.0
>>> H(range(0,128))
7.0
>>> H([0,1])
1.0
>>> H('Standard English text usually falls somewhere between 3.5 and 5')
4.228788210509104
Putumayo answered 17/11, 2017 at 10:24 Comment(1)
This makes it very clear how to calculate entropy over a specified range of values. I need to apply this method to the 8-connected area around a pixel and its grayscale values. I wonder if I could do this with a built-in method as well.Porras
10

Here is my approach:

labels = [0, 0, 1, 1]

from collections import Counter
from scipy import stats

stats.entropy(list(Counter(labels).values()), base=2)
Nostalgia answered 10/6, 2018 at 11:13 Comment(1)
This seems to work for my image slices but I actually need the probability of pixel values in the slice from 0 to 255.Porras
9

My favorite function for entropy is the following:

import numpy as np

def entropy(labels):
    prob_dict = {x:labels.count(x)/len(labels) for x in labels}
    probs = np.array(list(prob_dict.values()))

    return - probs.dot(np.log2(probs))

I am still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. Will comment again if I find it.
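One possible way to skip the dict entirely (a sketch using np.unique, which returns the counts directly as an array; the function name is just for illustration and I have not benchmarked it):

import numpy as np

def entropy_no_dict(labels):
    _, counts = np.unique(labels, return_counts=True)  # counts as an ndarray
    probs = counts / counts.sum()
    return -probs.dot(np.log2(probs))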

Cavite answered 17/2, 2017 at 10:23 Comment(2)
Nice; using collections.Counter would be better.Habituate
In Python 2, labels.count(x)/len(labels) should be labels.count(x)/float(len(labels)).Rizzo
2

This method extends the other solutions by allowing for binning. For example, bins=None (the default) won't bin x and will compute an empirical probability for each element of x, while bins=256 chunks x into 256 bins before computing the empirical probabilities.

import numpy as np

def entropy(x, bins=None):
    N   = x.shape[0]
    if bins is None:
        counts = np.bincount(x)
    else:
        counts = np.histogram(x, bins=bins)[0] # 0th idx is counts
    p   = counts[np.nonzero(counts)]/N # avoids log(0)
    H   = -np.dot( p, np.log2(p) )
    return H 
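A usage sketch (the data here is made up) contrasting the two modes:

import numpy as np

x = np.random.randint(0, 1024, size=10_000)  # illustrative non-negative ints
print(entropy(x))                            # empirical probability per distinct value
print(entropy(x, bins=256))                  # values grouped into 256 histogram bins first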
Chambless answered 8/1, 2020 at 21:2 Comment(0)
2

This is the fastest Python implementation I've found so far:

import numpy as np

def entropy(labels):
    ps = np.bincount(labels) / len(labels)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])
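A timing sketch in the spirit of the comparison in the other answer (assuming this entropy is what gets imported from __main__; the numbers depend on the machine, so none are quoted here):

import timeit
import numpy as np

repeat_number = 1000000
t = timeit.repeat(stmt='entropy(labels)',
                  setup='labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy',
                  repeat=3, number=repeat_number)
print('Method: bincount/log2, Avg.: {:.6f}'.format(np.array(t).mean()))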
Achorn answered 22/9, 2021 at 14:9 Comment(1)
It would have been better, in addition to the example, to test it and compare it with the others, don't you think? Could you perhaps do that?Mckamey
1
from collections import Counter
from scipy import stats

labels = [0.9, 0.09, 0.1]
stats.entropy(list(Counter(labels).keys()), base=2)
Discriminating answered 28/6, 2019 at 15:47 Comment(1)
While this may answer the question, code-only answers are generally regarded as low-quality. Providing some more description and context on why will improve the quality of this answer. Thanks.Teniafuge
1

BiEntropy won't be the fastest way of computing entropy, but it is rigorous and builds upon Shannon entropy in a well-defined way. It has been tested in various fields, including image-related applications. It is implemented in Python on GitHub.

Marquardt answered 17/1, 2020 at 22:27 Comment(0)
1

A bit late to the party, but I stumbled on this, and all the answers seem to rely on Kullback–Leibler divergence, which has no upper bound and hence doesn't fit my needs.

Here I have an approximation (the TODOs could be improved) of an entropy-like function that ranges over [0, 1].

It calculates the bias of a single column.

import numpy as np

class Pandas_Dataframe_helper:
    #some other methods here...
    @staticmethod
    def column_biass(df_column):
        df_column_as_list           =   list(df_column)
        N                           =   len(df_column_as_list)
        values,counts               =   np.unique(df_column_as_list, return_counts=True)
        #generate a synthetic, perfectly uniform list
        #(TODO! what if N is not evenly divisible? use the least common multiple of (num_different_labels, [x for x in counts]))
        num_different_labels        =   len(values)
        num_items_per_label         =   N // num_different_labels
        synthetic_list              =   []
        for current_value in values:
            synthetic_list.extend([current_value] * num_items_per_label)
        #TODO! approximation: pad the remainder with the last value
        if(len(synthetic_list) != len(df_column_as_list)):
            synthetic_list.extend([current_value] * (len(df_column_as_list) - len(synthetic_list)))
        #now, count mismatches between the sorted input list and the synthetic list
        df_column_as_list_sorted    =   sorted(df_column_as_list)
        counter_unmatches           =   0
        for i in range(0,N):
            if(df_column_as_list_sorted[i] != synthetic_list[i]):
                counter_unmatches   +=  1
        #upper_bound = g(N, num_different_labels) = ((K-1)*M)-1,
        #where K == num_different_labels and M == items per label in a theoretically perfect distribution
        upper_bound                 =   ((num_different_labels-1)*num_items_per_label)-1
        return counter_unmatches/upper_bound
    #---------------------------------------------------------------------------------------------------------------------

Complete code at https://github.com/glezo1/pcommonlibs/blob/master/com/glezo/pandas_dataframe_helper/Pandas_Dataframe_Helper.py
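A usage sketch (with a made-up, skewed column; the exact value depends on the approximation above):

import pandas as pd

col = pd.Series(['a', 'a', 'a', 'a', 'b', 'c'])   # heavily skewed column
print(Pandas_Dataframe_helper.column_biass(col))  # a value between 0 and 1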

Saturnian answered 24/10, 2021 at 18:49 Comment(0)
0

The above answer is good, but if you need a version that can operate along different axes, here's a working implementation.

import numpy as np

def entropy(A, axis=None):
    """Computes the Shannon entropy of the elements of A. Assumes A is 
    an array-like of nonnegative ints whose max value is approximately 
    the number of unique values present.

    >>> a = [0, 1]
    >>> entropy(a)
    1.0
    >>> A = np.c_[a, a]
    >>> entropy(A)
    1.0
    >>> A                   # doctest: +NORMALIZE_WHITESPACE
    array([[0, 0], [1, 1]])
    >>> entropy(A, axis=0)  # doctest: +NORMALIZE_WHITESPACE
    array([ 1., 1.])
    >>> entropy(A, axis=1)  # doctest: +NORMALIZE_WHITESPACE
    array([[ 0.], [ 0.]])
    >>> entropy([0, 0, 0])
    0.0
    >>> entropy([])
    0.0
    >>> entropy([5])
    0.0
    """
    if A is None or len(A) < 2:
        return 0.

    A = np.asarray(A)

    if axis is None:
        A = A.flatten()
        counts = np.bincount(A) # needs small, non-negative ints
        counts = counts[counts > 0]
        if len(counts) == 1:
            return 0. # avoid returning -0.0 to prevent weird doctests
        probs = counts / float(A.size)
        return -np.sum(probs * np.log2(probs))
    elif axis == 0:
        entropies = [entropy(col) for col in A.T]  # list, not map, so np.array works on Python 3
        return np.array(entropies)
    elif axis == 1:
        entropies = [entropy(row) for row in A]
        return np.array(entropies).reshape((-1, 1))
    else:
        raise ValueError("unsupported axis: {}".format(axis))
Demigod answered 27/8, 2015 at 23:45 Comment(0)
-1
import math

def entropy(base, prob_a, prob_b):
    # entropy of a two-outcome distribution with probabilities prob_a and prob_b
    return -(prob_a * math.log(prob_a, base) + prob_b * math.log(prob_b, base))
Struma answered 28/12, 2019 at 23:42 Comment(2)
When you answer with code, you should write some explanation.Malvasia
In the question, the argument is a label list.Vinylidene
