How to calculate the entropy of a file?
How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.

My idea is the following:

  • Create an array of 256 integers (all zeros).
  • Traverse through the file and for each of its bytes,
    increment the corresponding position in the array.
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero,
    and for each of the array's entries:
    add the entry's difference to "average" to the counter.

Well, now I'm stuck. How do I "project" the counter result in such a way that all results lie between 0.0 and 1.0? But I'm sure the idea is inconsistent anyway...

I hope someone has a better and simpler solution?

Note: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)

Rubric answered 13/6, 2009 at 10:23 Comment(5)
Check out Scanning data for entropy anomaliesDirectly
You mean a metric entropy? entropy divided by length of messageCelisse
Ouch, that note you added: Note: I need the whole thing to make assumptions on the file's contents: (plaintext, markup, compressed or some binary, ...) ... You just asked for godlike magic, good luck with developing provably optimal data compression.Catarinacatarrh
Can you please publish a pseudo code of your final result?Wizardly
related What Linux software can I use to explore entropy of a file? -> binwalk, radare2Sylvan
52
  • At the end: Calculate the "average" value for the array.
  • Initialize a counter with zero, and for each of the array's entries: add the entry's difference to "average" to the counter.

With some modifications you can get Shannon's entropy:

rename "average" to "entropy"

float entropy = 0
for i in 0..255:
    float p = Counts[i] / filesize
    if p > 0: entropy = entropy - p * lg(p)   // lg is the logarithm with base 2

Edit: As Wesley mentioned, we must divide entropy by 8 in order to adjust it to the range 0..1 (or alternatively, we can use logarithm base 256).
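In case a runnable version helps, here is that pseudocode fleshed out in Python (the function name and the sample input are mine, not part of the original pseudocode):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0..8)."""
    if not data:
        return 0.0
    total = len(data)
    entropy = 0.0
    for count in Counter(data).values():  # skips values with count == 0
        p = count / total                 # probability of this byte value
        entropy -= p * math.log2(p)       # log base 2 -> bits
    return entropy

# Divide by 8 (or use log base 256) to land in the 0..1 range:
normalized = shannon_entropy(b"abababab") / 8
```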

Uel answered 13/6, 2009 at 11:5 Comment(11)
One correction: you need to skip the elements with Counts[i] == 0.Delaine
You are right Krivokon, thanks! I see that Wesley did it correctly except that he chose a 'weird' logarithm base.Uel
Yes, it's definitely weird. However, since you're using the more conventional log base 2, you get a value between 0 and 8. You may want to mention this so that the asker can remember to divide the result by 8 to get a value between 0 and 1. (Congrats on the quick answer though - I had to look this stuff up on Wikipedia to remember it. :P)Phile
This is a good method, I used it to analyse the "entropy" of image by comparing the pixel data and it gave good results.Platte
@Nick: can it be applied here: https://mcmap.net/q/246684/-picture-entropy-calculationMarj
@Daniel: you could try entropy testing but I think there is a problem. You should process data of the same size. ie 8-bit, 16-bit, etc. If they have variable bit size then I don't think you'll get meaningful results.Uel
@Nick: I have bytes, so I guess it'll fit nicely.Marj
I ran this method and got an entropy value of 1.3729 for a file. Does entropy have to be within a certain range like probability values? Is the value I got a valid entropy value?Astraddle
@user581544 the range depends on the log function. If you use log2 you can divide the result by 8 to adjust it between 0..1 - see Wesley's answerUel
This estimation of the entropy assumes that the bytes are independent, which in general is wrong. For example, take a grayscale image with a uniform horizontal gradient from white to black.Thinnish
This is some kind of very rough estimate, and will fail for example for file containing abcd...z repeated 1 million times. Compressing with gzip gives much better results. Divide compressed size by uncompressed size and you will get a normalized entropy.Celisse
38

A simpler solution: gzip the file. Use the ratio of file sizes: (size-of-gzipped)/(size-of-original) as measure of randomness (i.e. entropy).

This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare entropy of different sources.

Delaine answered 13/6, 2009 at 10:38 Comment(6)
I also had that idea (as a last option), but I need to analyze a lot of files, so gzipping ALL of them is not an efficient option.Rubric
It depends on how huge your ALL is. I just tried to gzip all the files in /usr/bin, it's about 1000 files, 200Mb. It took about 7 sec. This is the command you can use to get the size: cat * | gzip --fast | wc -c. It's slower than just reading the files byte-by-byte, but not by much.Delaine
gzip has had many man-years of programming effort put into optimization. Might as well take advantage of it.Swamp
This actually can be a better estimation of the entropy than that of the accepted answer - especially if the file is large.Thinnish
I agree this is a better estimation than the accepted answer. In fact, there are several academic papers that use this type of approximation.Leinster
@HugoSFerreira could you point to a specific paper using this? I'd like to cite this method.Tali
33

To calculate the information entropy of a collection of bytes, you'll need to do something similar to tydok's answer. (tydok's answer works on a collection of bits.)

The following variables are assumed to already exist:

  • byte_counts is a 256-element list of the number of bytes with each value in your file. For example, byte_counts[2] is the number of bytes that have the value 2.

  • total is the total number of bytes in your file.

I'll write the following code in Python, but it should be obvious what's going on.

import math

entropy = 0

for count in byte_counts:
    # If no bytes of this value were seen in the file, it doesn't affect
    # the entropy of the file.
    if count == 0:
        continue
    # p is the probability of seeing this byte in the file, as a floating-
    # point number
    p = 1.0 * count / total
    entropy -= p * math.log(p, 256)

There are several things that are important to note.

  • The check for count == 0 is not just an optimization. If count == 0, then p == 0, and log(p) will be undefined ("negative infinity"), causing an error.

  • The 256 in the call to math.log represents the number of discrete values that are possible. A byte composed of eight bits will have 256 possible values.

The resulting value will range from 0 (every single byte in the file is the same) to 1 (the bytes are evenly divided among every possible value of a byte).


An explanation for the use of log base 256

It is true that this algorithm is usually applied using log base 2. This gives the resulting answer in bits. In such a case, you have a maximum of 8 bits of entropy for any given file. Try it yourself: maximize the entropy of the input by making byte_counts a list of all 1 or 2 or 100. When the bytes of a file are evenly distributed, you'll find that there is an entropy of 8 bits.

It is possible to use other logarithm bases. Using b=2 allows a result in bits, as each bit can have 2 values. Using b=10 puts the result in dits, or decimal bits, as there are 10 possible values for each dit. Using b=256 will give the result in bytes, as each byte can have one of 256 discrete values.

Interestingly, using log identities, you can work out how to convert the resulting entropy between units. Any result obtained in units of bits can be converted to units of bytes by dividing by 8. As an interesting, intentional side-effect, this gives the entropy as a value between 0 and 1.

In summary:

  • You can use various units to express entropy
  • Most people express entropy in bits (b=2)
    • For a collection of bytes, this gives a maximum entropy of 8 bits
    • Since the asker wants a result between 0 and 1, divide this result by 8 for a meaningful value
  • The algorithm above calculates entropy in bytes (b=256)
    • This is equivalent to (entropy in bits) / 8
    • This already gives a value between 0 and 1
Phile answered 13/6, 2009 at 13:14 Comment(1)
This is not the entropy; this assumes the bytes are independent. See my comment on Nick's answerThinnish
26

I'm two years late in answering, so please consider this despite only a few up-votes.

Short answer: use my 1st and 3rd bold equations below to get what most people are thinking about when they say "entropy" of a file in bits. Use just 1st equation if you want Shannon's H entropy which is actually entropy/symbol as he stated 13 times in his paper which most people are not aware of. Some online entropy calculators use this one, but Shannon's H is "specific entropy", not "total entropy" which has caused so much confusion. Use 1st and 2nd equation if you want the answer between 0 and 1 which is normalized entropy/symbol (it's not bits/symbol, but a true statistical measure of the "entropic nature" of the data by letting the data choose its own log base instead of arbitrarily assigning 2, e, or 10).

There are 4 types of entropy of files (data) of N symbols long with n unique types of symbols. But keep in mind that by knowing the contents of a file, you know the state it is in and therefore S=0. To be precise, if you have a source that generates a lot of data that you have access to, then you can calculate the expected future entropy/character of that source. If you use the following on a file, it is more accurate to say it is estimating the expected entropy of other files from that source.

  • Shannon (specific) entropy H = -1*sum(count_i / N * log(count_i / N))
    where count_i is the number of times symbol i occurred in N.
    Units are bits/symbol if log is base 2, nats/symbol if natural log.
  • Normalized specific entropy: H / log(n)
    Units are entropy/symbol. Ranges from 0 to 1. 1 means each symbol occurred equally often and near 0 is where all symbols except 1 occurred only once, and the rest of a very long file was the other symbol. The log is in the same base as the H.
  • Absolute entropy S = N * H
    Units are bits if log is base 2, nats if ln().
  • Normalized absolute entropy S = N * H / log(n)
    Unit is "entropy", varies from 0 to N. The log is in the same base as the H.

Although the last one is the truest "entropy", the first one (Shannon entropy H) is what all books call "entropy" without (the needed IMHO) qualification. Most do not clarify (like Shannon did) that it is bits/symbol or entropy per symbol. Calling H "entropy" is speaking too loosely.

For files with equal frequency of each symbol: S = N * H = N. This is the case for most large files of bits. Entropy does not do any compression on the data and is thereby completely ignorant of any patterns, so 000000111111 has the same H and S as 010111101000 (6 1's and 6 0's in both cases).

Like others have said, using a standard compression routine like gzip and dividing before and after will give a better measure of the amount of pre-existing "order" in the file, but that is biased against data that fits the compression scheme better. There's no general purpose perfectly optimized compressor that we can use to define an absolute "order".

Another thing to consider: H changes if you change how you express the data. H will be different if you select different groupings of bits (bits, nibbles, bytes, or hex). So you divide by log(n) where n is the number of unique symbols in the data (2 for binary, 256 for bytes) and H will range from 0 to 1 (this is normalized intensive Shannon entropy in units of entropy per symbol). But technically if only 100 of the 256 types of bytes occur, then n=100, not 256.

H is an "intensive" entropy, i.e. it is per symbol, analogous to specific entropy in physics, which is entropy per kg or per mole. Regular "extensive" entropy of a file, analogous to physics' S, is S=N*H where N is the number of symbols in the file. H would be exactly analogous to a portion of an ideal gas volume. Information entropy can't simply be made exactly equal to physical entropy in a deeper sense because physical entropy allows for "ordered" as well as disordered arrangements: physical entropy comes out more than a completely random entropy (such as a compressed file). For an ideal gas there is an additional 5/2 factor to account for this: S = k * N * (H+5/2) where H = possible quantum states per molecule = (xp)^3/hbar * 2 * sigma^2, where x=width of the box, p=total non-directional momentum in the system (calculated from kinetic energy and mass per molecule), and sigma=0.341 in keeping with the uncertainty principle giving only the number of possible states within 1 std dev.

A little math gives a shorter form of normalized extensive entropy for a file:

S=N * H / log(n) = sum(count_i*log(N/count_i))/log(n)

Units of this are "entropy" (which is not really a unit). It is normalized to be a better universal measure than the "entropy" units of N * H. But it also should not be called "entropy" without clarification because the normal historical convention is to erroneously call H "entropy" (which is contrary to the clarifications made in Shannon's text).
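For concreteness, here is a small Python sketch of all four quantities for a byte string (names mirror the equations; log base 2 is assumed for H, and n counts only the symbols that actually occur, per the note above):

```python
import math
from collections import Counter

def entropies(data: bytes):
    """Return (H, H_norm, S, S_norm) per the four equations above, log base 2."""
    N = len(data)
    if N == 0:
        return 0.0, 0.0, 0.0, 0.0
    counts = Counter(data)
    n = len(counts)  # number of unique symbols that actually occur
    H = -sum(c / N * math.log2(c / N) for c in counts.values())  # bits/symbol
    H_norm = H / math.log2(n) if n > 1 else 0.0  # normalized specific entropy, 0..1
    S = N * H                                    # absolute entropy, in bits
    S_norm = N * H_norm                          # normalized absolute entropy, 0..N
    return H, H_norm, S, S_norm
```

For the answer's example of 6 1's and 6 0's, this gives H = 1 bit/symbol and S = 12 bits regardless of the symbols' order.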

Asclepiadean answered 21/1, 2016 at 12:43 Comment(2)
I want to upvote your answer, but there is some ambiguity you should clear up first: In equation 2 and 4, and where you say "So you divide by log(n) where n is the number of unique symbols in the data", log what of n? Log natural, log2(n)? Generally in mathematics, with no base specified, log(n) means log10(n). Please clarify.Laden
I mentioned in equations 1 and 3 the user selects the base. For equations 2 and 4 it should be that same base (that H was in). I'll add the clarification.Asclepiadean
23

For what it's worth, here's the traditional (bits of entropy) calculation represented in C#:

/// <summary>
/// returns bits of entropy represented in a given string, per 
/// http://en.wikipedia.org/wiki/Entropy_(information_theory) 
/// </summary>
public static double ShannonEntropy(string s)
{
    var map = new Dictionary<char, int>();
    foreach (char c in s)
    {
        if (!map.ContainsKey(c))
            map.Add(c, 1);
        else
            map[c] += 1;
    }

    double result = 0.0;
    int len = s.Length;
    foreach (var item in map)
    {
        var frequency = (double)item.Value / len;
        result -= frequency * (Math.Log(frequency) / Math.Log(2));
    }

    return result;
}
Joachim answered 21/2, 2011 at 22:1 Comment(5)
This is a fantastic answer. To expand on the original question, how would you compute it if the answers were relative rather than absolute? For example, suppose you were looking for geographical entropy; an ad campaign runs nationally, and you capture the geo coordinates of respondents. No two entries are likely to have identical coordinates, but some entropy function should still be able to tell you that there are likely to be a few localized hotspots, or that a blanketed national distribution is going to be more effective.Parkway
Shouldn't there be a check for null values in map? Otherwise, Math.Log(frequency) may return -INF.Jesselyn
(Math.Log(frequency) / Math.Log(2)) == Math.Log(frequency, 2)Reubenreuchlin
for a bytearray: public static double EntropyShannon(byte[] xs) { Dictionary<byte,int> map = new Dictionary<byte,int>(); foreach (byte x in xs) { if (!map.ContainsKey(x)) { map.Add(x,1); } else { map[x]++; } } double res = 0.0f; int len = xs.Length; foreach (KeyValuePair<byte,int> item in map) { double freq = (double)item.Value / len; res -= freq * (Math.Log(freq)/Math.Log(2)); } return res; }Shamefaced
So the way I understand it is that it calculates the information density in bits per byte; which can be at most 8 = -log_2(1/2^8) and is at least 0 = -log_2(1/1). Maybe this should be clarified.Rule
20

Is this something that ent could handle? (Or perhaps it's not available on your platform.)

$ dd if=/dev/urandom of=file bs=1024 count=10
$ ent file
Entropy = 7.983185 bits per byte.
...

As a counter example, here is a file with no entropy.

$ dd if=/dev/zero of=file bs=1024 count=10
$ ent file
Entropy = 0.000000 bits per byte.
...
Logwood answered 13/6, 2009 at 20:58 Comment(2)
Thank you! Good to know this tool. But I need to solve this programmatically and in a platform-independent way, hence my question.Rubric
+1 Thanks for the pointer. This exists at least in Debian: packages.debian.org/wheezy/entJestinejesting
10

There's no such thing as the entropy of a file. In information theory, the entropy is a function of a random variable, not of a fixed data set (well, technically a fixed data set does have an entropy, but that entropy would be 0 — we can regard the data as a random distribution that has only one possible outcome with probability 1).

In order to calculate the entropy, you need a random variable with which to model your file. The entropy will then be the entropy of the distribution of that random variable. This entropy will equal the number of bits of information contained in that random variable.

Pejoration answered 13/6, 2009 at 20:47 Comment(7)
I'm not aware of the theoretical definition of entropy. But, there are always two semantics for every term: the theoretical one and the popular one. Well, seems that the popular part was understood by everyone here ;)Rubric
There are at least two obvious interpretations in the answers of how someone might translate "the entropy of a file" into a strict mathematical definition. If you really want to understand what you're doing, you should understand the statistical manner in which entropy is modeled in these answers.Trabeated
Or you could get into Kolmogorov complexity, which is a better mathematical definition but is uncomputable.Simonnesimonpure
@JamesThompson interesting, any pointers to how you would go about deducing this random variable from a bunch of files you want to measure the entropy of?Bemba
I believe that in this problem the random variable are the bytes that are found in the file by running through it. So it will be a discrete random variable with 256 possible values, and its own distribution that depends on the file. (I know this post is old, but this might clarify anyone who gets here)Bullish
@Bullish Thank you. And I guess to help bridge the gap between the question, your answer, and the theoretical answer, we can note that the file could be considered to be composed of bits (random variables with 2 values) or nibbles or bytes or shorts or ints or utf-8 characters or blocks of size 1024, or indeed as strings the length of the file. So the entropy would vary for each model.Confiteor
Well, seems that the popular part was understood by everyone here ;) WRONG What actually happened here is that most people were either ignorant or ill-informed, and so they guessed and/or they assumed things. This is the real answer. If you know what you're doing well enough that the other answers are useful, you already knew the other answers. They are answers to different questions.Catarinacatarrh
5

If you use information theory entropy, mind that it might make sense not to use it on bytes. Say, if your data consists of floats, you should instead fit a probability distribution to those floats and calculate the entropy of that distribution.

Or, if the contents of the file are unicode characters, you should use those, etc.
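As a rough illustration of "fit a distribution first" for floats, here is a histogram-based sketch in Python (the bin count and function name are my own choices; a real analysis would fit a proper probability model):

```python
import math
from collections import Counter

def histogram_entropy(values, bins=32):
    """Bin the floats into equal-width buckets, then take the Shannon
    entropy (bits) of the bucket distribution -- a crude stand-in for
    fitting a real probability distribution to the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```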

Rockie answered 13/6, 2009 at 11:33 Comment(6)
When I want to do data analysis for any kind of files, byte would be my best choice (as a compromise), I think.Rubric
Of course you can do so. However, you should use any additional information you can get. Otherwise your results can be extremly poor.Rockie
usuallyuseless is absolutely right. The Shannon entropy will not give you enough information about the file contents. Every compressor has 2 stages: modelling and entropy coding. The entropy coding is necessary, but most of the redundancy is detected in the modelling phase (unless you're dealing with quasi-random data).Delaine
usuallyuseless is right here. One way to figure this out is to say in words the full thing that you're calculating: "what is the entropy of the ascii symbols that I'm using to represent my floating point numbers", is a thing you can calculate, but may not be what you're aiming for.Obverse
Java's capability of MIME tests is limited. It trusts the file extension, and only if the extension is either unknown, or not present, MIME type is guessed by looking inside the file. I am referring to URLConnection:getContentType().Rubric
This is commentary and not an answer.Brownson
2

Calculates entropy of any string of unsigned chars of size "length". This is basically a refactoring of the code found at http://rosettacode.org/wiki/Entropy. I use this for a 64 bit IV generator that creates a container of 100000000 IV's with no dupes and an average entropy of 3.9. http://www.quantifiedtechnologies.com/Programming.html

#include <map>
#include <cmath>

typedef unsigned char uint8;

double Calculate(const uint8 *input, int length)
{
    // Using uint8 as the map key avoids signed-char surprises for values >= 128
    std::map<uint8, int> frequencies;
    for (int i = 0; i < length; ++i)
        frequencies[input[i]]++;

    double infocontent = 0;
    for (const auto &p : frequencies)
    {
        // Relative frequency of this byte value in the input
        double freq = static_cast<double>(p.second) / length;
        infocontent -= freq * log2(freq);
    }
    return infocontent;
}
Outstand answered 10/12, 2014 at 21:50 Comment(0)
2

Re: I need the whole thing to make assumptions on the file's contents: (plaintext, markup, compressed or some binary, ...)

As others have pointed out (or been confused/distracted by), I think you're actually talking about metric entropy (entropy divided by length of message). See more at Entropy (information theory) - Wikipedia.

jitter's comment linking to Scanning data for entropy anomalies is very relevant to your underlying goal. That links eventually to libdisorder (C library for measuring byte entropy). That approach would seem to give you lots more information to work with, since it shows how the metric entropy varies in different parts of the file. See e.g. this graph of how the entropy of a block of 256 consecutive bytes from a 4 MB jpg image (y axis) changes for different offsets (x axis). At the beginning and end the entropy is lower than it is part-way in, but it is about 7 bits per byte for most of the file.

Source: https://github.com/cyphunk/entropy_examples. [Note that this and other graphs are available via the novel http://nonwhiteheterosexualmalelicense.org license....]

More interesting is the analysis and similar graphs at Analysing the byte entropy of a FAT formatted disk | GL.IB.LY

Statistics like the max, min, mode, and standard deviation of the metric entropy for the whole file and/or the first and last blocks of it might be very helpful as a signature.

This book also seems relevant: Detection and Recognition of File Masquerading for E-mail and Data Security - Springer

Confiteor answered 11/2, 2016 at 20:5 Comment(0)
0

Here's a Java algo based on this snippet and the invasion that took place during the infinity war

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public static double shannon_entropy(File file) throws IOException {
    byte[] bytes = Files.readAllBytes(file.toPath()); // byte sequence
    int no_bytes = bytes.length;                      // file length
    int[] freq = new int[256];                        // byte frequencies
    for (int j = 0; j < no_bytes; j++) {
        int value = bytes[j] & 0xFF;                  // unsigned value of byte
        freq[value]++;
    }
    double entropy = 0.0;
    for (int i = 0; i < 256; i++) {
        if (freq[i] > 0) {
            double p = 1.0 * freq[i] / no_bytes;
            entropy -= p * Math.log(p) / Math.log(2); // log base 2 -> bits
        }
    }
    return entropy;
}

usage-example:

 File file=new File("C:\\Users\\Somewhere\\In\\The\\Omniverse\\Thanos Invasion.Log");
 int file_length=(int)file.length();
 double shannon_entropy=shannon_entropy(file);
 System.out.println("file length: "+file_length+" bytes");
 System.out.println("shannon entropy: "+shannon_entropy+" bits per byte, i.e. a minimum of "+shannon_entropy+" bits can be used to encode each byte transfer" +
                    "\nfrom the file so that in total we transfer at least "+(file_length*shannon_entropy)+" bits ("+((file_length*shannon_entropy)/8D)+
                    " bytes instead of "+file_length+" bytes).");

output-example:

file length: 5412 bytes

shannon entropy: 4.537883805240875 bits per byte, i.e. a minimum of 4.537883805240875 bits can be used to encode each byte transfer from the file so that in total we transfer at least 24559.027153963616 bits (3069.878394245452 bytes instead of 5412 bytes).

Leolaleoline answered 18/6, 2021 at 8:35 Comment(0)
-1

Without any additional information, the entropy of a file is (by definition) equal to its size*8 bits. The entropy of a text file is roughly size*6.6 bits, given that:

  • each character is equally probable
  • there are 95 printable ASCII characters
  • log(95)/log(2) ≈ 6.6

Entropy of text file in English is estimated to be around 0.6 to 1.3 bits per character (as explained here).

In general you cannot talk about entropy of a given file. Entropy is a property of a set of files.

If you need an entropy (or entropy per byte, to be exact) the best way is to compress it using gzip, bz2, rar or any other strong compression, and then divide compressed size by uncompressed size. It would be a great estimate of entropy.

Calculating entropy byte by byte as Nick Dandoulakis suggested gives a very poor estimate, because it assumes every byte is independent. In text files, for example, it is much more probable to have a lowercase letter after a letter than a whitespace or punctuation after a letter, since words typically are longer than 2 characters. So the probability of the next character being in the a-z range is correlated with the value of the previous character. Don't use Nick's rough estimate for any real data, use the gzip compression ratio instead.
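The difference is easy to see concretely in Python (zlib standing in for gzip; the repetition count is my own choice):

```python
import math
import zlib
from collections import Counter

# "abcd...z" repeated: the byte-frequency model sees 26 equally likely
# symbols and reports log2(26) = ~4.7 bits/byte, as if the data were
# fairly random...
data = b"abcdefghijklmnopqrstuvwxyz" * 40000

N = len(data)
freq_entropy = -sum(c / N * math.log2(c / N) for c in Counter(data).values())

# ...while the compressor spots the repetition and squeezes the file
# down to a tiny fraction of its size.
ratio = len(zlib.compress(data, 9)) / N
```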

Celisse answered 31/12, 2013 at 13:2 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.