Compute entropy of different file extensions to find randomness of data?

I have different file types, including JPEG/JPG, MP3, GIF, MP4, FLV, M4V, EXE, ZIP, etc. The idea is to:

  1. Take the data in blocks, something like a 4K block size, and measure its randomness.

  2. Generate a randomness score between 0 and 1.

  3. Try to assign classes according to the randomness score.

How can we find the entropy of the different types of files mentioned above, and can we scale each file's score to between 0 and 1?
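
A minimal, untested sketch of this pipeline in standard C++: randomness_score() below is only a crude placeholder (fraction of distinct byte values in a block) and is meant to be swapped out for a real metric such as the normalized Shannon entropy discussed in the answer below.

#include <stdio.h>
#include <vector>

double randomness_score(const unsigned char *dat,size_t siz)   // crude placeholder score in <0..1>
    {
    if ((siz==0)||(dat==NULL)) return 0.0;
    bool seen[256]={false};
    int distinct=0;                                             // count distinct byte values in the block
    for (size_t i=0;i<siz;i++)
        if (!seen[dat[i]]) { seen[dat[i]]=true; distinct++; }
    return double(distinct)/256.0;                              // more distinct values ~ more random
    }

int main(int argc,char **argv)
    {
    if (argc<2) { printf("usage: %s file\n",argv[0]); return 1; }
    FILE *f=fopen(argv[1],"rb");
    if (f==NULL) return 1;
    std::vector<unsigned char> block(4096);                     // 1. process data in 4K blocks
    size_t n; unsigned ix=0;
    while ((n=fread(block.data(),1,block.size(),f))>0)
        {
        double score=randomness_score(block.data(),n);          // 2. score in <0..1>
        int cls=int(score*10.0); if (cls>9) cls=9;              // 3. bucket into 10 classes
        printf("block %u : score = %.6lf , class = %d\n",ix,score,cls);
        ix++;
        }
    fclose(f);
    return 0;
    }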

Malign answered 11/11, 2020 at 6:3 Comment(0)

+1, very interesting question. Here are a few untested ideas, just off the top of my head right now:

  1. how about using the correlation coefficient

    Correlate your 4K block with uniformly random data of the same size, and/or apply an FFT first and then correlate ...

  2. or compute statistical properties and infer from them somehow ...

    I know this is a vague description, but it might be worth digging into ...

  3. use compression

    For example, compress your 4K data with Huffman coding and infer your coefficient from the ratio between the uncompressed and compressed sizes, and maybe use a logarithmic scale ...

I think the easiest to implement, with plausible results, will be the 3rd approach, as Huffman coding and entropy are closely related; a rough sketch follows below.
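
A rough, untested sketch of the 3rd idea, using zlib's DEFLATE (which combines Huffman coding with LZ77) as a stand-in for a hand-written Huffman coder; zlib and linking with -lz are assumed. The score is the compressed/uncompressed size ratio clamped to <0..1>:

#include <stdio.h>
#include <vector>
#include <zlib.h>

double compression_score(const unsigned char *dat,size_t siz)
    {
    if ((siz==0)||(dat==NULL)) return 0.0;
    uLongf dstsiz=compressBound(siz);               // worst-case compressed size
    std::vector<unsigned char> dst(dstsiz);
    if (compress2(dst.data(),&dstsiz,dat,siz,Z_BEST_COMPRESSION)!=Z_OK) return 0.0;
    double score=double(dstsiz)/double(siz);        // compressed / uncompressed ratio
    if (score>1.0) score=1.0;                       // incompressible data can expand slightly
    return score;                                   // ~0 = very regular , ~1 = close to random
    }

int main()
    {
    unsigned char txt[]="AAAAAAAbbbbbccccddddeeeff";
    printf("score = %.6lf\n",compression_score(txt,sizeof(txt)-1));
    return 0;
    }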

[edit1] using Shannon information entropy

Your suggestion is even better than Huffman encoding (even though the two are closely related). The Shannon information entropy H with a binary (base-2) logarithm, H = -Σ p(i)·log2 p(i) where p(i) is the relative frequency of byte value i, returns the average number of bits per word needed to represent your data (after Huffman encoding). So to get a score in <0..1> from there, just divide by the number of bits per word ...

Here is a small C++/VCL example of entropy computation on BYTEs:

//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
char txt0[]="text Text bla bla bla ..."; 
char txt1[]="abcdefghijklmnopqrstuvwxy";
char txt2[]="AAAAAAAbbbbbccccddddeeeff";
//---------------------------------------------------------------------------
double entropy(BYTE *dat,int siz)   // Shannon entropy (bits per BYTE) of dat[0..siz-1]
    {
    if ((siz<=0)||(dat==NULL)) return 0.0;
    int i; double H=0.0,P[256],dp=1.0/siz,base=1.0/log(2.0);
    for (i=0;i<256;i++) P[i]=0.0;       // clear histogram
    for (i=0;i<siz;i++) P[dat[i]]+=dp;  // relative frequency of each BYTE value
    for (i=0;i<256;i++)
        {
        if (P[i]==0.0) continue;    // skip byte values not present in the data
        if (P[i]==1.0) return 0.0;  // only a single byte value present -> zero entropy
        H-=P[i]*log(P[i])*base;     // Shannon entropy (binary bits base)
        }
    return H;
    }
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
    {
    mm_log->Lines->Clear();
    // note: entropy() expects BYTE*, hence the casts; sizeof includes the terminating zero byte
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt0,entropy((BYTE*)txt0,sizeof(txt0)),entropy((BYTE*)txt0,sizeof(txt0))/8.0));
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt1,entropy((BYTE*)txt1,sizeof(txt1)),entropy((BYTE*)txt1,sizeof(txt1))/8.0));
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt2,entropy((BYTE*)txt2,sizeof(txt2)),entropy((BYTE*)txt2,sizeof(txt2))/8.0));
    }
//-------------------------------------------------------------------------

And results:

txt = "text Text bla bla bla ..." , H = 3.185667 , H/8 = 0.398208
txt = "abcdefghijklmnopqrstuvwxy" , H = 4.700440 , H/8 = 0.587555
txt = "AAAAAAAbbbbbccccddddeeeff" , H = 2.622901 , H/8 = 0.327863
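
And, tying this back to the question, an untested sketch of how the same entropy() function could be applied per 4K block of a file (plain C file I/O instead of VCL, requires #include <stdio.h> in addition to the includes above), normalizing by 8 bits per BYTE to get a <0..1> score:

//---------------------------------------------------------------------------
void score_file(const char *name)
    {
    FILE *f=fopen(name,"rb");
    if (f==NULL) return;
    BYTE buf[4096];                                     // 4K blocks as in the question
    size_t n; unsigned ix=0;
    while ((n=fread(buf,1,sizeof(buf),f))>0)
        {
        double score=entropy(buf,(int)n)/8.0;           // H is bits per BYTE -> divide by 8 for <0..1>
        printf("block %u : score = %.6lf\n",ix,score);  // or classify the block here instead
        ix++;
        }
    fclose(f);
    }
//---------------------------------------------------------------------------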
Topazolite answered 11/11, 2020 at 11:17 Comment(5)
What about using Shannon entropy to generate the randomness score? – Malign
@AbhaySingh I completely forgot about that one. Yes, it's a very good starting point; I added [edit1] ... – Topazolite
For this, we first have to convert the file to a byte array, and then we can use this formula? – Malign
Sorry, I mean using the byte array to count the frequencies of each byte value. – Malign
@AbhaySingh Yes, so just 256 counters for the histogram ... if you are processing BYTEs ... however, you can also process floats, doubles, WORDs, DWORDs ... I think you should try all of them and choose the smallest score ... however, that would be slow or require very big arrays. – Topazolite
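
For completeness, an untested sketch of the 16-bit WORD variant mentioned in the last comment: the histogram grows to 65536 counters and the divisor for a <0..1> score becomes 16 bits:

#include <math.h>
#include <vector>

double entropy16(const unsigned short *dat,int siz)     // siz = number of 16-bit WORDs
    {
    if ((siz<=0)||(dat==NULL)) return 0.0;
    std::vector<double> P(65536,0.0);                   // histogram too big for the stack
    double H=0.0,dp=1.0/siz,base=1.0/log(2.0);
    for (int i=0;i<siz;i++) P[dat[i]]+=dp;              // relative frequency of each WORD value
    for (int i=0;i<65536;i++)
        if (P[i]>0.0) H-=P[i]*log(P[i])*base;           // Shannon entropy in bits per WORD
    return H;                                           // divide by 16 for a <0..1> score
    }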
