"uncompressable" data sequence

Asked 7/2, 2012 at 22:55 Answered 12/6, 2020 at 14:0

I would like to generate an "uncompressable" data sequence of X MBytes with an algorithm. I want this in order to write a program that measures network speed over a VPN connection (avoiding the VPN's built-in compression).

Can anybody help me? Thanks!

PS. I need an algorithm. I have used a file compressed to the point that it cannot be compressed any further, but now I need to generate the data sequence from scratch programmatically.

Bellabelladonna answered 7/2, 2012 at 22:55 Comment(2)

A random sequence of bytes is incompressible. So get a good random source and pull out whatever data size you need – Belligerent 7/2, 2012 at 23:8

Are you targeting a specific compression algorithm? Compression algorithms generally have a finite frame size within which they compress. E.g. the reference gzip implementation maxes out at 32KB, so you can repeat the same 32KB of random data to generate an arbitrarily large uncompressable stream. – Liturgics 10/8, 2012 at 23:30
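
The window-size point in the comment above can be checked with a rough sketch. This assumes Python's standard `zlib` module (DEFLATE, 32 KiB window) as the short-window compressor and `lzma` as a large-dictionary one; `repeated_stream` is a made-up helper name. It repeats one random block, once with a block shorter than the window and once with a block longer than it:

```python
import lzma
import os
import zlib


def repeated_stream(block_size: int, total_size: int) -> bytes:
    """Repeat one random block until total_size bytes are produced."""
    block = os.urandom(block_size)
    repeats = -(-total_size // block_size)       # ceiling division
    return (block * repeats)[:total_size]


if __name__ == "__main__":
    total = 1024 * 1024
    for block_size in (16 * 1024, 64 * 1024):
        data = repeated_stream(block_size, total)
        print(block_size,
              len(zlib.compress(data, 9)),       # 32 KiB window
              len(lzma.compress(data)))          # much larger dictionary
```

With zlib, the repetition should only be found while the block fits inside the window. A compressor with a larger dictionary still sees it, so this trick is only safe when the target algorithm and its window size are known.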

White noise data is truly random and thus incompressible.

Therefore, you should find an algorithm that generates it (or an approximation).

Try this in Linux:

# dd if=/dev/urandom bs=1024 count=10000 2>/dev/null | bzip2 -9 -c -v > /dev/null
(stdin): 0.996:1, 8.035 bits/byte, -0.44% saved, 10240000 in, 10285383 out.

You might try any kind of random number generation though...
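
The same check can be scripted. This is a minimal sketch, assuming Python's `os.urandom` as the random source and the standard `bz2` module in place of the bzip2 command; the function name `bzip2_ratio` is made up for illustration:

```python
import bz2
import os


def bzip2_ratio(size_bytes: int) -> float:
    """Return compressed_size / original_size for fresh random bytes."""
    data = os.urandom(size_bytes)                # bytes from the OS random source
    compressed = bz2.compress(data, compresslevel=9)
    return len(compressed) / len(data)


if __name__ == "__main__":
    # 10000 blocks of 1024 bytes, the same amount as the dd command above
    ratio = bzip2_ratio(1024 * 10000)
    print(f"compressed/original = {ratio:.4f}")  # above 1.0 means the data grew
```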

Fibrinolysin answered 7/2, 2012 at 23:8 Comment(1)

Just for clarity. The above shows that you can generate a chunk of data that is incompressible; compressing it actually makes it bigger as evidenced by in and out... – Fibrinolysin 8/2, 2012 at 13:8

One simple approach to creating statistically hard-to-compress data is just to use a random number generator. If you need it to be repeatable, fix the seed. Any reasonably good random number generator will do. Ironically, the result is incredibly compressible if you know the random number generator: the only information present is the seed. However, it will defeat any real compression method.
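
A small sketch of the fixed-seed idea, assuming Python's built-in Mersenne Twister (`random.Random`) counts as a "reasonably good" generator for this purpose; `seeded_bytes` and the default seed are hypothetical:

```python
import random
import zlib


def seeded_bytes(size_bytes: int, seed: int = 12345) -> bytes:
    """Return size_bytes pseudo-random bytes; the same seed gives the same bytes."""
    rng = random.Random(seed)                    # private generator with a fixed seed
    # getrandbits(k) returns an integer with k random bits
    return rng.getrandbits(8 * size_bytes).to_bytes(size_bytes, "little")


if __name__ == "__main__":
    data = seeded_bytes(1_000_000)
    packed = zlib.compress(data, 9)
    print(len(data), len(packed))                # packed should not be smaller
```

Because the seed is fixed, both ends of a speed test could generate the identical stream independently and compare what was received, without shipping a reference file around.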

Unrobe answered 7/2, 2012 at 23:7 Comment(0)

Other answers have pointed out that random noise is incompressible, and good encryption functions have output that is as close as possible to random noise (unless you know the decryption key). So a good approach could be to just use random number generators or encryption algorithms to generate your incompressible data.

Genuinely incompressible (by any compression algorithm) bitstrings exist (for certain formal definitions of "incompressible"), but even recognising them is computationally undecidable, let alone generating them.

It's worth pointing out though that "random data" is only incompressible in that there is no compression algorithm that can achieve a compression ratio of better than 1:1 on average over all possible random data. However, for any particular randomly generated string, there may be a particular compression algorithm that does achieve a good compression ratio. After all, any compressible string should be a possible output from a random generator, including stupid things like all zeroes, however unlikely.

So while the possibility of getting "compressible" data out of a random number generator or an encryption algorithm is probably vanishingly small, I would want to actually test the data before I use it. If you have access to the compression algorithm(s) used in the VPN connection that would be best; just randomly generate data until you get something that won't compress. Otherwise, just running it through a few common compression tools and checking that the size doesn't decrease would probably be sufficient.
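
The "generate, then test" loop described above might look like the sketch below. It assumes Python's standard `zlib`, `bz2` and `lzma` modules as stand-ins for "a few common compression tools"; if the VPN's actual algorithm is known, that would be the better check. The function names are made up for illustration:

```python
import bz2
import lzma
import os
import zlib

# Stand-ins for common compression tools (an assumption of this sketch)
COMPRESSORS = {
    "zlib": lambda d: zlib.compress(d, 9),
    "bz2": lambda d: bz2.compress(d, 9),
    "lzma": lambda d: lzma.compress(d),
}


def resists_compression(data: bytes) -> bool:
    """True if none of the compressors makes the data smaller."""
    return all(len(fn(data)) >= len(data) for fn in COMPRESSORS.values())


def generate_checked(size_bytes: int, max_tries: int = 10) -> bytes:
    """Draw random buffers until one passes the check above."""
    for _ in range(max_tries):
        candidate = os.urandom(size_bytes)
        if resists_compression(candidate):
            return candidate
    raise RuntimeError("no incompressible buffer found; check the random source")
```

In practice the first candidate should almost always pass, so the retry limit mainly guards against a broken random source.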

Dariadarian answered 7/2, 2012 at 23:27 Comment(0)

You have a couple of options:

1. Use a decent pseudo-random number generator
2. Use an encryption function like AES (implementations found everywhere)

Algo

1. Come up with whatever key you want. All zeroes is fine.
2. Create an empty block
3. Encrypt the block using the key
4. Output the block
5. If you need more data, goto 3

If done correctly, the data stream you generate will be computationally indistinguishable from random noise.
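
A rough sketch of the loop above: start from an empty block, transform it with the key, output it, and feed it back in. Note the substitution: the Python standard library ships no AES, so keyed BLAKE2b from `hashlib` stands in for the encryption step here. The names `keystream` and `generate` are hypothetical:

```python
import hashlib
import zlib
from typing import Iterator


def keystream(key: bytes, block_size: int = 64) -> Iterator[bytes]:
    """Yield an endless series of pseudo-random blocks.

    Keyed BLAKE2b replaces AES in this sketch (an assumption, not the
    cipher named in the answer).
    """
    block = bytes(block_size)                    # the "empty" all-zero block
    while True:
        # "Encrypt" the previous block by passing it through the keyed function
        block = hashlib.blake2b(block, key=key, digest_size=block_size).digest()
        yield block                              # output it, then loop for more


def generate(size_bytes: int, key: bytes = bytes(16)) -> bytes:
    """Collect blocks until size_bytes are available, then trim."""
    out = bytearray()
    for block in keystream(key):
        out += block
        if len(out) >= size_bytes:
            return bytes(out[:size_bytes])


if __name__ == "__main__":
    data = generate(1_000_000)                   # all-zero key, as suggested above
    print(len(data), len(zlib.compress(data, 9)))
```

With a real AES library the structure would stay the same; only the line that transforms the block changes.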

Bor answered 7/2, 2012 at 23:8 Comment(2)

Extra idea: To test your algorithm (whatever you choose): - Let it run and generate about 100MB or so. - Try compressing it with zip, rar, etc. – Bor 11/2, 2012 at 0:0

This was the idea for my answer. Hardware-accelerated AES (AES-NI) is very fast, but of course we can do better if the goal is just incompressibility. – Pollination 28/2, 2013 at 2:22

The following program (C/POSIX) produces incompressible data quickly; it should be in the gigabytes-per-second range. I'm sure it's possible to use the general idea to make it even faster (maybe using DJB's ChaCha core with SIMD?).

/* public domain, 2013 */

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

#define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
static void salsa_scrambler(uint32_t out[16], uint32_t x[16])
{
    int i;
    /* This is a quickly mutilated Salsa20 of only 1 round */
    x[ 4] ^= R(x[ 0] + x[12],  7);
    x[ 8] ^= R(x[ 4] + x[ 0],  9);
    x[12] ^= R(x[ 8] + x[ 4], 13);
    x[ 0] ^= R(x[12] + x[ 8], 18);
    x[ 9] ^= R(x[ 5] + x[ 1],  7);
    x[13] ^= R(x[ 9] + x[ 5],  9);
    x[ 1] ^= R(x[13] + x[ 9], 13);
    x[ 5] ^= R(x[ 1] + x[13], 18);
    x[14] ^= R(x[10] + x[ 6],  7);
    x[ 2] ^= R(x[14] + x[10],  9);
    x[ 6] ^= R(x[ 2] + x[14], 13);
    x[10] ^= R(x[ 6] + x[ 2], 18);
    x[ 3] ^= R(x[15] + x[11],  7);
    x[ 7] ^= R(x[ 3] + x[15],  9);
    x[11] ^= R(x[ 7] + x[ 3], 13);
    x[15] ^= R(x[11] + x[ 7], 18);
    for (i = 0; i < 16; ++i)
        out[i] = x[i];
}

#define CHUNK 2048

int main(void)
{
    uint32_t bufA[CHUNK];
    uint32_t bufB[CHUNK];
    uint32_t *input = bufA, *output = bufB;
    int i;

    /* Initialize seed */
    srand(time(NULL));
    for (i = 0; i < CHUNK; i++)
        input[i] = rand();

    while (1) {
        for (i = 0; i < CHUNK/16; i++) {
            salsa_scrambler(output + 16*i, input + 16*i);
        }
        /* Stop on a write error, e.g. when the reader closes the pipe */
        if (write(1, output, sizeof(bufA)) < 0)
            return 1;

        {
            uint32_t *tmp = output;
            output = input;
            input = tmp;
        }
    }
    return 0;
}

Pollination answered 28/2, 2013 at 2:7 Comment(0)

A very simple solution is to generate a random string and then compress it. An already compressed file is incompressible.

Milliliter answered 29/4, 2015 at 19:24 Comment(2)

Down voter: This approach has been used in a project. What's wrong with it? – Milliliter 21/7, 2016 at 14:43

Compressing a string doesn't mean it cannot be compressed further. Some compression methods use multiple algorithms after each other. – Elseelset 3/10, 2019 at 10:33

For copy-paste lovers, here is some C# code to generate files with (almost) uncompressable content. The heart of the code is the MD5 hashing algorithm, but any cryptographic hash algorithm (good random distribution in the final result) does the job (SHA1, SHA256, etc.).

It uses the bytes of the file number (a 32-bit little-endian signed integer on my machine) as the hash function's initial input, then rehashes and concatenates the output until the desired file size is reached. So the file content is deterministic (the same number always generates the same output) yet looks like randomly distributed "junk" to the compression algorithm under test.

    using System;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    class Program {
    static void Main( string [ ] args ) {

        GenerateUncompressableTestFiles(
            outputDirectory  : Path.GetFullPath( "." ),
            fileNameTemplate : "test-file-{0}.dat", 
            fileCount        : 10,
            fileSizeAsBytes  : 16 * 1024
        );

        byte[] bytes = GetIncompressibleBuffer( 16 * 1024 );

    }//Main

    static void GenerateUncompressableTestFiles( string outputDirectory, string  fileNameTemplate, int fileCount, int fileSizeAsBytes ) {

       using ( var md5 = MD5.Create() ) {

          for ( int number = 1; number <= fileCount; number++ ) {

              using ( var content = new MemoryStream() ) {

                    var inputBytes = BitConverter.GetBytes( number );

                    // Chain hashes: each 16-byte MD5 digest becomes the next input
                    while ( content.Length < fileSizeAsBytes ) {

                        var hashBytes = md5.ComputeHash( inputBytes );
                        content.Write( hashBytes );
                        inputBytes = hashBytes;

                    }//while

                    // Write the file once, trimmed to the requested size
                    var file = Path.Combine( outputDirectory, String.Format( fileNameTemplate, number ) );
                    File.WriteAllBytes( file, content.ToArray().Take( fileSizeAsBytes ).ToArray() );

               }//using

            }//for

       }//using

    }//GenerateUncompressableTestFiles

    public static byte[] GetIncompressibleBuffer( int size, int seed = 0 ) { 

       using ( var md5 = MD5.Create() ) {

            using ( var content = new MemoryStream() ) {

                var inputBytes = BitConverter.GetBytes( seed );

                while ( content.Length <= size ) {

                    var hashBytes = md5.ComputeHash( inputBytes );
                    content.Write( hashBytes );
                    inputBytes = hashBytes;

                    if ( content.Length >= size ) {
                        return content.ToArray().Take( size ).ToArray();
                    }

                }//while

            }//using

        }//using

        return Array.Empty<byte>();

    }//GetIncompressibleBuffer 


    }//class

Suppress answered 12/6, 2020 at 14:0 Comment(0)


I just created a (very simple and not optimized) C# console application that creates uncompressable files. It scans a folder for text files (extension .txt) and, for each one, creates a binary file (extension .bin) with the same name and size. Hope this helps someone. Here is the C# code:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var files = Directory.EnumerateFiles(@"d:\MyPath\To\TextFile\", "*.txt");
            var random = new Random();
            foreach (var fileName in files)
            {
                var fileInfo = new FileInfo(fileName);
                // Same directory and base name, with the extension changed to .bin
                var newFileName = Path.ChangeExtension(fileName, ".bin");
                using (var f = File.Create(newFileName))
                {
                    long bytesWritten = 0;
                    while (bytesWritten < fileInfo.Length)
                    {
                        // Keep only the low 8 bits of the random integer
                        f.WriteByte((byte)random.Next());
                        bytesWritten++;
                    }
                    // No explicit Close(): the using block disposes the stream
                }
                }
            }
        }
    }
}

Rae answered 2/10, 2013 at 8:55 Comment(0)
