Estimating max payload size on a compressed list of integers

I have 1 million rows in an application. It makes a request to a server such as the following:

/search?q=hello

And the search returns a sorted list of integers, representing the rows that have been matched in the input dataset (which the user has in their browser). How would I go about estimating the maximum size of the returned payload? For example, to start we have:

# ~7 MB if we stored "all results" uncompressed
6888887

# ~3.5 MB if we stored "all results" relative to 0 or ALL matches (cuts it roughly in half)
3444443

And then we would want to compress these integers using some sort of compression scheme (Elias-Fano?). What would be the "worst-case" scenario for the size of 1M sorted integers? And how would that calculation be made?

The application has one million rows of data, so let's say R1 --> R1000000, or, if zero-indexing, range(int(1e6)). The server will respond with something like [1,2,3], indicating that (only) rows 1, 2, and 3 were matched.
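
For reference, here's a quick sketch that reproduces the ~7 MB figure by totaling the decimal digits plus one separator per number (roughly what a JSON dump of the full list produces):

#include <cstdio>
#include <string>

// Approximate size of all row numbers written as comma-separated
// decimal text -- the naive "all results" representation above.
int main()
{
    size_t total = 0;
    for (int i = 1; i <= 1000000; ++i) {
        total += std::to_string(i).size() + 1;   // digits + separator
    }
    std::printf("%zu bytes\n", total);           // ~6.9 MB
}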

Octahedral answered 14/9, 2019 at 3:42 Comment(8)
What's the nature of the row number identifiers? Just numbers between 0 and N-1 (where N is the number of records) with none missing? Or any 32bit (or whatever other fixed size) integers (say after a lot of modifications of the DB)? | (How did you arrive at the ~7MB?)Gate
Since you didn't respond, I looked at the edit history and it answered my questions. The numbers are in range [0-999999] (although you seem to count from 1, but maybe that's a mistake? not much of a difference anyway). | You seem to have chosen to use ascii representation of the numbers in decimal format with comma separators. As your measurement shows, that's a rather inefficient approach, as you need on average 6.89 bytes to represent one number. Even a naive binary array of 32bit integers would need 4 bytes per entry. But you only need 20 bits to represent numbers under 1000000 (2.5 bytes)Gate
However, the relevant information we need to store is only whether each of the numbers in this fixed range is present or not. That means one bit per entry, so 1 million bits (125000 bytes). This can be the worst-case size. | You could do some entropy coding on the resulting bitstream to reduce the size further. | For cases with 50000 matches or fewer you could switch to just encoding the matches (using fixed width 20bits per integer becomes smaller than the bitmap then). | To compress further, sort in reverse and use range reduction to code only the significant bits.Gate
@DanMašek thanks for these comments. Yes, for a very crude start I just did range(1000000) and json.dumps() in python. I was using one-indexed row numbers (like excel) instead of normal array indexes. Would you want to write an answer on how you'd suggest encoding and compressing this? I'll add a bounty to the question as well!Octahedral
This seems a bit broad. It also doesn't seem like C.Hustle
@Octahedral Sure, but it will take a bit of time to make a good enough answer to my standards (and any proof of concept code will be C++). I've thought about it some more and made some trials, and the bitmap with an entropy coder that can handle fractional bits (range, arithmetic, fse -- one that can get close to entropy) is hard to beat for any number of matches. Huffman with grouped symbols could get close as well. I'll try to cover some other options. However, it might be good enough to pass the bitmap through some existing lightweight compressor like zlib/snappy/lz4...Gate
@DanMašek perfect, yes C or C++ would be great!Octahedral
@DanMašek any update?Octahedral

Input specification:

  • Row numbers between 0 and 999999 (if you need 1-indexing, you can apply an offset)
  • Each row number appears only once
  • The numbers are sorted in ascending order (useful, we'd want to sort them anyway)

A great idea you had was to invert the meaning of the result when the number of matches exceeds half the possible values. Let's retain that, and assume we are given a flag and a list of matches/misses.


Your initial attempt encoded the numbers as text with comma separation. That means that for 90% of the possible values you need 6 characters + 1 separator -- 7 bytes on average. However, since the maximum value is 999999, you really only need 20 bits to encode each entry.

Hence, the first idea for reducing the size is to use binary encoding.


Binary Encoding

The simplest approach is to write the number of values sent, followed by a stream of 32-bit integers.

A more efficient approach is to pack two 20-bit values into each 5 bytes written. In case of an odd count, you would just pad the 4 excess bits with zeros.

Those approaches may be good for small numbers of matches (or misses). However, the important thing to note is that for each row we only need to track 1 bit of information -- whether it's present or not. That means we can encode the results as a bitmap of 1,000,000 bits.

Combining those two approaches, we can use a bitmap when there are many matches or misses, and switch to binary coding when it's more efficient.
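
A minimal sketch of the 20-bit packing, assuming all values are below 2^20 (illustrative only):

#include <cstdint>
#include <vector>

// Pack two 20-bit values into each 5 bytes; an odd count leaves 4 zero
// bits of padding in the last group.
std::vector<uint8_t> pack_20bit(const std::vector<uint32_t>& values)
{
    std::vector<uint8_t> out;
    for (size_t i = 0; i < values.size(); i += 2) {
        uint64_t pair = uint64_t(values[i]) << 20;          // first value in high bits
        if (i + 1 < values.size()) pair |= values[i + 1];   // second value in low bits
        for (int shift = 32; shift >= 0; shift -= 8) {      // 40 bits -> 5 bytes
            out.push_back(uint8_t((pair >> shift) & 0xFF));
        }
    }
    return out;
}

At 2.5 bytes per value, the packed list beats the 125,000-byte bitmap only below 50,000 matches (or misses); above that, the bitmap wins.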


Range Reduction

The next potential improvement when coding sorted sequences of integers is range reduction.

The idea is to code the values from largest to smallest, reducing the number of bits per value as the values get smaller (a sketch follows the list below).

  • First, we encode the number of bits N necessary to represent the first value.
  • We encode the first value using N bits
  • For each following value
    • Encode the value using N bits
    • If the value requires fewer bits to encode, reduce N appropriately
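
A minimal sketch of this scheme (illustrative only; a trivial bit buffer stands in for a real bit stream, and the 20-bit count header is my assumption, not part of the description above):

#include <algorithm>
#include <cstdint>
#include <vector>

// Trivial bit sink; a real implementation would pack into bytes.
struct bit_writer {
    std::vector<bool> bits;
    void put(uint32_t value, unsigned width) {             // MSB first
        for (unsigned i = width; i-- > 0;) bits.push_back((value >> i) & 1);
    }
};

unsigned bits_needed(uint32_t v)                           // bit width, minimum 1
{
    unsigned n = 1;
    while (v >>= 1) ++n;
    return n;
}

void encode_range_reduced(std::vector<uint32_t> values, bit_writer& out)
{
    std::sort(values.rbegin(), values.rend());             // largest first
    out.put(static_cast<uint32_t>(values.size()), 20);     // count (assumed header)
    if (values.empty()) return;

    unsigned n = bits_needed(values.front());
    out.put(n, 5);                                         // widths up to 20 fit in 5 bits
    for (uint32_t v : values) {
        out.put(v, n);                                     // code value in current width
        n = bits_needed(v);                                // next value is smaller, so it fits
    }
}

The decoder mirrors this exactly: after reading each value it recomputes the width the same way, so no extra signaling is needed.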

Entropy Coding

Let's go back to the bitmap encoding. Based on Shannon entropy, the worst case is when we have 50% matches: the further the probabilities are skewed, the fewer bits we need on average to code each entry.

Matches | Bits
--------+-----------
0       | 0
1       | 22
2       | 41
3       | 60
4       | 78
5       | 96
10      | 181
100     | 1474
1000    | 11408
10000   | 80794
100000  | 468996
250000  | 811279
500000  | 1000000 
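
The table entries follow directly from the binary entropy function; here's a small sketch that reproduces them:

#include <cmath>
#include <cstdio>

// A bitmap with k ones out of n bits needs about n * H(k/n) bits,
// where H is the binary (Shannon) entropy function.
double bitmap_entropy_bits(double k, double n)
{
    if (k <= 0 || k >= n) return 0.0;            // degenerate: all 0s or all 1s
    double p = k / n;
    return n * (-p * std::log2(p) - (1 - p) * std::log2(1 - p));
}

int main()
{
    for (double k : { 1.0, 10.0, 1000.0, 100000.0, 500000.0 }) {
        std::printf("%8.0f | %.0f\n", k, std::ceil(bitmap_entropy_bits(k, 1e6)));
    }
}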

To do this, we need an entropy coder that can code fractional bits -- an arithmetic or range coder, or one of the newer ANS-based coders such as FSE. Alternatively, we could group symbols together and use Huffman coding.


Prototypes and Measurements

I've written a test using a 32-bit implementation of FastAC by Amir Said, which limits the model to 4 decimal places. (This is not really a problem, since we shouldn't be feeding such data to the codec directly. This is just a demonstration.)

First some common code:

typedef std::vector<uint8_t> match_symbols_t;
typedef std::vector<uint32_t> match_list_t;
typedef std::set<uint32_t> match_set_t;
typedef std::vector<uint8_t> buffer_t;
// ----------------------------------------------------------------------------
static uint32_t const NUM_VALUES(1000000);
// ============================================================================
size_t symbol_count(uint8_t bits)
{
    // Number of byte-sized symbols needed when packing `bits` bitmap
    // entries into each byte (rounded up).
    size_t count(NUM_VALUES / bits);
    if (NUM_VALUES % bits > 0) {
        return count + 1;
    }
    return count;
}
// ----------------------------------------------------------------------------
void set_symbol(match_symbols_t& symbols, uint8_t bits, uint32_t match, bool state)
{
    size_t index(match / bits);
    size_t offset(match % bits);
    if (state) {
        symbols[index] |= 1 << offset;
    } else {
        symbols[index] &= ~(1 << offset);
    }
}
// ----------------------------------------------------------------------------
bool get_symbol(match_symbols_t const& symbols, uint8_t bits, uint32_t match)
{
    size_t index(match / bits);
    size_t offset(match % bits);
    return (symbols[index] & (1 << offset)) != 0;
}
// ----------------------------------------------------------------------------
match_symbols_t make_symbols(match_list_t const& matches, uint8_t bits)
{
    assert((bits > 0) && (bits <= 8));

    match_symbols_t symbols(symbol_count(bits), 0);
    for (auto match : matches) {
        set_symbol(symbols, bits, match, true);
    }

    return symbols;
}
// ----------------------------------------------------------------------------
match_list_t make_matches(match_symbols_t const& symbols, uint8_t bits)
{
    match_list_t result;
    for (uint32_t i(0); i < NUM_VALUES; ++i) {
        if (get_symbol(symbols, bits, i)) {
            result.push_back(i);
        }
    }
    return result;
}

The first, simpler variant writes the number of matches, determines the probability of match/miss, and clamps it to the supported range. Each bitmap entry is then encoded using this static probability model.

class arithmetic_codec_v1
{
public:
    buffer_t compress(match_list_t const& matches)
    {
        uint32_t match_count(static_cast<uint32_t>(matches.size()));

        arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
        codec.start_encoder();

        // Store the number of matches (1000000 needs only 20 bits)
        codec.put_bits(match_count, 20);

        if (match_count > 0) {
            // Initialize the model
            static_bit_model model;
            model.set_probability_0(get_probability_0(match_count));

            // Create a bitmap and code all the bitmap entries
            // NB: This is lazy and inefficient, but simple
            match_symbols_t symbols = make_symbols(matches, 1);
            for (auto entry : symbols) {
                codec.encode(entry, model);
            }
        }

        uint32_t compressed_size = codec.stop_encoder();
        return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
    }

    match_list_t decompress(buffer_t& compressed)
    {
        arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
        codec.start_decoder();

        // Read number of matches (20 bits)
        uint32_t match_count(codec.get_bits(20));

        match_list_t result;
        if (match_count > 0) {
            static_bit_model model;
            model.set_probability_0(get_probability_0(match_count));

            result.reserve(match_count);
            for (uint32_t i(0); i < NUM_VALUES; ++i) {
                uint32_t entry = codec.decode(model);
                if (entry == 1) {
                    result.push_back(i);
                }
            }
        }

        codec.stop_decoder();
        return result;
    }

private:
    double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
    {
        double probability_0(double(num_values - match_count) / num_values);
        // Limit probability to match FastAC limitations...
        return std::max(0.0001, std::min(0.9999, probability_0));
    }
};

The second approach is to adapt the model based on the symbols we code: after each match is encoded, reduce the probability of the next match. Once all matches are coded, stop.

The second variation compresses slightly better, but at a noticeable performance cost.

class arithmetic_codec_v2
{
public:
    buffer_t compress(match_list_t const& matches)
    {
        uint32_t match_count(static_cast<uint32_t>(matches.size()));
        uint32_t total_count(NUM_VALUES);

        arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
        codec.start_encoder();

        // Store the number of matches (1000000 needs only 20 bits)
        codec.put_bits(match_count, 20);

        if (match_count > 0) {
            static_bit_model model;

            // Create a bitmap and code all the bitmap entries
            // NB: This is lazy and inefficient, but simple
            match_symbols_t symbols = make_symbols(matches, 1);
            for (auto entry : symbols) {
                model.set_probability_0(get_probability_0(match_count, total_count));
                codec.encode(entry, model);
                --total_count;
                if (entry) {
                    --match_count;
                }
                if (match_count == 0) {
                    break;
                }
            }
        }

        uint32_t compressed_size = codec.stop_encoder();
        return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
    }

    match_list_t decompress(buffer_t& compressed)
    {
        arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
        codec.start_decoder();

        // Read number of matches (20 bits)
        uint32_t match_count(codec.get_bits(20));
        uint32_t total_count(NUM_VALUES);

        match_list_t result;
        if (match_count > 0) {
            static_bit_model model;
            result.reserve(match_count);
            for (uint32_t i(0); i < NUM_VALUES; ++i) {
                model.set_probability_0(get_probability_0(match_count, NUM_VALUES - i));
                if (codec.decode(model) == 1) {
                    result.push_back(i);
                    --match_count;
                }
                if (match_count == 0) {
                    break;
                }
            }
        }

        codec.stop_decoder();
        return result;
    }

private:
    double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
    {
        double probability_0(double(num_values - match_count) / num_values);
        // Limit probability to match FastAC limitations...
        return std::max(0.0001, std::min(0.9999, probability_0));
    }
};


Practical Approach

Practically, it's probably not worth designing a new compression format. In fact, it might not even be worth writing the results as bits; just make an array of bytes with values 0 or 1. Then use an existing compression library -- zlib is very common, or you could try lz4, snappy, bzip2, lzma... the choices are plentiful.

ZLib Example

class zlib_codec
{
public:
    zlib_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}

    buffer_t compress(match_list_t const& matches)
    {
        match_symbols_t symbols(make_symbols(matches, bits_per_symbol));

        z_stream defstream;
        defstream.zalloc = nullptr;
        defstream.zfree = nullptr;
        defstream.opaque = nullptr;

        deflateInit(&defstream, Z_BEST_COMPRESSION);
        size_t max_compress_size = deflateBound(&defstream, static_cast<uLong>(symbols.size()));

        buffer_t compressed(max_compress_size);

        defstream.avail_in = static_cast<uInt>(symbols.size());
        defstream.next_in = &symbols[0];
        defstream.avail_out = static_cast<uInt>(max_compress_size);
        defstream.next_out = &compressed[0];

        deflate(&defstream, Z_FINISH);
        deflateEnd(&defstream);

        compressed.resize(defstream.total_out);
        return compressed;
    }

    match_list_t decompress(buffer_t& compressed)
    {
        z_stream infstream;
        infstream.zalloc = nullptr;
        infstream.zfree = nullptr;
        infstream.opaque = nullptr;

        inflateInit(&infstream);

        match_symbols_t symbols(symbol_count(bits_per_symbol));

        infstream.avail_in = static_cast<uInt>(compressed.size());
        infstream.next_in = &compressed[0];
        infstream.avail_out = static_cast<uInt>(symbols.size());
        infstream.next_out = &symbols[0];

        inflate(&infstream, Z_FINISH);
        inflateEnd(&infstream);

        return make_matches(symbols, bits_per_symbol);
    }
private:
    uint32_t bits_per_symbol;
};

BZip2 Example

class bzip2_codec
{
public:
    bzip2_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}

    buffer_t compress(match_list_t const& matches)
    {
        match_symbols_t symbols(make_symbols(matches, bits_per_symbol));

        uint32_t compressed_size = static_cast<uint32_t>(symbols.size() * 2);
        buffer_t compressed(compressed_size);

        int err = BZ2_bzBuffToBuffCompress((char*)&compressed[0]
            , &compressed_size
            , (char*)&symbols[0]
            , symbols.size()
            , 9
            , 0
            , 30);
        if (err != BZ_OK) {
            throw std::runtime_error("Compression error.");
        }

        compressed.resize(compressed_size);
        return compressed;
    }

    match_list_t decompress(buffer_t& compressed)
    {
        match_symbols_t symbols(symbol_count(bits_per_symbol));

        uint32_t decompressed_size = static_cast<uint32_t>(symbols.size());
        int err = BZ2_bzBuffToBuffDecompress((char*)&symbols[0]
            , &decompressed_size
            , (char*)&compressed[0]
            , compressed.size()
            , 0
            , 0);
        if (err != BZ_OK) {
            throw std::runtime_error("Compression error.");
        }
        if (decompressed_size != symbols.size()) {
            throw std::runtime_error("Size mismatch.");
        }

        return make_matches(symbols, bits_per_symbol);
    }
private:
    uint32_t bits_per_symbol;
};
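
A minimal round-trip sketch using the classes above (a hypothetical driver, not the measurement code from the repository below):

#include <cassert>

int main()
{
    match_list_t matches{ 1, 2, 3, 500000, 999999 };

    zlib_codec codec(8);                         // pack 8 bitmap entries per byte
    buffer_t compressed = codec.compress(matches);
    match_list_t restored = codec.decompress(compressed);
    assert(restored == matches);
    return 0;
}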


Comparison

(comparison chart omitted)

The code repository, including dependencies for 64-bit Visual Studio 2015, is at https://github.com/dan-masek/bounded_sorted_list_compression.git

Obovate answered 24/9, 2019 at 1:34 Comment(2)
wow, what a great answer, this is fantastic -- hopefully others will find it useful as well!Octahedral
@Octahedral Thanks :) One more thing came to mind (based on a little coding competition we did almost a decade ago...) -- using a bitmap is a nice O(N) method to sort a list of unique bounded integers. If you're going to generate the bitmap before encoding, you don't even need the input sorted. | It might be useful to make some demonstration of coding the 20-bit integers or range reduction and doing some measurements on that in raw or compressed with general compressor. Might muck about with that when i have some time.Gate

There are 2^(10^6) different sorted (duplicate-free) lists of integers < 10^6. Mapping each such list, say [0, 4, ...], to the corresponding bit array (say 10001...) yields 10^6 bits, i.e. 125 kB of information. As each bit array corresponds to a unique possible sorted list, and vice versa, this is the most compact (in the sense of: having the smallest maximum size) representation.

Of course, if some results are more probable than others, there may be more efficient (in the sense of: having a smaller average size) representations. For example, if most result sets are small, a simple run-length encoding may generally yield smaller encodings.

Inevitably, in that case the maximal size of the encoding (the max payload size you were asking about) will be more than 125 kB.
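
For sparse result sets, one simple way to realize the run-length idea is gap (delta) coding: the gaps between consecutive matches closely track the zero-run lengths of the bit array. A minimal sketch:

#include <cstdint>
#include <vector>

// Convert a sorted match list to gaps between consecutive matches.
// Small gaps then compress well under any variable-length integer code.
std::vector<uint32_t> to_gaps(const std::vector<uint32_t>& sorted_matches)
{
    std::vector<uint32_t> gaps;
    gaps.reserve(sorted_matches.size());
    uint32_t prev = 0;
    for (uint32_t v : sorted_matches) {
        gaps.push_back(v - prev);    // first entry is the value itself
        prev = v;
    }
    return gaps;
}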

Compressing the above-mentioned 125 kB bit array with e.g. zlib will yield an acceptably compact encoding for small result sets. Moreover, zlib has a function deflateBound() that, given the uncompressed size, will calculate the max payload size (which, in your case, will definitely be larger than 125 kB, but not by much).
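
A sketch of querying that bound with zlib (assuming best-compression settings; the exact number depends on the parameters):

#include <cstdio>
#include <zlib.h>

// Worst-case compressed size zlib can produce for a 125,000-byte input.
int main()
{
    z_stream strm{};
    deflateInit(&strm, Z_BEST_COMPRESSION);
    uLong bound = deflateBound(&strm, 125000);   // slightly more than 125,000
    deflateEnd(&strm);
    std::printf("max zlib payload: %lu bytes\n", static_cast<unsigned long>(bound));
    return 0;
}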

Newcomb answered 22/9, 2019 at 7:19 Comment(0)

Storing a compressed list of sorted integers is extremely common in data retrieval and database applications, and a variety of techniques have been developed.

I'm pretty sure that an unguessably random selection of about half of the items in your list is going to be your worst case.

Many popular integer-list-compression techniques, such as Roaring bitmaps, fall back to using (with such worst-case input data) a 1-bit-per-index bitmap.

So in your case, with 1 million rows, the maximum-size payload returned would be (in the worst case) a header with the "using a bitmap" flag set, followed by a bitmap of 1 million bits (125,000 bytes), where, for example, the 700th bit of the bitmap is set to 1 if the 700th row in the database is a match, or set to 0 if it is not. (Thanks, Dan Mašek!)
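
A sketch of building such a worst-case payload (the one-byte header and flag value here are illustrative, not part of any real format):

#include <cstdint>
#include <vector>

// Header flag plus a 1,000,000-bit (125,000-byte) bitmap: 125,001 bytes total.
std::vector<uint8_t> worst_case_payload(const std::vector<uint32_t>& matched_rows)
{
    std::vector<uint8_t> payload(1 + 125000, 0);
    payload[0] = 0x01;                           // hypothetical "using a bitmap" flag
    for (uint32_t row : matched_rows) {          // 0-based row indices
        payload[1 + row / 8] |= uint8_t(1u << (row % 8));
    }
    return payload;
}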

My understanding is that, while quasi-succinct Elias-Fano compression and other techniques are very useful for compressing many "naturally-occurring" sets of sorted integers, for this worst-case data set, none of them give better compression, and most of them give far worse "compression", than a simple bitmap.

(This is analogous to the way most general-purpose data compression algorithms, such as DEFLATE, when fed "worst-case" data such as indistinguishable-from-random encrypted data, create "compressed" files with a few bytes of overhead with the "stored/raw/literal" flag set, followed by a simple copy of the uncompressed file).

Angelika answered 24/9, 2019 at 5:45 Comment(0)
