algorithm: gigantic number of very sparse bit arrays, which encoding to use

I've got a special need and the most important concerns are:

  • in-memory
  • very low memory footprint
  • speed

Here's my "problem": I need to store, in-memory, a huge number of very sparse bit arrays. Those bitsets are "append only" and are to be used mostly for intersections. By huge, I mean as many as 200 000 bit arrays.

The range is [0...16 000 000] for each bitset.

I ran a pre-test with "only" 10 673 bit arrays containing some actual data I have, and got the following results:

  1% of the bit arrays (  106 bit arrays) Hamming weight: at most     1 bit  set
  5% of the bit arrays (  534 bit arrays) Hamming weight: at most     4 bits set
 10% of the bit arrays ( 1068 bit arrays) Hamming weight: at most     8 bits set
 15% of the bit arrays ( 1603 bit arrays) Hamming weight: at most    12 bits set
 20% of the bit arrays ( 2137 bit arrays) Hamming weight: at most    17 bits set
 25% of the bit arrays ( 2671 bit arrays) Hamming weight: at most    22 bits set
 30% of the bit arrays ( 3206 bit arrays) Hamming weight: at most    28 bits set
 35% of the bit arrays ( 3740 bit arrays) Hamming weight: at most    35 bits set
 40% of the bit arrays ( 4274 bit arrays) Hamming weight: at most    44 bits set
 45% of the bit arrays ( 4809 bit arrays) Hamming weight: at most    55 bits set
 50% of the bit arrays ( 5343 bit arrays) Hamming weight: at most    67 bits set
 55% of the bit arrays ( 5877 bit arrays) Hamming weight: at most    83 bits set
 60% of the bit arrays ( 6412 bit arrays) Hamming weight: at most   103 bits set
 65% of the bit arrays ( 6946 bit arrays) Hamming weight: at most   128 bits set
 70% of the bit arrays ( 7480 bit arrays) Hamming weight: at most   161 bits set
 75% of the bit arrays ( 8015 bit arrays) Hamming weight: at most   206 bits set
 80% of the bit arrays ( 8549 bit arrays) Hamming weight: at most   275 bits set
 85% of the bit arrays ( 9083 bit arrays) Hamming weight: at most   395 bits set
 90% of the bit arrays ( 9618 bit arrays) Hamming weight: at most   640 bits set
 95% of the bit arrays (10152 bit arrays) Hamming weight: at most  1453 bits set
 96% of the bit arrays (10259 bit arrays) Hamming weight: at most  1843 bits set
 97% of the bit arrays (10366 bit arrays) Hamming weight: at most  2601 bits set
 98% of the bit arrays (10473 bit arrays) Hamming weight: at most  3544 bits set
 99% of the bit arrays (10580 bit arrays) Hamming weight: at most  4992 bits set
100% of the bit arrays (10687 bit arrays) Hamming weight: at most 53153 bits set

Given the numbers involved, I obviously need to use compressed bit arrays, and that is not an issue: it should stay easy to deal with, since the bit arrays are "append only".

The bits that are on tend to be grouped, but not completely. So you'll have several bits on in the same area (though usually not one right after another, which makes RLE not that great for the bits that are on).

My question is what kind of compression to use?

Now I don't know if I should put my first approach here or in an answer to my own question.

Basically I imagined a "worst case" scenario using a very dumb encoding:

  • 1 bit: if on, the following 5 bits tell how many bits are needed to encode the 'skip'; if off (optimization), the following 5 bits tell how many bits are to be taken literally (that is, 'on' or 'off', without skipping) [this would only be switched to when determined to be more efficient than the other representation, so when it kicks in, it is always a size optimization]

  • 5 bits: how many bits are needed to encode the skip (i.e. the number of bits to skip before the next bit that is on)

  • x bits: the skip itself

Here's an example: a bit array has 3 bits set, the first bit being at 3 098 137, the second at 3 098 141 and the third at 3 098 143.

                               +-- now we won't skip
                               |
                               |     +-- 6 literal bits follow (covering 3 098 138 to 3 098 143)
                               |     |    +--- 3 098 141 is on
  22    3 098 137              |     6    | +- 3 098 143 is on
1 10110 1011110100011000011001 0 00110 000101 etc.

Reading the stream: the first bit is on, telling us we're going to skip bits. The next 5 bits (always 5) tell how many bits are needed to express the skip: 22. The next 22 bits tell us to skip to 3 098 137. Then one bit off tells us we're no longer skipping. The next 5 bits (always 5) tell how many bits we'll read "as is": 6. Then 6 bits: off, off, off, on, off, on, meaning 3 098 141 and 3 098 143 are on. And so on.

Given the amazing sparsity of these bit arrays, this seems quite size-efficient.
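To make the scheme concrete, here is a rough, untested Java sketch of the append path. It only implements the 'skip' branch (the literal-run optimization is left out), and all names are purely illustrative:

import java.util.BitSet;

// Rough sketch of the "skip" record: [1 flag bit][5 bits: width][width bits: the skip]. Untested.
final class SparseAppendOnlyBits {
    private final BitSet out = new BitSet(); // the compressed stream
    private int outLen = 0;                  // number of bits written so far
    private int nextUncoded = 0;             // first uncompressed position not yet covered

    // Append a set bit at absolute position 'pos' (positions must be strictly increasing).
    void appendSetBit(int pos) {
        int skip = pos - nextUncoded;                                      // zeros before this set bit
        int width = Math.max(1, 32 - Integer.numberOfLeadingZeros(skip));  // bits needed for 'skip'
        writeBit(true);          // flag: skip record
        writeBits(width, 5);     // how many bits the skip value needs
        writeBits(skip, width);  // the skip value itself; the set bit right after it is implied
        nextUncoded = pos + 1;
    }

    private void writeBit(boolean b) { out.set(outLen++, b); }

    private void writeBits(int value, int nBits) {  // most-significant bit first
        for (int i = nBits - 1; i >= 0; i--) writeBit(((value >>> i) & 1) != 0);
    }

    int compressedSizeInBits() { return outLen; }
}

The literal-run branch would be chosen whenever encoding the next few bits verbatim is shorter than emitting one skip record per set bit.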

So, using that encoding, I took my sample data and computed a "worst case" scenario (I haven't written the algorithm yet; I'd rather get a few inputs from here first): basically I assumed that the "size optimization" would never kick in and, also, that the skip would always need the maximum number of bits (24 bits), which of course cannot happen.

I did it just to have a very crude approximation of what the "worst of the worst" case could be.

I was very pleasantly surprised:

Worst case scenario: 

108 913 290 bits needed for the 10 687 very sparse bit arrays
12.9 MB (13 295 KB)

The data being actual data, and all of it being similar, I know that, if worst comes to worst, I could store my 200 000 bit arrays in about 240 MB, which is fine.

I'm pretty sure the actual encoding will come to way less than that, but as I haven't actually written it yet, I can only (very easily) compute the "worst case", which is why I only show that one.

Any hints / ideas as to how to make this more size-efficient (remembering these are super-sparse bit arrays, that there will be hundreds of thousands of them, that they must be in memory, and that they will be "append only")?

About my 'append-only' case

Basically I've got one growing "expanse" (the range; "expanse" is the actual term as I understand it) and a lot of bit arrays that have a few bits set. When the range goes from, say, 0 to 1 000 000, all the bit arrays go from 0 to 1 000 000 too. When the range grows to 1 000 001, then all the bit arrays grow too, all by one bit. But most of these bit arrays will have a '0' appended at their end, while about 4 to 8 of the bit arrays will have a '1' appended at their end. However, I cannot predict in advance which of the bit arrays will have a 0 or a 1 appended.

So I've got a lot of bit arrays that all have the same size, that are all very sparse (< 0.5% of their bits set) and that are all "growing" as the range grows (so they're all always growing at the same rate).
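To make the access pattern concrete, the only operations I really need look roughly like this (a hypothetical interface; names are just for illustration):

// Hypothetical interface, only meant to illustrate the append-only access pattern.
interface GrowingSparseBitArray {
    void appendSetBit(int position); // 'position' is always >= the current end of the range
    long intersectionCardinality(GrowingSparseBitArray other);
}

// When the shared range grows from 'range' to 'range + 1', only the 4 to 8 arrays that
// get a '1' are touched; for each of them: array.appendSetBit(range);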


Judy arrays are great. But I read about them a few years ago and that stuff was "above my head". Judy arrays are a C-only 20KLOC lib and I'm definitely not re-implementing that. But they're amazing.

So I guess I need to add that I'd like all this to stay relatively simple, which is not that far-fetched given the special "append only" property of my very sparse bit arrays.

Greenlaw answered 1/11, 2010 at 23:26 Comment(5)
Note that comments about re-inventing the wheel can be sent to /dev/null: if only for the math/challenge behind it I want to implement this myself. And anyway I'd be very surprised to find a wheel that can deal with 200 000 "append-only" bit arrays in memory :) But if you've got one, the mechanics behind it interest me a lot :)Greenlaw
There's a theoretical limit on coding density: with an array of N elements, n of which are set, the minimum number of bits to encode it would be -n*log2(n/N)-(N-n)*log2(1-n/N). For your array in which 53153 of 16M are set this would be 514 kbits, and for 4992 bits set, 65 kbits. And the closer your memory use is to this limit, the more complex an encoding you have to choose.Breland
@Vovanium, I think you left out some necessary context for your theoretical limit (like, some kind of statistical assumption about the distribution of bits being set?)Evidently
I assumed a uniform bit distribution (i.e. every 1 has a constant probability p = n/N). The exact limit for n bits set out of N is log2[C(N,n)], which is just the number of bits in the number of combinations, and is slightly lower. But for large N this formula is hard to compute.Breland
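For reference, the bound above is easy to check numerically; a quick Java sketch, assuming N = 16 000 000 as in the question:

// Quick numeric check of the bound -n*log2(n/N) - (N-n)*log2(1-n/N) from the comment above.
public class EntropyBound {
    static double bitsLowerBound(long n, long N) {
        double p = (double) n / N;
        return -n * log2(p) - (N - n) * log2(1 - p);
    }
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    public static void main(String[] args) {
        long N = 16_000_000L;
        System.out.printf("n = 53153: %.0f bits (~514 kbits)%n", bitsLowerBound(53_153, N));
        System.out.printf("n = 4992 : %.0f bits (~65 kbits)%n", bitsLowerBound(4_992, N));
    }
}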
"succinct data structures" would be a relevant keyword for anyone interested in this questionEventuality

You didn't say what programming language you want to use. It sounds like you don't want Judy because it's "C-only"... if you are using C# then you could use my Compact Patricia Trie instead. It is almost 4500 LOC (commented) and uses similar ideas to Judy, but the size and speed of each trie are not ideal due to limitations of .NET. It is not optimized for computing intersections either, but such an algorithm could be added. The article about CP Tries does not emphasize this point, but it can store sets (sparse bit arrays) much more compactly than dictionaries (the graphs in the article show the size and speed of dictionaries, not sets).

The best case is a dense cluster of bits. With 50% occupancy (every other bit set), it requires less than 8 bits per key (less than 4 bits per integer). (correction: less than 8 bits, not more.)

If you only need an approximate representation of the data, use a Bloom filter.

By the way, what do you mean by "append only"? Does it mean that you only add keys, or that each key you add is greater than the keys you added before?

Update: Since you are only adding larger keys, you should probably design a special algorithm just for your case. IMO, when designing a custom algorithm, you should make it as simple as possible. So here's my idea, which assumes the keys of different bitsets are uncorrelated (therefore there is no benefit of attempting to compress data between different bitsets):

A bitset is represented by a sorted array of 32-bit slots. Because it's sorted, you can use binary search to find keys. Each slot consists of a 24-bit "prefix" and 8 bits of "flags". Each slot represents a region of 8 keys. The "flags" tell you which of the 8 keys in the region are present in the bitset, and the "prefix" tells you which region we're talking about, by specifying bits 3 to 26 of the key. For example, if the following bits are "1" in the bitset:

1, 3, 4, 1094, 8001, 8002, 8007, 8009

...then the bitset is represented by an array of 4 slots (16 bytes):

Prefix:     0,  136, 1000, 1001
 Flags:  0x1A, 0x40, 0x86, 0x02

The first slot represents 1, 3, 4 (notice that bits 1, 3 and 4 are set in the number 0x1A); the second slot represents 1094 (136 * 8 + 6); the third slot represents 8001, 8002, and 8007; the fourth slot represents 8009. Does this make sense?

I don't know if this is as compact as your idea. But I think you'll get faster queries and faster modifications, and it will be fairly easy to implement.
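Here is a minimal sketch of that slot layout in Java, assuming keys only ever arrive in increasing order (your "append only" case) and stay below 2^24, so a whole slot fits comfortably in one non-negative int; all names are just illustrative:

import java.util.Arrays;

// Sparse bitset as a sorted array of slots: upper bits = key >>> 3 ("prefix"), low 8 bits = flags.
final class SlotBitSet {
    private int[] slots = new int[4]; // each int packs (prefix << 8) | flags
    private int count = 0;

    // Add a key; assumes keys arrive in non-decreasing order (the append-only case).
    void add(int key) {
        int prefix = key >>> 3;
        int flag = 1 << (key & 7);
        if (count > 0 && (slots[count - 1] >>> 8) == prefix) {
            slots[count - 1] |= flag;                          // same region: set another flag bit
        } else {
            if (count == slots.length) slots = Arrays.copyOf(slots, count * 2);
            slots[count++] = (prefix << 8) | flag;             // new region, appended at the end
        }
    }

    boolean contains(int key) {
        int target = key >>> 3;
        int lo = 0, hi = count - 1;
        while (lo <= hi) {                                     // binary search on the prefix
            int mid = (lo + hi) >>> 1;
            int prefix = slots[mid] >>> 8;
            if (prefix < target) lo = mid + 1;
            else if (prefix > target) hi = mid - 1;
            else return (slots[mid] & (1 << (key & 7))) != 0;
        }
        return false;
    }
}

Intersecting two such bitsets is then a linear merge of the two sorted slot arrays, AND-ing the flag bytes of slots whose prefixes match.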

Cwm answered 2/11, 2010 at 16:18 Comment(4)
+1, nice answer. Don't know much about Patricia Trie yet (besides the name that I already heard), will read. Yup, by "append only" I mean that as the "expanse" (the range) grows, some of the bit arrays (typically 4 to 8) will have a bit set at the end of the bit array. So I never "insert" any bit in the middle of a bit array. So it's really a special case that, I think, makes things much easier.Greenlaw
I guess that by "append only" I mean both that I only add keys and that the key is also always greater than the key I added before.Greenlaw
I wish I could give more than +1, your article looks excellent, so does your C# implementation of "CPT". Actually the language I'm after is probably Java, but I may need an easy way to port this to both C# and Objective-C... so I'd rather have something relatively easy. But your Compact Patricia Trie looks amazing. Once again my case is very special: most of my bit arrays have not even 0.5% of their bits set, so it is really super sparse.Greenlaw
can't use Bloom filter btw, need exact representation of the data.Greenlaw

You may use a binary tree for the bit array. Say you have an array with range [M..N]. Store it in the following manner:

Choose some number encoding for [0...ram size], like a Fibonacci, Golomb or Rice code (you may choose the most suitable representation after profiling your program with actual data).

  1. If the array is empty (no bits set), store it as the number 0.
  2. If the array is full (all bits set), store it as the number 1.
  3. Else split it into two parts: A in [M..(M+N)/2-1] and B in [(M+N)/2..N].
  4. Generate the representations P0 and P1 of A and B using this algorithm recursively.
  5. Get the length of P0 (in bits, or in whatever unit the length is a whole number of) and store it as a number (you may need to add 1 if the length may be 1, e.g. if you store 0 as a single bit 0).
  6. Store P0, then P1.

In this case, if the limits are the same, the intersection and union operations are trivial recursions (a code sketch follows the list below):

Intersection:

  1. If array A is empty, store 0.
  2. If array A is full, store a copy of B.
  3. Else split the arrays, intersect both halves, store the length of the first half, then both halves.
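A minimal in-memory sketch of that recursion in Java. It works on a node tree rather than on the serialized bit stream, and leaves out the number coding and length prefixes, but the shape of the intersection is the same (all names are illustrative):

// Range-tree node: EMPTY, FULL, or a split into two halves covering equal sub-ranges.
abstract class RangeNode {
    static final RangeNode EMPTY = new RangeNode() {};
    static final RangeNode FULL  = new RangeNode() {};

    static final class Split extends RangeNode {
        final RangeNode left, right;
        Split(RangeNode left, RangeNode right) { this.left = left; this.right = right; }
    }

    // Intersection is a structural recursion over both trees (assumes identical limits).
    static RangeNode intersect(RangeNode a, RangeNode b) {
        if (a == EMPTY || b == EMPTY) return EMPTY;
        if (a == FULL) return b;
        if (b == FULL) return a;
        Split sa = (Split) a, sb = (Split) b;
        RangeNode l = intersect(sa.left, sb.left);
        RangeNode r = intersect(sa.right, sb.right);
        if (l == EMPTY && r == EMPTY) return EMPTY;   // collapse trivial results back
        if (l == FULL && r == FULL) return FULL;
        return new Split(l, r);
    }
}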

This algorithm can work on bits (if you need the result to be most compact) or on bytes/words (if bit operations are too slow).

Also, you may add special encodings for arrays with a single bit set, or for all arrays smaller than some limit (8 elements, for example), to decrease the depth of recursion.

The drawback is that, without some hacks, adding/removing an element to/from the array is a complex operation (as complex as the intersection/union operations).

For example, an array with only bit 0xAB set, over a range of 0..0xFF, would be stored as (pseudocode):

0  1      2   3  4      5  6  7      8  9  10     11 12     13    14     15     16
1, EMPTY, 13, 1, EMPTY, 9, 1, EMPTY, 5, 1, EMPTY, 1, EMPTY, FULL, EMPTY, EMPTY, EMPTY
                                                   |  AA  | AB  |
                                         |A8..A9| AA    ..   AB |
                                      | A8         ..        AB |AC..AF|
                            |A0..A7| A8             ..              AF |
                         | A0                 ..                    AF |B0..BF|
               |80..9F| A0                        ..                       BF |
            | 80                                 ..                        BF |C0..FF|
 | 0..7F| 80                                          ..                          FF |       

EMPTY and FULL are codes for empty and full arrays; the numbers are lengths in elements (they should be replaced with actual lengths in bytes, bits or so).

If you do not need a fast single-bit check, you may use the simplest approach: just store the distances between set bits using a code: Fibonacci, Rice, Golomb, Levenshtein, Elias, etc., or invent another one. Note that in order to get minimal code length, you should use a code whose code lengths are as close as possible to -log2(p), where p is the probability of that codeword. You may use a Huffman code for that.

For example, using the Elias gamma code, an array like this:

0 1 0000 1 1 000 1 0 1 000000000000000000 1 000000000000000000 
2     5   1   4    2         19                   18           (distance)

would be encoded as:

010 00101 1 00100 010 000010011 000010010
 2    5   1   4    2     19        18     (distance code explained)
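A small Java sketch of that gap + Elias gamma idea (gaps are the distances between consecutive set bits, counted as in the example above; untested, names are illustrative):

import java.util.BitSet;

// Encodes the gaps between set bits with Elias gamma codes: (len - 1) zeros, then 'gap' in binary.
final class GammaGapEncoder {
    private final BitSet out = new BitSet();
    private int outLen = 0;
    private int lastPos = -1; // position of the previous set bit (so the first gap is pos + 1)

    // Append a set bit at absolute position 'pos' (positions must be strictly increasing).
    void appendSetBit(int pos) {
        int gap = pos - lastPos;                                  // always >= 1
        lastPos = pos;
        int len = 32 - Integer.numberOfLeadingZeros(gap);         // number of bits in 'gap'
        for (int i = 0; i < len - 1; i++) writeBit(false);        // len - 1 leading zeros
        for (int i = len - 1; i >= 0; i--) writeBit(((gap >>> i) & 1) != 0); // gap, MSB first
    }

    private void writeBit(boolean b) { out.set(outLen++, b); }

    int sizeInBits() { return outLen; }
}

Decoding is symmetric: count the leading zeros to learn the code length, then read that many bits to recover the gap.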

The most compact option for an array with a uniform bit distribution would be arithmetic coding, but it is very CPU-time-consuming, because you'll have to read and write such arrays bit by bit, with no fast skipping available.

Breland answered 2/11, 2010 at 15:35 Comment(2)
+1, great answer too. I don't know yet which route I'll go but this sure gives food for thoughts :)Greenlaw
Thanks. I'd also recommend looking at how various sound compression algorithms (MP2, AAC, etc.) are made. They deal with sparse arrays (like 0, 0, 0, 1, 0, -1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0) when compressing high-frequency spectra.Breland

You may look into compressed bitmaps. A common strategy is to use word-aligned run-length encoding.

C++ implementation:

https://github.com/lemire/EWAHBoolArray

Java implementation:

https://github.com/lemire/javaewah

Reference:

Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010. http://arxiv.org/abs/0901.3751
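If you go the Java route, the library is straightforward to try; a hedged usage sketch (class and method names are from my recollection of javaewah, so check the project README for the exact current API):

import com.googlecode.javaewah.EWAHCompressedBitmap;

public class EwahExample {
    public static void main(String[] args) {
        // Bits must be set in increasing order, which matches the "append only" constraint.
        EWAHCompressedBitmap a = EWAHCompressedBitmap.bitmapOf(3_098_137, 3_098_141, 3_098_143);
        EWAHCompressedBitmap b = EWAHCompressedBitmap.bitmapOf(3_098_141, 15_999_999);

        EWAHCompressedBitmap intersection = a.and(b); // word-aligned AND, stays compressed
        System.out.println(intersection);             // prints the positions that are set in both
        System.out.println(a.sizeInBytes() + " bytes compressed");
    }
}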

Downpipe answered 4/11, 2010 at 12:44 Comment(0)

Even if they aren't exactly what you're looking for, it's worth checking out Judy trees. Judy is a heavily optimized library for ordered maps, and one configuration is specifically designed as a bitset rather than a map. I don't think intersection is one of the operations natively optimized for, though...

The general idea is to use a tree with a fixed number of address bits per level, and take advantage of the sparseness at each level. This results in quite good compression even in the worst case, and fast query performance as well. I believe an intersection operation would be relatively straightforward and potentially very fast.
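Not Judy itself, but here is a toy Java illustration of the principle: a fixed number of address bits per level (8 here), with children allocated lazily so that untouched sub-ranges cost nothing:

// Toy 3-level radix tree over keys in [0, 2^24): 8 bits per level, lazily allocated.
final class RadixBitSet {
    private final long[][][] top = new long[256][][]; // top level: 256 lazily-created children

    void set(int key) {
        int a = (key >>> 16) & 0xFF, b = (key >>> 8) & 0xFF, c = key & 0xFF;
        long[][] mid = top[a];
        if (mid == null) top[a] = mid = new long[256][];
        long[] leaf = mid[b];
        if (leaf == null) mid[b] = leaf = new long[4];  // 256-bit leaf bitmap (4 x 64 bits)
        leaf[c >>> 6] |= 1L << (c & 63);
    }

    boolean get(int key) {
        long[][] mid = top[(key >>> 16) & 0xFF];
        if (mid == null) return false;
        long[] leaf = mid[(key >>> 8) & 0xFF];
        if (leaf == null) return false;
        int c = key & 0xFF;
        return (leaf[c >>> 6] & (1L << (c & 63))) != 0;
    }
}

An intersection would walk both trees in lockstep, descending only where both sides have a child, and AND the 64-bit leaf words at the bottom.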

At any rate, it's always a good idea to steal from the best!

Evidently answered 1/11, 2010 at 23:52 Comment(4)
yup Judy arrays are great but honestly the math behind it is a bit too complicated for me :) And AFAICT it's only available as a 20KLOC C-written lib :-/ I'm definitely re-inventing that wheel :)Greenlaw
Damn, I meant, I'm definitely not re-inventing that wheel :) Obviously :)Greenlaw
No need to reinvent their wheel, but the basic principle seems like just the kind of thing you're looking for: highly sparse, and readily adaptable to writing a fast intersection function.Evidently
I know, I know, but... the Judy implementation is a 20 000 line codebase. It is really one of the most difficult-to-implement data structures ever written :)Greenlaw

Considering you are going to do a bunch of intersection tests anyway, maybe you should try storing all of the bitvectors in parallel. One sparse, 16M entry list. Each entry in that list contains a list of which of the 200k input bitvectors has a '1' at that location. It looks like you expect to have only about 5 bits set per input vector, or 1M total entries? Taking a straw-man linked list implementation for the toplevel and the buckets, and a worst case of no intersections at all (thus 1M buckets with 1 element each) you could store it all in 32MB.
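A rough sketch of that "parallel" layout (hypothetical names; the point is just the shape of the data, position -> ids of the bitvectors that have a '1' there):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Inverted layout: for each position, the ids of the bitvectors that have a '1' there.
final class InvertedBitStore {
    private final Map<Integer, List<Integer>> byPosition = new TreeMap<>();

    void set(int bitvectorId, int position) {
        byPosition.computeIfAbsent(position, p -> new ArrayList<>()).add(bitvectorId);
    }

    // Which bitvectors have a '1' at this position? (empty list if none)
    List<Integer> idsAt(int position) {
        return byPosition.getOrDefault(position, List.of());
    }
}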

Farmann answered 2/11, 2010 at 0:19 Comment(5)
no no, the list I posted shows it, for example: "50% of the bitvectors will have [between 55 and] 67 bits set". There will be much much more than 1M total entries. With 200K bitvectors I'd say there'd be, very grossly, a total of 100 million bits set.Greenlaw
I didn't look at it this way, but now that you mention doing it "the other way", it's guaranteed that every single position of the "expanse" (the 16 million range) will be used a few times. The way you phrased it, each entry in the 16M list would have about 4 to 8 bits set.Greenlaw
Aha, I thought that was a total, thus 55k/10k = 5, my mistake. So, no reason to make the 16M array sparse, each entry needs room for about 8 18-bit (2^18 > 200k arrays) identifiers, so 288MB. Similar to your estimate.Farmann
another problem is that I need an easy way to find, for example, "all the bits that are on for bit array number 190 834". I don't know how I could do this quickly if I had to parse the 16M entry list.Greenlaw
Kinda similar to the worst case I got. But I'm pretty sure it's going to be quite a bit lower once I implement it :) Because I think that switching between RLE (skip 'x' bits) and read-x-bits-as-is will work great on my dataset (to be seen, but hey). Also I'm pretty sure I won't often need 24 bits to store the 'skip' (and obviously as I progress into the data, fewer and fewer bits will be needed for the 'skip'), so I really took a worst-case-near-impossible scenario :)Greenlaw

You might be interested in Binary Decision Diagrams (BDDs), and more precisely Zero-suppressed Binary Decision Diagrams (ZBDDs).

They are used to represent sets in a compressed way. Unlike other compressed forms, operations (such as set intersections, or insertions of elements - your "append only" thing?) work directly on the compressed form.

Lofty answered 2/11, 2010 at 16:58 Comment(1)
I edited my question a bit to clarify the "append only" thing. Basically the bit arrays are ever-growing (up to a max of 16 000 000 bits) and I only ever modify the end, so it's kinda easy to work directly on the compressed form.Greenlaw
