I've got a special need and the most important concerns are:
- in-memory
- very low memory footprint
- speed
Here's my "problem": I need to store, in-memory, a huge number of very sparse bit arrays. Those bitsets are "append only" and are to be used mostly for intersections. By huge, I mean as high as 200 000 bit arrays.
The range for each bitset goes from 0 to 16 000 000.
I ran a pre-test with "only" 10 687 bit arrays containing some actual data I've got, and here are the results:
1% of the bit arrays ( 106 bit arrays) Hamming weight: at most 1 bit set
5% of the bit arrays ( 534 bit arrays) Hamming weight: at most 4 bits set
10% of the bit arrays ( 1068 bit arrays) Hamming weight: at most 8 bits set
15% of the bit arrays ( 1603 bit arrays) Hamming weight: at most 12 bits set
20% of the bit arrays ( 2137 bit arrays) Hamming weight: at most 17 bits set
25% of the bit arrays ( 2671 bit arrays) Hamming weight: at most 22 bits set
30% of the bit arrays ( 3206 bit arrays) Hamming weight: at most 28 bits set
35% of the bit arrays ( 3740 bit arrays) Hamming weight: at most 35 bits set
40% of the bit arrays ( 4274 bit arrays) Hamming weight: at most 44 bits set
45% of the bit arrays ( 4809 bit arrays) Hamming weight: at most 55 bits set
50% of the bit arrays ( 5343 bit arrays) Hamming weight: at most 67 bits set
55% of the bit arrays ( 5877 bit arrays) Hamming weight: at most 83 bits set
60% of the bit arrays ( 6412 bit arrays) Hamming weight: at most 103 bits set
65% of the bit arrays ( 6946 bit arrays) Hamming weight: at most 128 bits set
70% of the bit arrays ( 7480 bit arrays) Hamming weight: at most 161 bits set
75% of the bit arrays ( 8015 bit arrays) Hamming weight: at most 206 bits set
80% of the bit arrays ( 8549 bit arrays) Hamming weight: at most 275 bits set
85% of the bit arrays ( 9083 bit arrays) Hamming weight: at most 395 bits set
90% of the bit arrays ( 9618 bit arrays) Hamming weight: at most 640 bits set
95% of the bit arrays (10152 bit arrays) Hamming weight: at most 1453 bits set
96% of the bit arrays (10259 bit arrays) Hamming weight: at most 1843 bits set
97% of the bit arrays (10366 bit arrays) Hamming weight: at most 2601 bits set
98% of the bit arrays (10473 bit arrays) Hamming weight: at most 3544 bits set
99% of the bit arrays (10580 bit arrays) Hamming weight: at most 4992 bits set
100% of the bit arrays (10687 bit arrays) Hamming weight: at most 53153 bits set
Given the numbers involved, I obviously need to use compressed bit arrays, and that is not an issue: compression should stay easy to deal with, given that the bit arrays are "append only".
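To put numbers on why compression is mandatory, the uncompressed cost is easy to work out:

```python
RANGE_BITS = 16_000_000      # bits per uncompressed bit array
N_ARRAYS = 200_000           # target number of bit arrays

bytes_per_array = RANGE_BITS // 8            # 2 000 000 bytes, ~2 MB each
total_bytes = bytes_per_array * N_ARRAYS     # 400 000 000 000 bytes, ~400 GB
```

So the raw representation would be roughly 400 GB in memory, which is completely out of the question.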
The bits that are on are kinda grouped, but not totally: you'll tend to have several bits on in the same area, but usually not one right after another, which makes RLE kinda not great for the on bits.
My question is what kind of compression to use?
Now I don't know if I should put my first approach here or in an answer to my own question.
Basically I imagined a "worst case" scenario using a very dumb encoding:
1 bit: if on, the following 5 bits tell how many bits are needed to encode the 'skip'; if off, optimization: the following 5 bits tell how many bits are to be taken literally (that is, read as 'on' or 'off' directly, without skipping) [literal mode would only be switched to when determined to be more compact than a skip, so whenever it kicks in, it shall always be a size optimization]
5 bits: how many bits are needed to encode the skip value
x bits: the skip itself, i.e. how many bits we skip before the next bit that's on
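To make the scheme concrete, here's a rough Python sketch of the skip-mode encoder (I haven't written the real thing yet; this just mirrors the description above, with the literal-mode optimization left out, and a string of '0'/'1' characters standing in for an actual bit stream):

```python
def encode_skips(positions):
    """Sketch only: encode sorted set-bit positions as skip tokens.
    Each token is: '1' + 5-bit width + width-bit skip value.
    The literal-run optimization from the scheme is omitted here."""
    out = []
    prev = -1
    for pos in sorted(positions):
        skip = pos - prev - 1                    # zeros between the previous set bit and this one
        width = max(skip.bit_length(), 1)        # bits needed to store the skip value
        out.append('1')                          # mode bit: skip
        out.append(format(width, '05b'))         # 5 bits: width of the skip field
        out.append(format(skip, f'0{width}b'))   # the skip itself
        prev = pos                               # after the skip, this bit is implicitly on
    return ''.join(out)

bits = encode_skips([3_098_137, 3_098_141, 3_098_143])
# first token: '1' + '10110' (= 22) + the 22-bit value 3 098 137
```

Those three set bits come out to 43 bits total, versus the 16 000 000 bits of the uncompressed array.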
Here's an example: a bit array has 3 bits set, the first bit being at 3 098 137, the second at 3 098 141 and the third at 3 098 143.
+-- 1: skip mode
| +-- 10110 = 22: the skip value takes 22 bits
| |     +-- 22-bit value = 3 098 137: skip to the first set bit
| |     |                      +-- 0: now we won't skip (literal mode)
| |     |                      | +-- 00110 = 6: read the next 6 bits as-is
| |     |                      | |     +-- 000101: 3 098 141 and 3 098 143 are on
1 10110 1011110100011000011001 0 00110 000101 etc.
Reading it back:
- The first bit is on: we're going to skip bits.
- The next 5 bits (always 5) tell how many bits we need to encode the skip: 10110, i.e. 22.
- 22 bits tell us to skip to 3 098 137.
- One bit off: now we're not skipping bits.
- The next 5 bits (always 5) tell how many bits we'll read "as is": 6.
- 6 bits: off, off, off, on, off, on, meaning 3 098 141 and 3 098 143 are on.
- etc.
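As a sanity check on that walkthrough, here's a small decoder sketch handling both token types (note I use 00110 = 6 for the 5-bit literal count, since per the scheme that field directly holds how many bits to read as-is):

```python
def decode(bits):
    """Sketch only: turn a '0'/'1' token string back into set-bit positions."""
    positions, pos, i = [], -1, 0
    while i < len(bits):
        mode = bits[i]; i += 1
        n = int(bits[i:i + 5], 2); i += 5        # the fixed 5-bit field
        if mode == '1':                          # skip token: n = width of the skip value
            skip = int(bits[i:i + n], 2); i += n
            pos += skip + 1                      # land on a bit that's on
            positions.append(pos)
        else:                                    # literal run: n = number of bits read as-is
            for b in bits[i:i + n]:
                pos += 1
                if b == '1':
                    positions.append(pos)
            i += n
    return positions

stream = '1' + '10110' + '1011110100011000011001' + '0' + '00110' + '000101'
print(decode(stream))   # [3098137, 3098141, 3098143]
```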
Given the amazing sparsity of these bit arrays, this seems quite size-efficient.
So using that encoding, I took my sample data and computed a "worst case" scenario (I haven't written the algorithm yet; I'd rather get a few inputs from here first): basically I assumed that not only would the "size optimization" never kick in, but also that the 5-bit field would always indicate its maximum useful value of 24 bits (since 16 000 000 fits in 24 bits), which of course cannot actually happen.
I did it just to have a very crude approximation of what the "worst of the worst" case could be.
I was very pleasantly surprised:
Worst case scenario:
108 913 290 bits needed for the 10 687 very sparse bit arrays
12.9 MB (13 295 KB)
The data being actual data and all the data being similar, I know that, if worst comes to worst, I could store my 200 000 bit arrays in about 240 MB, which is fine.
I'm pretty sure the actual encoding will come to way less than that, but as I haven't actually written it yet, I can only (very easily) compute the "worst case", which is why I only show that one.
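That worst-case figure is pure arithmetic: every set bit costs at most 1 (mode bit) + 5 (width field) + 24 (skip value) = 30 bits, which incidentally puts my sample at about 3.63 million set bits in total:

```python
RANGE_MAX = 16_000_000
SKIP_BITS = RANGE_MAX.bit_length()    # 24: enough bits for any skip in the range

def worst_case_bits(total_set_bits):
    # every set bit encoded as a full-width skip token; literal mode never kicks in
    return total_set_bits * (1 + 5 + SKIP_BITS)

# 108 913 290 bits / 30 bits per set bit = 3 630 443 set bits in the sample
```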
Any hints / ideas as to how to make this more size-efficient (remembering these are super-sparse bit arrays, that there shall be hundreds of thousands of them, that they must be in memory, and that they shall be "append only")?
About my 'append-only' case
Basically I've got one growing "expanse" (the range, but "expanse" is the actual term as I understand it) and a lot of bit arrays that each have a few bits set. When the range goes from, say, 0 to 1 000 000, all the bit arrays go from 0 to 1 000 000 too. When the range grows to 1 000 001, all the bit arrays grow too, each by one bit. But most of these bit arrays will have a '0' appended at their end, while about 4 to 8 of them will have a '1' appended. However, I cannot predict in advance which of the bit arrays will get a 0 and which will get a 1.
So I've got a lot of bit arrays that all have the same size, that are all very sparse (< 0.5% of their bits set), and that are all "growing" as the range grows (so they're always all growing at the same rate).
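Given that growth model, even the simplest representation (per-array sorted lists of set positions) already supports append-only growth and two-pointer intersections. A hypothetical sketch (class and method names are mine):

```python
class AppendOnlyBitset:
    """Hypothetical sketch: store only the positions of the set bits.
    'Append only' means new set positions always arrive past the current end."""

    def __init__(self):
        self.positions = []                  # sorted, strictly increasing

    def append_set(self, pos):
        # the growing "expanse" guarantees pos is beyond every stored position
        assert not self.positions or pos > self.positions[-1]
        self.positions.append(pos)

    def intersect(self, other):
        # classic two-pointer merge of two sorted position lists
        out, i, j = [], 0, 0
        a, b = self.positions, other.positions
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out
```

In practice the positions would be delta-compressed rather than stored as plain integers, but the append and intersection patterns stay the same.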
Judy arrays are great, but I read about them a few years ago and that stuff was "above my head". Judy is a C-only, roughly 20 KLOC library, and I'm definitely not re-implementing that. But they're amazing.
So I guess I need to add that I'd like all this to stay relatively simple, which is not that far-fetched given the special "append only" property of my very sparse bit arrays.