I'm using a custom Cython wrapper of this marisa trie library as a key-value multimap.
My trie entries look like key 0xff data1 0xff data2
to map key
to the tuple (data1, data2)
. data1
is a string of variable length but data2
is always a 4-byte unsigned int. The 0xff
is a delimiter byte.
I know a trie is not the most optimal data structure for this from a theoretical point of a view, but various practical considerations make it the best available choice.
In this use case, I have about 10-20 million keys, each one has on average 10 data points. data2
is redundant for many entries (in some cases, data2
is always the same for all data points for a given key), so I had the idea of taking the most frequent data2
entry and adding a ("", base_data2)
data point to each key.
Since a MARISA trie, to my knowledge, does not have suffix compression and for a given key each data1
is unique, I assumed that this would save 4 bytes per data tuple that uses a redundant key (plus adding in a single 4-byte "value" for each key). Having rebuilt the trie, I checked that the redundant data was no longer being stored. I expected a sizable decrease in both serialized and in-memory size, but in fact the on-disk trie went from 566MB to 557MB (and a similar reduction in RAM usage for a loaded trie).
From this I concluded that I must be wrong about there being no suffix compression. I was now storing the entries with a redundant data2
number as key 0xff data1 0xff
, so to test this theory I removed the trailing 0xff
and adjusted the code that uses the trie to cope. The new trie went down from 557MB to 535MB.
So removing a single redundant trailing byte made a 2x larger improvement than removing the same number of 4-byte sequences, so either the suffix compression theory is dead wrong, or it's implemented in some very convoluted way.
My remaining theory is that adding in the ("", base_data2)
entry at a higher point in the trie somehow throws off the compression in some terrible way, but it should just be adding in 4 more bytes when I've removed many more than that from lower down in the trie.
I'm not optimistic for a fix, but I'd dearly like to know why I'm seeing this behavior! Thank you for your attention.