marisa trie suffix compression?
Asked Answered
W

1

10

I'm using a custom Cython wrapper of this marisa trie library as a key-value multimap.

My trie entries look like key 0xff data1 0xff data2 to map key to the tuple (data1, data2). data1 is a string of variable length but data2 is always a 4-byte unsigned int. The 0xff is a delimiter byte.

I know a trie is not the most optimal data structure for this from a theoretical point of a view, but various practical considerations make it the best available choice.

In this use case, I have about 10-20 million keys, each one has on average 10 data points. data2 is redundant for many entries (in some cases, data2 is always the same for all data points for a given key), so I had the idea of taking the most frequent data2 entry and adding a ("", base_data2) data point to each key.

Since a MARISA trie, to my knowledge, does not have suffix compression and for a given key each data1 is unique, I assumed that this would save 4 bytes per data tuple that uses a redundant key (plus adding in a single 4-byte "value" for each key). Having rebuilt the trie, I checked that the redundant data was no longer being stored. I expected a sizable decrease in both serialized and in-memory size, but in fact the on-disk trie went from 566MB to 557MB (and a similar reduction in RAM usage for a loaded trie).

From this I concluded that I must be wrong about there being no suffix compression. I was now storing the entries with a redundant data2 number as key 0xff data1 0xff, so to test this theory I removed the trailing 0xff and adjusted the code that uses the trie to cope. The new trie went down from 557MB to 535MB.

So removing a single redundant trailing byte made a 2x larger improvement than removing the same number of 4-byte sequences, so either the suffix compression theory is dead wrong, or it's implemented in some very convoluted way.

My remaining theory is that adding in the ("", base_data2) entry at a higher point in the trie somehow throws off the compression in some terrible way, but it should just be adding in 4 more bytes when I've removed many more than that from lower down in the trie.

I'm not optimistic for a fix, but I'd dearly like to know why I'm seeing this behavior! Thank you for your attention.

Wivinia answered 3/7, 2017 at 23:46 Comment(1)
This smells a lot like memory alignment/padding, but I couldn't find a smoking gun in the MARISA code base. (I haven't checked the python wrapper yet though).Focus
F
6

As I suspected, it's caused by padding.

in lib/marisa/grimoire/vector/vector.h, there is the following function:

void write_(Writer &writer) const {
  writer.write((UInt64)total_size());
  writer.write(const_objs_, size_);
  writer.seek((8 - (total_size() % 8)) % 8);
}

The key point is: writer.seek((8 - (total_size() % 8)) % 8);. After writing each chunk, the writer pads to the next 8 bytes boundary.

This explains the behavior you are seeing, as part of the data removed by the initial shortening of the key was replaced with padding.

When you removed the extra byte, it brought the key size below the next boundary limit, resulting in a major size change.

Practically, what this means is that, since the padding code is in the serialization part of the library, you are probably getting the in-memory savings you were expecting, but that did not translate into on-disk savings. Monitoring program RAM usage should confirm that.

If disk size is your concern, then you might as well simply deflate the serialized data, as it seems MARISA does not apply any compression whatsoever.

Focus answered 12/7, 2017 at 4:53 Comment(2)
Fascinating! Thank you for digging through and finding that. Do you have any insight as to why that padding is being applied? It's slightly odd that the scores I was removing were 4 bytes in length, not 8, so removing the extra byte shouldn't bring down the size any more than removing any other byte, if the key lengths are homogeneously distributed.Wivinia
Generally, padding like this is added so that the resulting data can be casted to the desired type instead of being copied into a stack variable. MARISA doesn't do that. Also, padding is generally added before data that needs to be aligned, not after. It kinda looks a bit funky to me. As for your second question: Since it's the final buffer that's aligned, that behavior is really hard to predict. Maybe it's padding 5 bytes every 3 keys or something like that. It's really hard to gauge, and will probably vary a lot from dataset to dataset.Focus

© 2022 - 2024 — McMap. All rights reserved.