Is there a way to store gzip's dictionary from a file?
I've been doing some research on compression-based text classification, and I'm trying to figure out a way to store the dictionary built by the encoder on a training file so that it can later be applied 'statically' to a test file. Is this at all possible using UNIX's gzip utility?

For example, I have two 'class' files, sport.txt and atheism.txt. I want to compress both files and store the dictionaries they produce. Then I want to take a test file (unlabelled; it could be either atheism or sport), apply each prebuilt dictionary to it, and analyse how well test.txt compresses under each dictionary/model.

Thanks

Marciamarciano answered 8/3, 2013 at 13:26 Comment(0)

deflate encoders, as in gzip and zlib, do not "build" a dictionary. They simply use the previous 32K bytes as a source for potential matches to the string of bytes starting at the current position. The last 32K bytes are called the "dictionary", but the name is perhaps misleading.

You can use zlib to experiment with preset dictionaries. See the deflateSetDictionary() and inflateSetDictionary() functions. In that case, zlib compression is primed with a "dictionary" of 32K bytes that effectively precede the first byte being compressed as a source for matches, but the dictionary itself is not compressed. The priming can only improve the compression of the first 32K bytes. After that, the preset dictionary is too far back to provide matches.

gzip provides no support for preset dictionaries.
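Python's standard-library zlib module exposes this same preset-dictionary mechanism (zdict wraps deflateSetDictionary/inflateSetDictionary), so the priming effect is easy to demonstrate; the sample strings below are purely illustrative:

```python
import zlib

# A preset "dictionary": up to 32K of bytes expected to resemble the data.
preset = b"the quick brown fox jumps over the lazy dog " * 8

data = b"the quick brown fox jumps over the lazy dog"

# Compress the same payload with and without the preset dictionary.
plain = zlib.compress(data)
comp = zlib.compressobj(zdict=preset)
primed = comp.compress(data) + comp.flush()

# Early bytes can now be coded as matches into the preset dictionary,
# so the primed stream is shorter for dictionary-like data.
decomp = zlib.decompressobj(zdict=preset)
assert decomp.decompress(primed) == data
```

Note that the decompressor must be given the identical dictionary bytes, which is why the dictionary has to be stored (or agreed on) separately from the compressed stream.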

Humdrum answered 8/3, 2013 at 16:10 Comment(3)
Thank you very much for the info, I didn't realise that is how gzip works. Would it be possible using LZ78 (for example using the compress utility) or LZW methods? – Marciamarciano
LZW is ineffective as compared to modern methods. It would not be worth considering. To the extent that I understand what you're trying to do, you can use the deflate 32K dictionary for that. What you would do is to identify common strings in your representative data, and then pack a 32K dictionary with those strings. The compression algorithms for deflate won't help you do that. You would need to write your own code to find those common strings. – Humdrum
Mark, can I ask another quick question? Something basic I can't get my head around. Does gzip read from the beginning of the file to the end, or the end of the file to the beginning when compressing? – Marciamarciano

As of 2023, you can experiment with zstd easily. Unlike gzip, zstd builds a compression dictionary and provides methods to generate and store the dictionary.

Here is an example with the Python binding python-zstandard: https://python-zstandard.readthedocs.io/

import zstandard

ENCODING = "UTF-8"

# Build a raw-content dictionary directly from the training text.
training_data = "my training text"
dictionary = zstandard.ZstdCompressionDict(
    training_data.encode(ENCODING),
    dict_type=zstandard.DICT_TYPE_RAWCONTENT,
)

# Prime a compressor with that dictionary and compress the test text.
compressor = zstandard.ZstdCompressor(dict_data=dictionary)
test_data = "my test text"
compressed = compressor.compress(test_data.encode(ENCODING))
compressed_length = len(compressed)

The ftcc project implements this approach end to end and provides accuracy benchmarks.
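Whichever compressor is used, the scoring step the question describes reduces to "compress the test text under each class's dictionary and pick the class that yields the fewest bytes". A minimal sketch using the standard library's zlib preset dictionaries (the toy training texts are hypothetical stand-ins; ftcc does the real version with zstd):

```python
import zlib

def compressed_size(text: bytes, dictionary: bytes) -> int:
    # Size of `text` when deflate is primed with `dictionary`.
    comp = zlib.compressobj(zdict=dictionary)
    return len(comp.compress(text) + comp.flush())

def classify(test_text: bytes, class_dicts: dict) -> str:
    # Pick the class whose dictionary compresses the test text best.
    return min(class_dicts,
               key=lambda label: compressed_size(test_text, class_dicts[label]))

# Hypothetical stand-ins for sport.txt and atheism.txt training data.
class_dicts = {
    "sport": b"match team goal score season league cup player " * 100,
    "atheism": b"belief argument religion god doctrine evidence " * 100,
}

label = classify(b"the team scored the winning goal late in the match", class_dicts)
```

A deflate zdict is capped at 32K, so for larger training files you would select representative substrings rather than pass the whole file.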

Disclaimer: I am the author of the ftcc project.

Huh answered 23/7, 2023 at 12:50 Comment(0)
