Python LZMA : Compressed data ended before the end-of-stream marker was reached

I am using Python's built-in lzma module to decompress chunks of data. For some chunks, I get the following exception:

Compressed data ended before the end-of-stream marker was reached

The data is NOT corrupted. It can be decompressed correctly with other tools, so it must be a bug in the library. Other people have reported the same issue.

Unfortunately, none of them seems to have found a solution yet, at least not one that works on Python 3.5.

How can I solve this problem? Is there any workaround?

Beaverboard answered 23/5, 2016 at 21:7 Comment(0)

I spent a lot of time trying to understand and solve this problem, so I thought it would be a good idea to share what I found. The problem seems to be caused by a chunk of data that lacks a properly set end-of-stream (EOF) marker. To decompress a buffer, I used to call lzma.decompress from Python's lzma library. However, this function expects each data buffer to contain an end-of-stream marker; otherwise it raises an LZMAError exception.

To work around this limitation, we can implement an alternative decompress function that uses an LZMADecompressor object to extract the data from a buffer. For example:
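
A minimal way to reproduce the error, assuming a missing marker behaves like a stream cut short before its end (the payload and the 4-byte cut are made up for illustration, not taken from the original data):

```python
import lzma

payload = b"example payload " * 1000          # hypothetical test data
truncated = lzma.compress(payload)[:-4]       # drop the last bytes of the stream

message = None
try:
    lzma.decompress(truncated)                # strict: needs a complete stream
except lzma.LZMAError as exc:
    message = str(exc)

print(message)
```

On CPython this should print the same message as in the question, since lzma.decompress checks the decompressor's eof flag after consuming the input.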

from lzma import FORMAT_AUTO, LZMADecompressor, LZMAError


def decompress_lzma(data):
    results = []
    while True:
        # Use a fresh decompressor per stream so concatenated streams are handled.
        decomp = LZMADecompressor(FORMAT_AUTO, None, None)
        try:
            res = decomp.decompress(data)
        except LZMAError:
            if results:
                break  # Leftover data is not a valid LZMA/XZ stream; ignore it.
            else:
                raise  # Error on the first iteration; bail out.
        results.append(res)
        data = decomp.unused_data
        # Stop once the input is exhausted, even if no end-of-stream marker was seen.
        if not data:
            break
        if not decomp.eof:
            raise LZMAError("Compressed data ended before the end-of-stream marker was reached")
    return b"".join(results)

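This function is similar to the one in the standard lzma library, with one key difference: the loop exits as soon as the entire buffer has been consumed, before checking whether the end-of-stream marker was reached. As a result, a buffer that ends without the marker returns whatever was decoded instead of raising an exception.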
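
The behaviour this relies on can be seen with LZMADecompressor directly. The sketch below uses an artificially truncated XZ stream as a stand-in for a buffer without the marker (the payload and cut length are assumptions for illustration):

```python
import lzma

payload = b"example payload " * 1000          # hypothetical test data
truncated = lzma.compress(payload)[:-4]       # stream cut short before its end

decomp = lzma.LZMADecompressor(lzma.FORMAT_AUTO)
recovered = decomp.decompress(truncated)      # no exception: it just wants more input

# eof stays False because the end of the stream was never seen,
# and unused_data is empty because all input was consumed.
print(decomp.eof, len(decomp.unused_data), len(recovered))
```

Because the decompressor returns what it has decoded so far and only reports the missing marker through its eof attribute, skipping that check is what lets the function above return the data. The trade-off is that a genuinely truncated buffer is also accepted silently, so the output may be incomplete.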

I hope this can be useful to other people.

Beaverboard answered 23/5, 2016 at 21:7 Comment(2)
Interesting. In this case I would recommend checking the spec for the algorithm. It sounds like other tools might be more tolerant to incorrectly encoded buffers or faulty buffer copying. Depending on the spec, it's possible that the error is in the encoding and/or transmission, NOT decoding. I'm just making a suggestion, though. Could be way off. - Carte
This was a major help. Our existing workarounds were to call 7z via subprocess. This is much better. Thanks, Giuseppe! - Sportscast
