How to decompress mongo journal files

From what I have found, the journal files created by MongoDB are compressed with the Snappy compression algorithm, but I am not able to decompress one of these journal files. Trying to decompress it gives this error:

Error stream missing snappy identifier

The Python code I used to decompress it is as follows:

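import collections
import bson
from bson.codec_options import CodecOptions
import snappy
from io import BytesIO

try:
    # Journal files are binary, so open in 'rb' mode and read the raw bytes.
    with open('journal/WiredTigerLog.0000000011', 'rb') as f:
        content = f.read()
    fh = BytesIO()
    # stream_decompress expects the Snappy streaming (framed) format.
    snappy.stream_decompress(BytesIO(content), fh)
    # Print the decompressed bytes rather than the buffer object itself.
    print(fh.getvalue())
except Exception as e:
    print(str(e))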

Please help; I can't work out how to get past this.

Universalism answered 6/2, 2017 at 11:22 Comment(3)
Maybe your journal isn't compressed. Try opening it in a hex editor and see if you can read your plain data. - Cerebrum
Ditto what @RetoAebersold said. It seems not to be finding the expected Snappy header. - Goines
Tried your code snippet and it worked on framed Snappy data. Adding to what others noted, if you open the file in a hex editor, it should be apparent whether it's Snappy framed data. The signature (starting at file offset zero) is \377\006\0\0sNaPpY as given in the *nix magic file, or ff06 0000 734e 6150 7059 in hex. Perhaps the WiredTiger storage engine is writing with a different compression option? - Abhenry
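
The hex-editor check suggested in the comments above can also be scripted. A minimal sketch in Python 3, assuming the 10-byte signature quoted above; the constant and function names are illustrative and not part of any library:

```python
# Stream identifier quoted in the comment above:
# ff 06 00 00 followed by the ASCII bytes "sNaPpY".
SNAPPY_STREAM_IDENTIFIER = b"\xff\x06\x00\x00sNaPpY"


def has_snappy_stream_identifier(path):
    """Return True if the file at `path` starts with the framed-Snappy signature."""
    with open(path, "rb") as f:
        head = f.read(len(SNAPPY_STREAM_IDENTIFIER))
    return head == SNAPPY_STREAM_IDENTIFIER
```

If has_snappy_stream_identifier('journal/WiredTigerLog.0000000011') returns False, that would be consistent with the "stream missing snappy identifier" error from the question.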

There are two forms of Snappy compression: the basic form and the streaming form. The basic form has the limitation that all of the data must fit in memory, so the streaming form exists to compress larger amounts of data. The streaming format has a header followed by subranges that are compressed separately. If the header is missing, it sounds like the data may have been compressed using the basic form and you are trying to decompress it with the streaming form. See https://github.com/andrix/python-snappy/issues/40

If that is the case, use decompress instead of stream_decompress.
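
A minimal sketch of that choice in Python 3, assuming python-snappy is installed. The helper name decompress_either is made up for illustration; snappy.decompress and snappy.StreamDecompressor are the python-snappy calls for the basic and streaming forms. This only applies to a buffer that really is Snappy data in one of the two forms; whether an entire journal file qualifies is not established here.

```python
# Stream identifier that opens data in the streaming (framed) form.
STREAM_IDENTIFIER = b"\xff\x06\x00\x00sNaPpY"


def decompress_either(data):
    """Decompress `data` with the streaming form if its header is present,
    otherwise fall back to the basic form."""
    # python-snappy is a third-party package; it is imported here so the
    # dependency is only needed when the function is actually called.
    import snappy

    if data.startswith(STREAM_IDENTIFIER):
        # Streaming form: header followed by separately compressed chunks.
        return snappy.StreamDecompressor().decompress(data)
    # Basic form: a single compressed block with no stream header.
    return snappy.decompress(data)
```

If the input is in neither form, the basic-form call is expected to raise an error rather than return data, so the call is worth wrapping in a try/except when probing unknown files.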

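But it could be that the data isn't compressed at all, in which case simply reading the file could work:

# Open in binary mode, since a journal file is not text.
with open('journal/WiredTigerLog.0000000011', 'rb') as f:
    for line in f:
        print(line)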
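
Printing a binary file line by line can be hard to read. As an alternative, here is a minimal sketch in Python 3 that pulls out runs of printable ASCII, roughly what the Unix strings tool does. The function name and the min_len default are illustrative choices, not taken from the answer:

```python
import re


def printable_runs(data, min_len=4):
    """Return runs of at least `min_len` printable ASCII bytes found in `data`."""
    # Bytes 0x20-0x7e are the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]
```

For example, read the journal with open('journal/WiredTigerLog.0000000011', 'rb') and pass f.read() to printable_runs. Seeing your own field or collection names in the output hints that at least some records are stored uncompressed, though this is only a rough indication because Snappy output can also contain literal runs of the original bytes.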

Note that the minimum log record size for WiredTiger is 128 bytes. If a log record is 128 bytes or smaller, WiredTiger does not compress that record. See https://docs.mongodb.com/manual/core/journaling/

Officiary answered 20/2, 2017 at 13:7 Comment(1)
Since WiredTiger only compresses records that are larger than 128 bytes, how can we detect which lines are compressed and which are not? - Danieledaniell
