Python ungzipping stream of bytes?

Here is the situation:

  • I get gzipped XML documents from Amazon S3

      import boto
      from boto.s3.connection import S3Connection
      from boto.s3.key import Key
      conn = S3Connection('access Id', 'secret access key')
      b = conn.get_bucket('mydev.myorg')
      k = Key(b)
      k.key = 'documents/document.xml.gz'
    
  • I read them into a file like this:

      import gzip
      f = open('/tmp/p', 'wb')  # write the gzipped bytes in binary mode
      k.get_file(f)
      f.close()
      r = gzip.open('/tmp/p', 'rb')
      file_content = r.read()
      r.close()
    

Question

How can I ungzip the streams directly and read the contents?

I do not want to create temp files; it doesn't feel clean.

Underthrust answered 24/9, 2012 at 19:51 Comment(0)

Yes, you can use the zlib module to decompress byte streams:

import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
    if dec.unused_data:
        # decompress and yield the remainder
        yield dec.flush()

Adding 32 to MAX_WBITS tells zlib to expect a gzip header (rather than a plain zlib header) and to skip it automatically.

The boto S3 Key object is iterable, so you can do:

for data in stream_gzip_decompress(k):
    # do something with the decompressed data
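
If you are on boto3 rather than boto, the same generator can be fed from the streaming body in fixed-size chunks. A minimal sketch, assuming the bucket and key names from the question and that the botocore streaming body exposes iter_chunks():

import boto3

s3 = boto3.resource('s3')
obj = s3.Object('mydev.myorg', 'documents/document.xml.gz')  # names from the question
body = obj.get()['Body']  # botocore StreamingBody

# iter_chunks() yields raw gzipped bytes; decompress them as they arrive
for data in stream_gzip_decompress(body.iter_chunks()):
    handle(data)  # handle() is a hypothetical callback for the decompressed bytes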
Orbit answered 24/9, 2012 at 20:0 Comment(3)
Would this need a call to dec.flush() at the end to make sure to not miss any data?Quadripartite
@MichalCharemza: yes, that's a very good point. Add if dec.unused_data: yield dec.flush() to the end.Orbit
There might be a gotcha in the above? If the source is made up of concatenated gzip streams, then I think all after the first are silently eaten because the dobj has reached its eof. While there might be cases where it's desirable to only decompress the first, the gunzip command line program does decompress multiple I think - so a gotcha if you expect equivalence. (Which I did! Just encountered a file in the wild that we're suspecting had some sort of flush every 100,000 rows of the underlying data)Quadripartite
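
To address the concatenated-streams gotcha raised in the comment above, here is a minimal sketch (Python 3 only; it relies on the eof and unused_data attributes of zlib.decompressobj) that starts a fresh decompressor whenever one gzip member ends, so later members are not silently dropped:

import zlib

def stream_multi_gzip_decompress(stream):
    # Auto-detect and skip the gzip header, as above.
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
    for chunk in stream:
        data = chunk
        while data:
            rv = dec.decompress(data)
            if rv:
                yield rv
            if dec.eof:
                # Bytes left over after this member belong to the next member:
                # hand them to a brand-new decompressor.
                data = dec.unused_data
                dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
            else:
                data = b""
    tail = dec.flush()
    if tail:
        yield tail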

I had to do the same thing and this is how I did it:

import gzip
import StringIO  # Python 2; a Python 3 variant with io.BytesIO is sketched below

f = StringIO.StringIO()
k.get_file(f)
f.seek(0)  # this is crucial: rewind the buffer before gzip reads it
gzf = gzip.GzipFile(fileobj=f)
file_content = gzf.read()
Slipslop answered 18/10, 2012 at 14:41 Comment(2)
what is k here?Flowerlike
k is the Key; see original question k = Key(b)Rhinelandpalatinate
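
A Python 3 variant of the same idea, sketched with io.BytesIO (k is assumed to be the boto Key from the question, and get_file() is assumed to accept any writable file-like object):

import gzip
import io

f = io.BytesIO()
k.get_file(f)  # download the gzipped object into the in-memory buffer
f.seek(0)      # rewind before handing the buffer to gzip
gzf = gzip.GzipFile(fileobj=f)
file_content = gzf.read()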

For Python 3.x and boto3:

I used BytesIO to read the compressed file into a buffer object, then used zipfile to open that buffer as an uncompressed stream, which let me read the data line by line.

import io
import zipfile
import boto3
import sys

s3 = boto3.resource('s3', 'us-east-1')


def stream_zip_file():
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    buffer = io.BytesIO(obj.get()["Body"].read())  # note: reads the whole object into memory
    print(buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()


if __name__ == '__main__':
    stream_zip_file()
Jaysonjaywalk answered 26/9, 2017 at 21:4 Comment(2)
I noticed that the memory consumption increases significantly when we do buffer = io.BytesIO(obj.get()["Body"].read()). However, reading in smaller portions with read(1024) keeps the memory usage low!Airdrop
buffer = io.BytesIO(obj.get()["Body"].read()) reads the whole file into memory.Authorization

You can try a PIPE and read the contents without downloading the file:

    import subprocess
    c = subprocess.Popen('zcat -c <gzip file name>', shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for row in c.stdout:
        print(row)

In addition, "/dev/fd/" + str(c.stdout.fileno()) gives you the path of a named pipe (FIFO) that can be passed to another program.

Vaticination answered 24/9, 2012 at 21:1 Comment(4)
but how would you pass the zipped bytes coming from S3 to zcat?Extraneous
zcat is not particularly portable, you'd better use gunzip -cTowle
Shelling out to do this with all the overhead of process setup and the like is absolutely the wrong thing to do.Junkojunkyard
I absolutely agree, CW. But I love to see it offered.Colorless
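
Following up on the comments: if you already have the gzipped bytes from S3 in memory, you could pipe them to gunzip's stdin instead of naming a file on disk. A hedged sketch, where gzipped_bytes is a hypothetical variable holding the raw .gz content fetched from S3:

import subprocess

p = subprocess.Popen(['gunzip', '-c'], stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE)
out, _ = p.communicate(gzipped_bytes)  # gzipped_bytes: raw gzip data from S3
# out now holds the decompressed bytes

As the comments note, this still pays the cost of spawning a subprocess, so the pure-zlib approach above is usually preferable.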

I did it this way for gzip files:

import gzip
import boto3

s3 = boto3.resource('s3')
obj = s3.Object(bucket_name='Bucket', key='file.gz')
with gzip.GzipFile(fileobj=obj.get()["Body"]) as file:
    for line_bytes in file:
        print(line_bytes)
Parotitis answered 29/2, 2024 at 12:44 Comment(0)
