Python ungzipping stream of bytes?

Here is the situation:

  • I get gzipped XML documents from Amazon S3

      import boto
      from boto.s3.connection import S3Connection
      from boto.s3.key import Key
      conn = S3Connection('access Id', 'secret access key')
      b = conn.get_bucket('mydev.myorg')
      k = Key(b)
      k.key = 'documents/document.xml.gz'
    
  • I read them into a file like this:

      import gzip
      f = open('/tmp/p', 'wb')  # write the gzipped bytes in binary mode
      k.get_file(f)
      f.close()
      r = gzip.open('/tmp/p', 'rb')
      file_content = r.read()
      r.close()
    

Question

How can I ungzip the streams directly and read the contents?

I do not want to create temp files; it doesn't feel clean.

Underthrust answered 24/9, 2012 at 19:51 Comment(0)

Yes, you can use the zlib module to decompress byte streams:

import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
    if dec.unused_data:
        # decompress and yield the remainder
        yield dec.flush()

Adding 32 to MAX_WBITS tells zlib to expect a gzip header (rather than a plain zlib header) and to skip it automatically.

The boto S3 Key object is iterable, so you can do:

for data in stream_gzip_decompress(k):
    # do something with the decompressed data
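
If you are on boto3 rather than boto, the same generator can be fed from the streaming body in fixed-size chunks. A minimal sketch, assuming the bucket and key names from the question and that the botocore streaming body exposes iter_chunks():

import boto3

s3 = boto3.resource('s3')
obj = s3.Object('mydev.myorg', 'documents/document.xml.gz')  # names from the question
body = obj.get()['Body']  # botocore StreamingBody

# iter_chunks() yields raw gzipped bytes; decompress them as they arrive
for data in stream_gzip_decompress(body.iter_chunks()):
    handle(data)  # handle() is a hypothetical callback for the decompressed bytes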
Orbit answered 24/9, 2012 at 20:0 Comment(3)
Would this need a call to dec.flush() at the end to make sure to not miss any data?Quadripartite
@MichalCharemza: yes, that's a very good point. Add if dec.unused_data: yield dec.flush() to the end.Orbit
There might be a gotcha in the above? If the source is made up of concatenated gzip streams, then I think all after the first are silently eaten because the dobj has reached its eof. While there might be cases where it's desirable to only decompress the first, the gunzip command line program does decompress multiple I think - so a gotcha if you expect equivalence. (Which I did! Just encountered a file in the wild that we're suspecting had some sort of flush every 100,000 rows of the underlying data)Quadripartite
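
To address the concatenated-streams gotcha raised in the comment above, here is a minimal sketch (Python 3 only; it relies on the eof and unused_data attributes of zlib.decompressobj) that starts a fresh decompressor whenever one gzip member ends, so later members are not silently dropped:

import zlib

def stream_multi_gzip_decompress(stream):
    # Auto-detect and skip the gzip header, as above.
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
    for chunk in stream:
        data = chunk
        while data:
            rv = dec.decompress(data)
            if rv:
                yield rv
            if dec.eof:
                # Bytes left over after this member belong to the next member:
                # hand them to a brand-new decompressor.
                data = dec.unused_data
                dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
            else:
                data = b""
    tail = dec.flush()
    if tail:
        yield tail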

I had to do the same thing and this is how I did it:

import gzip
import StringIO  # Python 2; a Python 3 variant with io.BytesIO is sketched below

f = StringIO.StringIO()
k.get_file(f)
f.seek(0)  # this is crucial: rewind the buffer before gzip reads it
gzf = gzip.GzipFile(fileobj=f)
file_content = gzf.read()
Slipslop answered 18/10, 2012 at 14:41 Comment(2)
what is k here?Flowerlike
k is the Key; see original question k = Key(b)Rhinelandpalatinate
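
A Python 3 variant of the same idea, sketched with io.BytesIO (k is assumed to be the boto Key from the question, and get_file() is assumed to accept any writable file-like object):

import gzip
import io

f = io.BytesIO()
k.get_file(f)  # download the gzipped object into the in-memory buffer
f.seek(0)      # rewind before handing the buffer to gzip
gzf = gzip.GzipFile(fileobj=f)
file_content = gzf.read()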

For Python 3.x and boto3:

I used BytesIO to read the compressed file into a buffer object, then used zipfile to open that buffer as an uncompressed stream, which let me read the data line by line.

import io
import zipfile
import boto3
import sys

s3 = boto3.resource('s3', 'us-east-1')


def stream_zip_file():
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    buffer = io.BytesIO(obj.get()["Body"].read())  # note: reads the whole object into memory
    print(buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()


if __name__ == '__main__':
    stream_zip_file()
Jaysonjaywalk answered 26/9, 2017 at 21:4 Comment(2)
I noticed that the memory consumption increases significantly when we do buffer = io.BytesIO(obj.get()["Body"].read()). However, reading in smaller portions with read(1024) keeps the memory usage low!Airdrop
buffer = io.BytesIO(obj.get()["Body"].read()) reads the whole file into memory.Authorization

You can try a PIPE and read the contents without downloading the file:

    import subprocess
    c = subprocess.Popen('zcat -c <gzip file name>', shell=True,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for row in c.stdout:
        print(row)

In addition, "/dev/fd/" + str(c.stdout.fileno()) gives you the path of a named pipe (FIFO) that can be passed to another program.

Vaticination answered 24/9, 2012 at 21:1 Comment(4)
but how would you pass the zipped bytes coming from S3 to zcat?Extraneous
zcat is not particularly portable, you'd better use gunzip -cTowle
Shelling out to do this with all the overhead of process setup and the like is absolutely the wrong thing to do.Junkojunkyard
I absolutely agree, CW. But I love to see it offered.Colorless
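
Following up on the comments: if you already have the gzipped bytes from S3 in memory, you could pipe them to gunzip's stdin instead of naming a file on disk. A hedged sketch, where gzipped_bytes is a hypothetical variable holding the raw .gz content fetched from S3:

import subprocess

p = subprocess.Popen(['gunzip', '-c'], stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE)
out, _ = p.communicate(gzipped_bytes)  # gzipped_bytes: raw gzip data from S3
# out now holds the decompressed bytes

As the comments note, this still pays the cost of spawning a subprocess, so the pure-zlib approach above is usually preferable.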

I did it this way for gzip files:

import gzip
import boto3

s3 = boto3.resource('s3')
obj = s3.Object(bucket_name='Bucket', key='file.gz')
with gzip.GzipFile(fileobj=obj.get()["Body"]) as file:
    for line_bytes in file:
        print(line_bytes)
Parotitis answered 29/2, 2024 at 12:44 Comment(0)
