How do I gzip compress a string in Python?
K

6

101

How do I gzip compress a string in Python?

gzip.GzipFile exists, but that's for file objects - what about with plain strings?

Kenway answered 14/12, 2011 at 15:16 Comment(6)
@KevinDTimm, that docu only mentions StringIO but does not really explain how to do it. So asking that question here is completely valid, IMHO. Some more trials before asking and telling us about them would have been nice, though.Buttonhole
@Buttonhole - the question was closed 4 years ago for much the same reason as my comment - the OP made no effort to search first.Worried
Of course you are right, @KevinDTimm.Buttonhole
How is this off-topic?Golter
This question is the top hit in google now for gzip string in python and is very reasonable IMO. It should be re-opened.Yasmineyasu
As above, this question is the top result in a google search, and one of the answers is correct - it really seems as though it shouldn't be closed.Rabat
E
168

If you want to produce a complete gzip-compatible binary string, with the header etc., you could use gzip.GzipFile together with an in-memory buffer (StringIO on Python 2, BytesIO on Python 3):

try:
    from StringIO import StringIO as BytesIO  # Python 2.7
except ImportError:
    from io import BytesIO  # Python 3.x
import gzip

out = BytesIO()
# Leaving the with block closes the GzipFile, which writes the gzip trailer into out
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(b"This is mike number one, isn't this a lot of fun?")
out.getvalue()

# returns '\x1f\x8b\x08\x00\xbd\xbe\xe8N\x02\xff\x0b\xc9\xc8,V\x00\xa2\xdc\xcc\xecT\x85\xbc\xd2\xdc\xa4\xd4"\x85\xfc\xbcT\x1d\xa0X\x9ez\x89B\tH:Q!\'\xbfD!?M!\xad4\xcf\x1e\x00w\xd4\xea\xf41\x00\x00\x00'
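For Python 3, the same idea can be wrapped in a pair of helpers. This is only a minimal sketch, assuming UTF-8 text; the names gzip_text and gunzip_text are hypothetical, and io.BytesIO is used as the buffer because the compressed output is bytes.

```python
import gzip
import io


def gzip_text(text, encoding="utf-8"):
    """Return `text` as gzip-framed bytes, built entirely in memory."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(text.encode(encoding))
    # The gzip trailer is written on close, so read the buffer afterwards.
    return buf.getvalue()


def gunzip_text(data, encoding="utf-8"):
    """Inverse of gzip_text: decompress gzip bytes back to str."""
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb") as f:
        return f.read().decode(encoding)


original = "This is mike number one, isn't this a lot of fun?"
packed = gzip_text(original)
restored = gunzip_text(packed)
```

The exact bytes in packed vary from run to run because the gzip header embeds a timestamp.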
Easygoing answered 14/12, 2011 at 15:23 Comment(6)
The opposite of this is: def gunzip_text(text): infile = StringIO.StringIO(text); with gzip.GzipFile(fileobj=infile, mode="r") as f: return f.read()Mnemonic
@fastmultiplication: or shorter: f = gzip.GzipFile(fileobj=StringIO.StringIO(text)); result = f.read(); f.close(); return resultButtonhole
Unfortunately, the question has been closed, so I can't make a new answer, but here is how to do this in Python 3.Yasmineyasu
Probably unrelated, but is compressing in memory first faster than compressing via the local disk?Josselyn
In Python 3: import zlib; my_string = "hello world"; my_bytes = zlib.compress(my_string.encode('utf-8')); my_hex = my_bytes.hex(); my_bytes2 = bytes.fromhex(my_hex); my_string2 = zlib.decompress(my_bytes2).decode('utf-8'); assert my_string == my_string2;Whole
copying and pasting this into 3.7 iPython fails with TypeError: string argument expected, got 'bytes'Niobe
S
73

The easiest way is the zlib encoding (Python 2 only, where s is a byte string of type str):

compressed_value = s.encode("zlib")

Then you decompress it with:

plain_string_again = compressed_value.decode("zlib")
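In Python 3, str.encode("zlib") is no longer available, but the same bytes-to-bytes codec can still be reached through the codecs module. The following is a minimal sketch, assuming UTF-8 text; the variable is_gzip is only there for illustration. Note that the result is a zlib stream, not a gzip file.

```python
import codecs

s = "This is mike number one, isn't this a lot of fun?"

# The zlib codec maps bytes to bytes, so encode the text first.
compressed_value = codecs.encode(s.encode("utf-8"), "zlib")
plain_string_again = codecs.decode(compressed_value, "zlib").decode("utf-8")

# A gzip stream starts with the magic bytes 1f 8b; this output does not,
# which is why the gzip command-line tool cannot read it.
is_gzip = compressed_value[:2] == b"\x1f\x8b"
```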
Septuagesima answered 14/12, 2011 at 15:18 Comment(8)
@Daniel: Yes, s is a Python 2.x object of type str.Septuagesima
See Standard Encodings for where he got that (scroll down to "codecs"). Also available: s.encode('rot13'), s.encode( 'base64' )Coursing
Note that this method is incompatible with the gzip command-line utility in that gzip includes a header and checksum, while this mechanism simply compresses the content.Nor
I know this is old but your line of code for decompressing should be: plain_string_again = compressed_value.decode("zlib")Motherwort
@minillinim: Yes, someone added this slightly wrong code to my answer. Feel free to fix it -- it doesn't matter it's old.Septuagesima
@BenjaminToueg: Python 3 is stricter about the distinction between Unicode strings (type str in Python 3) and byte strings (type bytes). str objects have an encode() method that returns a bytes object, and bytes objects have a decode() method that returns a str. The zlib codec is special in that it converts from bytes to bytes, so it doesn't fit into this structure. You can use codecs.encode(b, "zlib") and codecs.decode(b, "zlib") for a bytes object b instead.Septuagesima
How can I direct it into a file?Charleencharlemagne
Beware. This answer is wrong. It does not compress to the gzip format, as asked in the question.Patty
A
49

Python3 version of Sven Marnach's 2011 answer:

import gzip
exampleString = 'abcdefghijklmnopqrstuv' * 100 + '123'  # long, highly repetitive sample text
compressed_value = gzip.compress(bytes(exampleString, 'utf-8'))
plain_string_again = gzip.decompress(compressed_value).decode('utf-8')
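Because gzip is built on zlib, each module can read the other's output once zlib is told to use the gzip container. A hedged sketch of that relationship: the wbits value 16 + zlib.MAX_WBITS selects gzip framing, and the sample text and variable names are illustrative only.

```python
import gzip
import zlib

text = "abcdefghijklmnopqrstuv" * 50
raw = text.encode("utf-8")

# gzip-framed bytes produced by the gzip module...
gz = gzip.compress(raw)
# ...can be read by zlib when wbits requests the gzip container.
via_zlib = zlib.decompress(gz, 16 + zlib.MAX_WBITS).decode("utf-8")

# And the reverse: zlib can emit gzip framing that the gzip module reads.
compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
gz_from_zlib = compressor.compress(raw) + compressor.flush()
via_gzip = gzip.decompress(gz_from_zlib).decode("utf-8")
```

Plain zlib.compress without that wbits setting produces a zlib stream, which gzip.decompress will reject.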
Ana answered 8/2, 2019 at 14:40 Comment(4)
In Python 3 zlib is still used, gzip actually uses zlib, see: docs.python.org/3/library/zlib.html and docs.python.org/3/library/gzip.html#module-gzipDisembodied
My original answer was using zlib. Changed to gzip because that was the original question. You can easily replace gzip with zlib (search-and-replace) in my example, and it will work.Ana
gzip.decompress returns bytes, so call plain_string_again.decode('utf-8') to get a str objectDisconsider
Unlike Sven Marnach's answer, this answer is correct, in that it produces the gzip format.Patty
W
3

For those who want to compress a Pandas dataframe in JSON format:

Tested with Python 3.6 and Pandas 0.23

import sys
import zlib, lzma, bz2
import math
import pandas as pd

def convert_size(size_bytes):
    if size_bytes == 0:
        return "0B"
    size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])

dataframe = pd.read_csv('...') # your CSV file
dataframe_json = dataframe.to_json(orient='split')
data = dataframe_json.encode()
compressed_data = bz2.compress(data)
decompressed_data = bz2.decompress(compressed_data).decode()
dataframe_aux = pd.read_json(decompressed_data, orient='split')

#Original data size:  10982455 10.47 MB
#Encoded data size:  10982439 10.47 MB
#Compressed data size:  1276457 1.22 MB (lzma, slow), 2087131 1.99 MB (zlib, fast), 1410908 1.35 MB (bz2, fast)
#Decompressed data size:  10982455 10.47 MB
print('Original data size: ', sys.getsizeof(dataframe_json), convert_size(sys.getsizeof(dataframe_json)))
print('Encoded data size: ', sys.getsizeof(data), convert_size(sys.getsizeof(data)))
print('Compressed data size: ', sys.getsizeof(compressed_data), convert_size(sys.getsizeof(compressed_data)))
print('Decompressed data size: ', sys.getsizeof(decompressed_data), convert_size(sys.getsizeof(decompressed_data)))

print(dataframe.head())
print(dataframe_aux.head())
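One caveat when measuring: sys.getsizeof reports the size of the Python object including interpreter overhead, so len() on the bytes gives the actual payload size. Below is a small stdlib-only sketch of the same comparison; the records dictionary is a made-up stand-in for a real dataframe, so the numbers it prints are not comparable to the ones above.

```python
import bz2
import json
import lzma
import zlib

# Hypothetical stand-in for dataframe.to_json(orient='split').
records = {"columns": ["id", "name"],
           "data": [[i, "row-%d" % i] for i in range(1000)]}
data = json.dumps(records).encode("utf-8")

# len() gives the payload size in bytes, without object overhead.
sizes = {"original": len(data)}
for name, module in (("zlib", zlib), ("bz2", bz2), ("lzma", lzma)):
    compressed = module.compress(data)
    # Each codec must round-trip back to the identical bytes.
    assert module.decompress(compressed) == data
    sizes[name] = len(compressed)

for name, size in sizes.items():
    print(name, size)
```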
Waverly answered 24/8, 2018 at 10:30 Comment(0)
B
2

Martin Thoma's answer almost worked: I had to use BytesIO as mentioned in this answer.

from io import BytesIO  # Python 3.x, haven't tested 2.7
import gzip
out = BytesIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    # GzipFile works on bytes, so the text is written as a bytes literal
    f.write(b"This is mike number one, isn't this a lot of fun?")
out.getvalue()

The original code produced a TypeError: string argument expected, got 'bytes'
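The TypeError arises because gzip output is bytes, so the buffer has to be a BytesIO and the data written has to be bytes as well. If you would rather write a str directly, gzip.open in text mode can wrap the same in-memory buffer. A minimal sketch for Python 3, assuming UTF-8:

```python
import gzip
import io

out = io.BytesIO()
# "wt" wraps the gzip stream in a text layer that encodes str for us.
with gzip.open(out, "wt", encoding="utf-8") as f:
    f.write("This is mike number one, isn't this a lot of fun?")
compressed = out.getvalue()

# Reading back works the same way in "rt" mode.
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as f:
    text = f.read()
```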

Bovill answered 28/9, 2022 at 14:11 Comment(0)
O
-4
import gzip

s = "a long string of characters"

# gzip.open('filename', 'read/write mode', compression level); binary mode expects bytes
with gzip.open('gzipfilename.gz', 'wb', 5) as g:
    g.write(s.encode('utf-8'))
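If the string has already been compressed in memory, the resulting bytes can also be written to disk unchanged and will be a valid .gz file. A sketch using a temporary directory, with an arbitrary file name:

```python
import gzip
import os
import tempfile

s = "a long string of characters"
compressed = gzip.compress(s.encode("utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gzipfilename.gz")
    # The in-memory gzip bytes are already a complete .gz file.
    with open(path, "wb") as raw_file:
        raw_file.write(compressed)
    # Read it back through gzip.open to confirm it is a valid archive.
    with gzip.open(path, "rt", encoding="utf-8") as g:
        read_back = g.read()
```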
Oscillogram answered 15/12, 2011 at 1:21 Comment(1)
I guess the question was about compressing a string in memory without having to write it to disk in the process. Otherwise your answer is totally correct.Buttonhole

© 2022 - 2024 — McMap. All rights reserved.