Knowing which package to use and what the corresponding docs are can be a bit confusing, as there appear to be several Python bindings to the actual Zstandard library.
Below, I am referring to the library by Gregory Szorc, which I installed from conda's default channel with:

```shell
conda install zstd
# check:
conda list zstd
# # Name    Version    Build       Channel
# zstd      1.5.5      hc292b87_0
```
(even though the docs say to install with pip, which I don't unless there is no other way, as I like my conda environments to remain usable).
I am only inferring that this version is the one from G. Szorc, based on the comments in the `__init__.py` file:
```python
# Copyright (c) 2017-present, Gregory Szorc
# All rights reserved.
#
# This software may be modified and distributed under the terms
# of the BSD license. See the LICENSE file for details.

"""Python interface to the Zstandard (zstd) compression library."""

from __future__ import absolute_import, unicode_literals

# This module serves 2 roles:
#
# 1) Export the C or CFFI "backend" through a central module.
# 2) Implement additional functionality built on top of C or CFFI backend.
```
Thus, I think that the corresponding documentation is here.
In any case, a quick test after install:

```python
import zstandard as zstd

with zstd.open('test.zstd', 'w') as f:
    for i in range(10_000):
        f.write(f'foo {i} bar\n')

with zstd.open('test.zstd', 'r') as f:
    for i, line in enumerate(f):
        if i % 1000 == 0:
            print(f'line {i:4d}: {line}', end='')
```
Produces:

```
line    0: foo 0 bar
line 1000: foo 1000 bar
line 2000: foo 2000 bar
line 3000: foo 3000 bar
line 4000: foo 4000 bar
line 5000: foo 5000 bar
line 6000: foo 6000 bar
line 7000: foo 7000 bar
line 8000: foo 8000 bar
line 9000: foo 9000 bar
```
Notes:
- If the file was written in binary (not text), then use `mode='rb'`, same as for a regular file. The underlying file is always written in binary mode, but if we use text mode for `open`, then, according to `open`'s doc, we get "(...) an `io.TextIOWrapper` if opened for reading or writing in text mode".
- Notice that I use the iterator of `f`, not `readlines()`. From the inline docstring, `readlines()` sounds like it returns a list of all lines from the file, i.e. the whole thing is slurped into memory. With the iterator, it is more likely that only portions of the file are in memory at any moment (in `zstd`'s buffer).
- Reading this part of the docs, however, I am less sure of the above. Stay tuned... (Edit: tested empirically, it holds; see below.)
Addendum
About notes 2 and 3 above: I tested empirically, by increasing the number of lines to 100 million and comparing the memory usage of two versions (using htop):
Streaming version
```python
with zstd.open('test.zstd', 'r') as f:
    for i, line in enumerate(f):
        if i % 10_000_000 == 0:
            print(f'line {i:8d}: {line}', end='')
```
Result: no bump in memory usage.
Readlines version
```python
with zstd.open('test.zstd', 'r') as f:
    for i, line in enumerate(f.readlines()):
        if i % 10_000_000 == 0:
            print(f'line {i:8d}: {line}', end='')
```
Result: memory usage bumps up by a few GB.
This may be specific to the version installed (1.5.5).
> if the file was written in binary

Isn't compressed data always in binary? – Erythroblastosis