Read a large zipped text file line by line in python

Asked 14/7, 2012 at 8:48 Answered 11/6, 2022 at 10:29

I am trying to use the zipfile module to read a file in an archive. The uncompressed file is ~3GB and the compressed file is 200MB. I don't want either of them held in memory; I want to process the file line by line. So far I have noticed excessive memory use with the following code:

import zipfile
f = open(...)
z = zipfile.ZipFile(f)
for line in z.open(...).readlines():
    print(line)

I did it in C# using the SharpZipLib:

var fStream = File.OpenRead("...");
var unzipper = new ICSharpCode.SharpZipLib.Zip.ZipFile(fStream);
var dataStream =  unzipper.GetInputStream(0);

dataStream is uncompressed. I can't seem to find a way to do it in Python. Help will be appreciated.

Ossetic answered 14/7, 2012 at 8:48 Comment(0)

Python file objects provide iterators, which read line by line. file.readlines() reads every line and returns a list, which means the whole file has to be read into memory. The better approach (which should always be preferred over readlines()) is to loop over the object itself, e.g.:

import zipfile
with zipfile.ZipFile(...) as z:
    with z.open(...) as f:
        for line in f:
            print(line)

Note my use of the with statement - file objects are context managers, and the with statement lets us easily write readable code that ensures files are closed when the block is exited (even upon exceptions). This, again, should always be used when dealing with files.
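
To try the pattern end to end, here is a minimal sketch. It builds a tiny archive in memory so it can run anywhere; the names sample.txt and count_lines are made up for illustration and have nothing to do with the asker's actual file.

```python
import io
import zipfile


def count_lines(zip_source, member):
    """Count the lines of one archive member by iterating over it."""
    total = 0
    with zipfile.ZipFile(zip_source) as z:
        with z.open(member) as f:   # f is a binary file-like object
            for line in f:          # one line at a time, no readlines()
                total += 1
    return total


# Build a small archive in memory, purely for demonstration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("sample.txt", "first\nsecond\nthird\n")

line_count = count_lines(buf, "sample.txt")
print(line_count)
```

With a real file, zip_source would be the path to the archive instead of the in-memory buffer.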

Blindage answered 14/7, 2012 at 8:50 Comment(6)

couldn't say better than that – Fallacious 14/7, 2012 at 8:55

@Gareth Latty, is there documentation explaining what parameters the open function takes? I would like to see if I can set a memory buffer for open(), just like you can with "with open()". – Cord 21/7, 2020 at 19:22

The other thing I noticed is that z.open() does not seem to allow an r option. This comes into play when you need to run some logic in the for line in f: block. Example: if line.find("YES") != -1: print('yay'). This returns a TypeError. You have to put a b in front of the "YES" to make it work. – Pointless 6/1, 2021 at 23:58

@Pointless That's because you are getting back bytes, not a unicode string. Depending on the use case, you probably want to do something like decode it as UTF-8 to get a real string instead of just using byte strings. – Blindage 7/1, 2021 at 0:22

Ok. I'm having trouble finding where in the function to put the .decode(). Or do I wrap it around the function call? I'll experiment. – Pointless 7/1, 2021 at 0:29

Use io.TextIOWrapper, e.g. with io.TextIOWrapper(z.open(...), encoding='utf-8') as f: – Colorcast 22/1, 2021 at 5:53
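
The io.TextIOWrapper suggestion from the comments can be sketched roughly as follows, assuming the member is UTF-8 text. The helper name iter_text_lines and the file answers.txt are hypothetical, and the archive is created in memory only so the snippet is self-contained.

```python
import io
import zipfile


def iter_text_lines(zip_source, member, encoding="utf-8"):
    """Yield decoded lines (str, not bytes) from one archive member."""
    with zipfile.ZipFile(zip_source) as z:
        with z.open(member) as raw:
            # Wrap the binary stream so iteration yields str objects.
            with io.TextIOWrapper(raw, encoding=encoding) as text:
                for line in text:
                    yield line.rstrip("\n")


# Demonstration archive with made-up content.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("answers.txt", "NO\nYES please\nmaybe\n")

# Plain string comparisons now work; no b"YES" needed.
matches = [line for line in iter_text_lines(buf, "answers.txt") if "YES" in line]
print(matches)
```

Because the wrapper decodes on the fly, the file is still processed one line at a time.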

If the inner directory and the subdirectory filenames in the zipped file don't matter, you can try this:

from zipfile import ZipFile
from io import TextIOWrapper

def zip_open(filename):
    """Open the first member of a zip file as a text stream."""
    with ZipFile(filename) as zipfin:
        # The return inside the loop means only the first entry is used.
        for member in zipfin.namelist():
            return TextIOWrapper(zipfin.open(member))

# Usage of the zip_open function
with zip_open('myzipball.zip') as fin:
    for line in fin:
        print(line)

zip_open returns a text stream for the first entry in zipfin.namelist(), so it works well when the zip file contains a single file; if it contains several files, only the first one is read. I'm not sure whether the simple for loop over zipfin.namelist() works when the zipped file has a complex subdirectory structure, though.
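
For archives with several files or subdirectories, one possible variation on zip_open is to walk every entry and skip the directory entries. This is only a sketch under assumptions: iter_zip_lines is a made-up name, the content is assumed to be UTF-8, and the demonstration archive is built in memory.

```python
import io
import zipfile


def iter_zip_lines(zip_source, encoding="utf-8"):
    """Yield (member_name, line) for every file entry in the archive."""
    with zipfile.ZipFile(zip_source) as z:
        for info in z.infolist():
            if info.is_dir():        # skip directory entries such as "logs/"
                continue
            with io.TextIOWrapper(z.open(info), encoding=encoding) as text:
                for line in text:
                    yield info.filename, line.rstrip("\n")


# Demonstration archive with one directory entry and two files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("logs/", "")
    z.writestr("logs/a.txt", "one\ntwo\n")
    z.writestr("b.txt", "three\n")

rows = list(iter_zip_lines(buf))
print(rows)
```

Yielding the member name alongside each line keeps track of which file a line came from, which the single-stream zip_open cannot do.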

Legpull answered 11/6, 2022 at 10:29 Comment(0)
