Python readlines() usage and efficient practice for reading

I need to parse thousands of text files (around 3000 lines in each file, ~400KB in size) in a folder. I read them using readlines:

import os
import gzip

for filename in os.listdir(input_dir):
    if filename.endswith(".gz"):
        f = gzip.open(filename, 'rb')
    else:
        f = open(filename, 'rb')

    # readlines() loads the whole file into a list of lines
    file_content = f.readlines()
    f.close()

    len_file = len(file_content)
    i = 0
    while i < len_file:
        line = file_content[i].split(delimiter)
        ... my logic ...
        i += 1

This works completely fine for samples of my input (50-100 files). When I ran it on the whole input of more than 5K files, the time taken was nowhere close to a linear increase. I planned to do a performance analysis and ran it under cProfile. The time taken per file keeps getting worse as the number of files grows, reaching its worst rates when the input got to 7K files.

Here is the cumulative time taken by readlines, first -> 354 files (a sample of the input) and second -> 7473 files (the whole input):

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    354    0.192    0.001    0.192    0.001 {method 'readlines' of 'file' objects}
   7473 1329.380    0.178 1329.380    0.178 {method 'readlines' of 'file' objects}

Because of this, the time taken by my code does not scale linearly as the input grows. I read some doc notes on readlines(), where people have claimed that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

I agree with this point, but shouldn't the garbage collector automatically clear that loaded content from memory at the end of each loop iteration, so that at any instant memory holds only the contents of the file currently being processed? But there seems to be some catch here. Can somebody give some insight into this issue?

Is this an inherent behavior of readlines(), or am I misinterpreting the Python garbage collector? Glad to know.

Also, please suggest some alternative ways of doing the same thing in a memory- and time-efficient manner. TIA.

Clown asked 22/6, 2013 at 0:48 Comment(4)
As a side note, there is never a good reason to write len_file = len(file_content), then a while( i < len_file ): loop with i += 1 and file_content[i] inside. Just use for line in file_content:. If you also need i for something else, use for i, line in enumerate(file_content). You're making things harder for yourself and your readers (and for the interpreter, which means your code may run slower, but that's usually much less important here).Ageratum
Thanks @abarnert. I'll change them.Clown
One last style note: In Python, you can just write if filename.endswith(".gz"):; you don't need parentheses around the condition, and shouldn't use them. One of the great things about Python is how easy it is both to skim quickly and to read in-depth, but putting in those parentheses makes it much harder to skim (because you have to figure out whether there's a multi-line expression, a tuple, a genexp, or just code written by a C/Java/JavaScript programmer).Ageratum
Nice tip, duly noted. Will change them as well.Clown

The short version is: The efficient way to use readlines() is to not use it. Ever.


I read some doc notes on readlines(), where people have claimed that readlines() reads the whole file content into memory and hence generally consumes more memory compared to readline() or read().

The documentation for readlines() explicitly guarantees that it reads the whole file into memory, and parses it into lines, and builds a list full of strings out of those lines.

But the documentation for read() likewise guarantees that it reads the whole file into memory, and builds a string, so that doesn't help.


On top of using more memory, this also means you can't do any work until the whole thing is read. If you alternate reading and processing in even the most naive way, you will benefit from at least some pipelining (thanks to the OS disk cache, DMA, CPU pipeline, etc.), so you will be working on one batch while the next batch is being read. But if you force the computer to read the whole file in, then parse the whole file, then run your code, you only get one region of overlapping work for the entire file, instead of one region of overlapping work per read.


You can work around this in three ways:

  1. Write a loop around readlines(sizehint), read(size), or readline().
  2. Just use the file as a lazy iterator without calling any of these.
  3. mmap the file, which allows you to treat it as a giant string without first reading it in (a sketch appears after the examples below).

For example, this has to read all of foo at once:

with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass

But this only reads about 8K at a time:

with open('foo') as f:
    while True:
        lines = f.readlines(8192)
        if not lines:
            break
        for line in lines:
            pass
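
A read(size) loop from option 1 looks much the same, except that you have to stitch lines back together across chunk boundaries yourself. Here's a minimal sketch of that idea, assuming newline-terminated lines:

with open('foo', 'rb') as f:
    leftover = b''
    while True:
        chunk = f.read(8192)
        if not chunk:
            if leftover:
                pass              # process the final line, which had no trailing newline
            break
        lines = (leftover + chunk).split(b'\n')
        leftover = lines.pop()    # the last piece may be an incomplete line
        for line in lines:
            pass                  # process one complete line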

And this only reads one line at a time—although Python is allowed to (and will) pick a nice buffer size to make things faster.

with open('foo') as f:
    while True:
        line = f.readline()
        if not line:
            break
        pass

And this will do the exact same thing as the previous:

with open('foo') as f:
    for line in f:
        pass
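
And for option 3, here's a minimal mmap sketch (it assumes the file is non-empty, since you can't mmap an empty file). Nothing is read up front; the OS pages the data in lazily as you touch it:

import mmap

with open('foo', 'rb') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # mmap objects support find(), slicing, and readline(), so you can
        # walk the mapping line by line without ever calling read()
        for line in iter(m.readline, b''):
            pass
    finally:
        m.close()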

Meanwhile:

but shouldn't the garbage collector automatically clear that loaded content from memory at the end of each loop iteration, so that at any instant memory holds only the contents of the file currently being processed?

Python doesn't make any such guarantees about garbage collection.

The CPython implementation happens to use refcounting for GC, which means that in your code, as soon as file_content gets rebound or goes away, the giant list of strings, and all of the strings within it, will be freed to the freelist, meaning the same memory can be reused again for your next pass.
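
If you want to see that immediate reclamation for yourself, here's a tiny sketch (Probe is a made-up class, purely for illustration):

class Probe(object):
    def __init__(self, name):
        self.name = name
    def __del__(self):
        print('freed %s' % self.name)

x = Probe('first')
x = Probe('second')     # rebinding x drops 'first' to zero references: 'freed first' prints here
print('still running')  # ...not at some later GC pass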

However, all those allocations, copies, and deallocations aren't free—it's much faster to not do them than to do them.

On top of that, having your strings scattered across a large swath of memory instead of reusing the same small chunk of memory over and over hurts your cache behavior.

Plus, while the memory usage may be constant (or, rather, linear in the size of your largest file, rather than in the sum of your file sizes), that rush of mallocs to expand it the first time will be one of the slowest things you do (which also makes it much harder to do performance comparisons).
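
If you want a rough feel for how much is live at once under each approach, here's a minimal sketch using sys.getsizeof ('foo' stands in for one of your ~400KB files; the numbers are approximate, since getsizeof only counts each object's own footprint):

import sys

with open('foo', 'rb') as f:
    lines = f.readlines()
# the list plus every line string exists at the same time
all_at_once = sys.getsizeof(lines) + sum(sys.getsizeof(line) for line in lines)

with open('foo', 'rb') as f:
    # iterating holds roughly one line (plus a read buffer) at a time
    one_at_a_time = max(sys.getsizeof(line) for line in f)

print(all_at_once, one_at_a_time)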


Putting it all together, here's how I'd write your program:

for filename in os.listdir(input_dir):
    with open(filename, 'rb') as f:
        if filename.endswith(".gz"):
            f = gzip.GzipFile(fileobj=f)
        words = (line.split(delimiter) for line in f)
        ... my logic ...  

Or, maybe:

for filename in os.listdir(input_dir):
    if filename.endswith(".gz"):
        f = gzip.open(filename, 'rb')
    else:
        f = open(filename, 'rb')
    with contextlib.closing(f):
        words = (line.split(delimiter) for line in f)
        ... my logic ...
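
One note on both versions: words is a lazy generator expression, so the "my logic" part should iterate over it directly (for example, for fields in words: ...) rather than converting it to a list first, or you're right back to holding the whole file in memory at once.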
Ageratum answered 22/6, 2013 at 0:55 Comment(6)
I should have mentioned this earlier. My input directory might contain gzip files and also normal text files - so for opening the files I'm using an if/else construct. I'm afraid this 'with' might not work out.Clown
@Learner: Sure it will: with open('foo', 'rb') as f:, then you can create a GzipFile(fileobj=f) if necessary (or an io.TextIOWrapper if it's a text file you want decoded to unicode, or a csv.reader if it's a CSV file you want decoded to rows, etc.). At any rate, the with part isn't relevant here; all of the options are exactly the same options with explicit close, except more verbose and less robust.Ageratum
I'm not sure I understood the io.TextIOWrapper part. Any links to follow? TIA :)Clown
@Learner: I'm assuming you're using Python 2, yes? If so, the reference docs are here, and the way to learn is… read the differences between Python 2 text files and Python 3 text files (maybe start here); io.TextIOWrapper turns the former into the latter, so you can write clean Py3-style code that only deals with unicode objects, not encoded bytes, even in Py2.Ageratum
Thanks @abarnert, I used the last method you showed, with contextlib.closing() - it worked great. The time taken dropped, and the program now scales linearly in time even as the inputs grow :)Clown
@Learner: Glad it helped. closing isn't useful that often—most of the time, you've just got a file or something else that can be used directly in a with statement—but it is handy to know for cases like this. Anyway, the important part (the part that sped up your code) is using the file (or GzipFile) directly as an iterable, instead of readline()-ing the whole thing into memory to use the list as an iterable, as Óscar López explained before me.Ageratum

Read line by line, not the whole file:

for line in open(file_name, 'rb'):
    # process line here

Even better, use with to automatically close the file:

with open(file_name, 'rb') as f:
    for line in f:
        # process line here

The above will read the file object using an iterator, one line at a time.

Brew answered 22/6, 2013 at 0:49 Comment(17)
I understand what you mean by this. I would like to know the root cause as well. What is the reason behind readlines() being slow? Thanks!Clown
That readlines will read the whole file at once into a list, which can be a problem if it's big - it'll use a lot of memory!Inessa
Yeah, but given that my file is 400KB (<0.5MB) and will be discarded from memory at the end of every iteration, readlines() reading the whole file shouldn't be a problem, right?Clown
But anyway you'll be creating a lot of potentially big lists that get discarded immediately, but not really freed from memory until the next run of the garbage collector. In Python, the preferred style is using iterators, generator expressions, etc. - never create a new, big object when you can process little chunks of it at a timeInessa
@ÓscarLópez: Actually, at least in CPython, the GC generally frees up memory (not back to the OS, but to the internal free list) as soon as the name referencing it gets rebound or goes away, so that first part isn't really an issue. But your larger point is 100% right. Iterators make everything better.Ageratum
Oh! So they will still be consuming part of my program's memory and hence will slow things down as the number of files increases? Also, will this be cleared only after my program ends running?Clown
@Learner: No, that's probably not the problem.Ageratum
Yes, you'll be consuming memory and eventually you'll start paging to disk if the physical memory runs out. And no, the GC is not deterministic, so you can't tell when the memory is going to be freed - in fact, part of the reason for the slowdown could be the GC running.Inessa
@ÓscarLópez: Yes, the GC is deterministic in the CPython implementation, which the OP is almost certainly using (since he would have said Jython or Iron or PyPy if he were using them).Ageratum
@Ageratum can you please provide a reference stating that it is, in fact, deterministic?Inessa
@ÓscarLópez: docs.python.org/2/c-api/intro.html#reference-counts documents how the refcounting works. (The documentation on cycle breaking is elsewhere, but not relevant here.) The proof that it's deterministic is trivial: a pure refcounting GC is deterministic by definition (and a refcounting-plus-cycle-breaking GC is likewise deterministic when there are no cycles).Ageratum
@ÓscarLópez: Do you really not believe that CPython is refcounted, or that refcounting is deterministic, or are you just being a stickler here?Ageratum
Paging is one problem I was expecting this to run into.Clown
@Ageratum I'm just curious. The link you provided is interesting, but I'm left wondering how frequently the GC runs. Sure, the refcounting algorithm is deterministic, but can we predict when it will run? That's why I'm saying it's non-deterministic - you don't know when it'll reclaim memory.Inessa
@ÓscarLópez: The whole point of refcounting is that it doesn't have to run. Every time a reference goes away (e.g., a name is rebound or goes out of scope), the count on the referenced object is decreased, and if it reaches 0, the object is reclaimed immediately. (The cycle detector is another, more complicated story, but again, it's not relevant here, because there are no cycles in the OP's code.) The Wikipedia article explains it pretty well.Ageratum
@Ageratum thanks for clarifying that, I learnt something new :)Inessa
@ÓscarLópez: Just keep in mind that this is only a feature of CPython, not all Python implementations (e.g., Jython doesn't refcount, and relies on the Java generational GC), and that even with CPython it's not always obvious when there are cycles (especially in an interactive session or the debugger), so your habits of using with whenever possible, etc. are definitely worth keeping.Ageratum
