Read file in chunks - RAM-usage, reading strings from binary files

I'd like to understand the difference in RAM usage between these two methods when reading a large file in Python.

Version 1, found here on Stack Overflow:

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open(file, 'rb')
for piece in read_in_chunks(f):
    process_data(piece)
f.close()

Version 2, I used this before I found the code above:

f = open(file, 'rb')
while True:
    piece = f.read(1024)
    process_data(piece)
f.close()

Both versions read the file piece by piece, so the current piece can be processed as soon as it has been read. In the second example, piece gets new content on every cycle, so I thought this would do the job without loading the complete file into memory.

But I don't really understand what yield does, and I'm pretty sure I got something wrong here. Could anyone explain that to me?


There is something else that puzzles me, besides the method used:

The content of the piece I read is defined by the chunk-size, 1KB in the examples above. But... what if I need to look for strings in the file? Something like "ThisIsTheStringILikeToFind"?

Depending on where in the file the string occurs, it could be that one piece contains the part "ThisIsTheStr" - and the next piece would contain "ingILikeToFind". Using such a method it's not possible to detect the whole string in any piece.

Is there a way to read a file in chunks and still find strings that are split across two pieces?

Gape answered 12/6, 2013 at 1:29 Comment(2)
You could write the first fragment as for chunk in iter(partial(f.read, chunk_size), b""): process_data(chunk) (assuming binary mode, with partial imported from functools). The answer to the last question is yes: just check whether the chunk ends with any of the string's prefixes and the next chunk starts with the corresponding suffix. - Fiorenze
Thank you for mentioning iter - didn't know that! About the second question: you mean I could check whether the piece ends with T or Th or Thi or This, and so on? Hmm, nice idea! Thanks! - Gape
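A minimal sketch of the boundary problem discussed in these comments. It uses a slightly different technique than the prefix/suffix check suggested above: it carries the last len(needle) - 1 bytes of each window over to the next read, so a match that straddles two chunks is still seen in one piece. The function name find_in_chunks and the test data are made up for illustration, and io.BytesIO stands in for a file opened in 'rb' mode.

```python
import io

def find_in_chunks(file_object, needle, chunk_size=1024):
    """Return the byte offset of the first occurrence of needle, or -1."""
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1   # longest part of needle that can end a window
    tail = b''                  # bytes carried over from the previous window
    offset = 0                  # file offset of the first byte of the window
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        index = window.find(needle)
        if index != -1:
            return offset + index
        # Keep only the bytes that could still be the start of a match.
        keep = window[-overlap:] if overlap else b''
        offset += len(window) - len(keep)
        tail = keep

# The needle starts at byte 1010, so it straddles the 1024-byte boundary.
needle = b'ThisIsTheStringILikeToFind'
data = b'x' * 1010 + needle + b'y' * 500
position = find_in_chunks(io.BytesIO(data), needle)
missing = find_in_chunks(io.BytesIO(data), b'NotInTheFile')
```

Memory use stays bounded by roughly chunk_size plus the length of the needle, because only the overlap is kept between reads.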

yield is the Python keyword that turns a function into a generator function. Calling such a function returns a generator, and each time the generator is advanced (for example by a for loop), execution resumes at the exact point where it left off at the previous yield. In terms of memory the two versions behave the same, since each holds only one chunk at a time; the only difference is that the first one uses a tiny bit more call stack space than the second. However, the first one is far more reusable, so from a program design standpoint it is actually the better choice.
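To make the "resumes where it left off" behaviour concrete, here is a small sketch that reuses read_in_chunks from the question. io.BytesIO is only a stand-in for a real file opened in 'rb' mode, and the 2500-byte payload is an arbitrary choice for illustration.

```python
import io

def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

source = io.BytesIO(b'a' * 2500)   # stand-in for open(file, 'rb')

gen = read_in_chunks(source)       # creates the generator, runs no code yet
position_before = source.tell()    # still 0: nothing has been read

first = next(gen)                  # runs the body up to the first yield
position_after = source.tell()     # 1024: exactly one chunk was read

# The for-style iteration resumes after the yield and reads the rest.
sizes = [len(first)] + [len(piece) for piece in gen]
```

Only one chunk exists at a time unless the caller keeps references to earlier pieces.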

EDIT: One other difference is that the first one stops reading once all the data has been read, the way it should. The second one never checks for the end of the file: f.read() returns an empty bytes object there, so the loop keeps calling process_data() with empty data and only stops once either f.read() or process_data() raises an exception. To make the second one work properly, you need to modify it like so:

f = open(file, 'rb')
while True:
    piece = f.read(1024)  
    if not piece:
        break
    process_data(piece)
f.close()
Addict answered 12/6, 2013 at 1:43 Comment(2)
Thanks for your answer! I understand that the first version is more reusable: it defines a function that could be useful in other projects, too. I guess the bigger "call stack space" results from creating a function? But there is no difference in the RAM usage of the file itself? I've found some documentation about generator functions. It's not that easy to understand when you have ordinary functions in mind all the time, but if I got this right, without yield the first version would return just the first piece of the file and the for-loop would cycle through the data of that piece? - Gape
If you liked my answer, could you mark it as the accepted answer? (you actually get 2 rep for doing that) - Addict
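On the RAM question raised in the comment above, one way to check it yourself is to measure peak allocations with the standard tracemalloc module. This is only a rough sketch: the helper names (peak_bytes, read_chunked, read_whole), the 8 MiB temporary file and the 64 KiB chunk size are assumptions made for the experiment, and tracemalloc counts Python-level allocations rather than the total memory of the process.

```python
import os
import tempfile
import tracemalloc

def peak_bytes(func, path):
    """Return the peak number of bytes allocated while func(path) runs."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def read_chunked(path, chunk_size=64 * 1024):
    with open(path, 'rb') as f:
        while True:
            piece = f.read(chunk_size)
            if not piece:
                break
            # a real program would call process_data(piece) here

def read_whole(path):
    with open(path, 'rb') as f:
        data = f.read()        # the complete file is held in memory at once
    return len(data)

size = 8 * 1024 * 1024         # 8 MiB test file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\0' * size)
    path = tmp.name
try:
    chunked_peak = peak_bytes(read_chunked, path)
    whole_peak = peak_bytes(read_whole, path)
finally:
    os.remove(path)

print(f"chunked: {chunked_peak} bytes, whole file: {whole_peak} bytes")
```

The chunked loop should peak at about a couple of chunks, whether it is written as a generator or as a plain while loop, while reading everything at once peaks at no less than the file size.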

Starting with Python 3.8, you can also use an assignment expression (the walrus operator):

with open('file.name', 'rb') as file:
    while chunk := file.read(1024):
        process_data(chunk)

The last chunk may be smaller than the requested 1024 bytes.

Because read() returns b"" once the whole file has been read, and an empty bytes object is falsy, the while loop terminates at the end of the file.

Procambium answered 24/8, 2020 at 14:41 Comment(1)
Thank you for this info! I'll need to look up this "walrus operator"; it might be helpful to know more about it. - Gape

I think probably the best and most idiomatic way to do this would be to use the built-in iter() function along with its optional sentinel argument to create an iterator, as shown below. Note that the last chunk might be smaller than the requested chunk size if the file size isn't an exact multiple of it.

from functools import partial

CHUNK_SIZE = 1024
filename = 'testfile.dat'

with open(filename, 'rb') as file:
    for chunk in iter(partial(file.read, CHUNK_SIZE), b''):
        process_data(chunk)

Update: I don't know when it was added, but almost exactly what's above is now shown as an example in the official documentation of the iter() function.

Collaboration answered 20/12, 2019 at 16:53 Comment(0)
