I'd like to understand the difference in RAM usage between these two methods when reading a large file in Python.
Version 1, found here on Stack Overflow:
def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open(file, 'rb')
for piece in read_in_chunks(f):
    process_data(piece)
f.close()
Version 2, which I used before I found the code above:
f = open(file, 'rb')
while True:
    piece = f.read(1024)
    process_data(piece)
f.close()
Both versions read the file piece by piece, and each piece can be processed as it is read. In the second example, piece gets new content on every iteration, so I thought this would do the job without loading the complete file into memory. But I don't really understand what yield does, and I'm pretty sure I got something wrong here. Could anyone explain that to me?
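For context, here is a tiny self-contained sketch of what yield does (the function name is made up for illustration): a function containing yield becomes a generator, which hands back one value at a time and pauses in between, so only the current value (or chunk) needs to exist in memory.

def count_up_to(n):
    i = 0
    while i < n:
        yield i  # pause here, hand back i, resume from this point on the next iteration
        i += 1

for number in count_up_to(3):
    print(number)  # prints 0, then 1, then 2 -- values are produced one at a time, never stored as a list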
There is something else that puzzles me, besides the method used: the content of each piece is defined by the chunk size, 1 KB in the examples above. But what if I need to look for a string in the file, something like "ThisIsTheStringILikeToFind"?
Depending on where the string occurs in the file, one piece could contain the part "ThisIsTheStr" and the next piece "ingILikeToFind". With such a method it's not possible to detect the whole string in any single piece.
Is there a way to read a file in chunks, but still account for such strings?
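To illustrate the concern with a toy example (the data and chunk size here are made up):

data = b"xxxxThisIsTheStringILikeToFindxxxx"
chunk_size = 16
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
print(chunks)
# [b'xxxxThisIsTheStr', b'ingILikeToFindxx', b'xx']
print(any(b"ThisIsTheStringILikeToFind" in c for c in chunks))  # False -- the match is split across two chunks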
for chunk in iter(partial(f.read, chunk_size), b""): process_data(chunk)
(assume binary mode). The answer to the last question is yes: just check whether the chunk ends with any of the string's prefixes and the next chunk starts with the corresponding suffix. – Fiorenzeiter
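A sketch of that one-liner with its import written out, reusing the file and process_data names from the question's own code:

from functools import partial

chunk_size = 1024
with open(file, 'rb') as f:
    # iter(callable, sentinel) calls f.read(chunk_size) repeatedly and stops when it returns b""
    for chunk in iter(partial(f.read, chunk_size), b""):
        process_data(chunk)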
iter(partial(f.read, chunk_size), b"") – didn't know that! About the second question: you mean I could check if the piece ends with T or Th or Thi or This and so on? Hmm, nice idea! Thanks! – Gape