I have to parse thousands of text files (around 3,000 lines per file, each ~400KB) in a folder. I read them using readlines():
import os
import gzip

for filename in os.listdir(input_dir):
    path = os.path.join(input_dir, filename)
    if filename.endswith(".gz"):
        f = gzip.open(path, 'rb')
    else:
        f = open(path, 'rb')
    file_content = f.readlines()
    f.close()
    len_file = len(file_content)
    i = 0
    while i < len_file:
        line = file_content[i].split(delimiter)
        # ... my logic ...
        i += 1
This works completely fine for a sample of my input (50 or 100 files). When I ran it on the whole input of more than 5K files, though, the time taken was nowhere close to a linear increase. I did a performance analysis with cProfile. The time taken keeps growing at a faster-than-linear rate, reaching the worst rates when the input hits 7K files.
Here is the cumulative time taken by readlines(), first for 354 files (a sample of the input) and second for 7,473 files (the whole input):
ncalls   tottime  percall   cumtime  percall  filename:lineno(function)
   354     0.192    0.001     0.192    0.001  {method 'readlines' of 'file' objects}
  7473  1329.380    0.178  1329.380    0.178  {method 'readlines' of 'file' objects}
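For reference, this is roughly how I collected these numbers (my_parser.py is a stand-in for my actual script name):

    python -m cProfile -s cumulative my_parser.py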
Because of this, the time taken by my code does not scale linearly as the input grows. I read some doc notes on readlines(), where people have claimed that readlines() reads the whole file content into memory, and hence generally consumes more memory than readline() or read().
I agree with this point, but shouldn't the garbage collector automatically clear the loaded content from memory at the end of each loop iteration? In that case, at any instant, memory should hold only the contents of the file currently being processed, right? But there is some catch here. Can somebody give some insight into this issue?
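Here is a minimal example of what I mean (this is my assumption about CPython's reference counting, not something I have verified):

    # Build a large list, then rebind the name. My understanding is that the
    # old list's reference count drops to zero at the rebinding, so CPython
    # frees it immediately, without waiting for the cyclic garbage collector.
    file_content = [b'x' * 100 for _ in range(1000000)]  # roughly 100MB of lines
    file_content = []  # previous list should be reclaimed right here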
Is this an inherent behavior of readlines(), or a wrong interpretation of the Python garbage collector on my part? Glad to know.

Also, please suggest some alternative ways of doing the same thing in a memory- and time-efficient manner; one idea I had is sketched below. TIA.
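For instance, would iterating over the file object directly, instead of materializing every line with readlines(), be the kind of approach you would suggest? A sketch of what I have in mind (input_dir and delimiter as in my code above):

    import os
    import gzip

    for filename in os.listdir(input_dir):
        path = os.path.join(input_dir, filename)
        opener = gzip.open if filename.endswith(".gz") else open
        with opener(path, 'rb') as f:
            for line in f:  # streams one line at a time instead of loading all
                fields = line.split(delimiter)
                # ... my logic ...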
There's no reason to do len_file = len(file_content), then a while (i < len_file): loop with i += 1 and file_content[i] inside. Just use for line in file_content:. If you also need i for something else, use for i, line in enumerate(file_content). You're making things harder for yourself and your readers (and for the interpreter, which means your code may run slower, but that's usually much less important here). – Ageratum

Also, just write if filename.endswith(".gz"):; you don't need parentheses around the condition, and shouldn't use them. One of the great things about Python is how easy it is both to skim quickly and to read in-depth, but putting in those parentheses makes it much harder to skim (because you have to figure out whether there's a multi-line expression, a tuple, a genexp, or just code written by a C/Java/JavaScript programmer). – Ageratum
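A minimal sketch of the rewrite the first comment suggests, assuming the same file_content and delimiter as in the question:

    # enumerate() yields the index alongside each line, so there is no
    # manual index bookkeeping; behavior matches the original while loop.
    for i, line in enumerate(file_content):
        fields = line.split(delimiter)
        # ... my logic ..., with i still available when needed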