Python os.walk memory issue

3

7

I wrote a scanner that looks for certain files on all hard drives of the system being scanned. Some of these systems are quite old, running Windows 2000 with 256 or 512 MB of RAM, but their file system structure is complex because some of them serve as file servers.

I use os.walk() in my script to parse all directories and files.

Unfortunately, we noticed that the scanner consumes a lot of RAM after some time of scanning, and we figured out that the os.walk function alone uses about 50 MB of RAM after 2 hours of walking the file system. The RAM usage keeps increasing over time: we were at about 90 MB after 4 hours of scanning.

Is there a way to avoid this behaviour? We also tried "betterwalk.walk()" and "scandir.walk()". The result was the same. Do we have to write our own walk function that removes already scanned directory and file objects from memory so that the garbage collector can remove them from time to time?
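
A hand-written walk of the kind asked about might look like the sketch below. It is only an illustration: the name walk_files is made up, it uses nothing but os.listdir and an explicit stack of pending directories, and whether it changes the memory growth described above is untested.

```python
import os


def walk_files(top):
    """Yield file paths under top using an explicit stack (hypothetical helper)."""
    pending = [top]
    while pending:
        current = pending.pop()
        try:
            names = os.listdir(current)
        except OSError:
            # Unreadable directory: skip it instead of aborting the scan.
            continue
        for name in names:
            path = os.path.join(current, name)
            # Symlinked directories are yielded as paths but not followed,
            # which avoids cycles (an assumption made for this sketch).
            if os.path.isdir(path) and not os.path.islink(path):
                pending.append(path)
            else:
                yield path
```

At any moment only the stack of directories still to visit and the listing of the current directory are referenced, so everything already scanned is free to be reclaimed.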

resource usage over time - second row is memory

Thanks

Syringa answered 29/6, 2014 at 7:54 Comment(6)
I know there was a memory leak in os.path.isdir, which is used in the os.walk implementation; you can read about it in this post. As far as I know, it was fixed in Python 3; see the leak report here. - Lodged
A workaround is to use a unicode path. - Such
Python version 2.7.4 contains the fix, so upgrading your Python version should also help. - Such
I use version 2.7.7 and the behaviour is still as I described. Maybe it is not the same issue? I will try to use the unicode representation. - Syringa
Can you reproduce it on Linux or OS X? - Luscious
Martijn said: "A workaround is to use a unicode path." But how? os.walk returns values ... Passing a unicode path string to os.walk does not change a thing. - Syringa
1

Have you tried the glob module?

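import glob
import os


def globit(search_dir):
    # List the entries directly inside search_dir ("*" skips dotfiles).
    for path in glob.glob(os.path.join(search_dir, "*")):
        print(path)
        # Only directories can contain further entries, so recurse into those.
        if os.path.isdir(path):
            globit(path)


if __name__ == '__main__':
    globit(r'C:\working')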
Ostrander answered 4/8, 2014 at 18:4 Comment(1)
Even better if you turn it into a generator. - Forsterite
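
A minimal sketch of the generator variant suggested in the comment, assuming the same glob-based traversal as the answer. The name iter_glob is made up for illustration; it hands paths to the caller one at a time instead of printing inside the recursion.

```python
import glob
import os


def iter_glob(search_dir):
    """Yield every path below search_dir, one at a time (hypothetical helper)."""
    for path in glob.glob(os.path.join(search_dir, "*")):
        yield path
        # Recurse only into real directories; symlinked directories are
        # skipped here to avoid cycles (an assumption, not in the answer).
        if os.path.isdir(path) and not os.path.islink(path):
            for sub_path in iter_glob(path):
                yield sub_path


if __name__ == '__main__':
    for found in iter_glob(r'C:\working'):
        print(found)
```

Note that glob.glob still builds a full list for each directory it visits, so only the recursion is lazy, and the "*" pattern does not match names that start with a dot.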
1

Inside the os.walk loop, del everything you no longer need, and try running gc.collect() at the end of every iteration.
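
This suggestion could be sketched roughly as follows. The function scan and its wanted_suffix parameter are hypothetical and stand in for whatever the real scanner does with each file.

```python
import gc
import os


def scan(top, wanted_suffix='.log'):
    """Walk top and return how many file names end with wanted_suffix."""
    matches = 0
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.endswith(wanted_suffix):
                matches += 1
        # Drop this iteration's references, then ask the collector to run.
        del root, dirs, files
        gc.collect()
    return matches
```

CPython already frees most objects through reference counting as soon as the last reference disappears, so gc.collect() only helps with reference cycles. Calling it on every iteration is also slow on large trees; running it every few hundred directories may be a reasonable compromise.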

Cavit answered 4/8, 2014 at 20:11 Comment(0)
0

Generators are a better solution because they compute results lazily. Here is one example implementation.

import fnmatch
import os


def list_dir(path):
    # Optional helper: yield the full path of each entry in `path`.
    for name in os.listdir(path):
        yield os.path.join(path, name)


def os_walker(top, pattern='*.py'):
    # Yield, one at a time, the files under `top` whose names match `pattern`.
    for root, dlist, flist in os.walk(top):
        for name in fnmatch.filter(flist, pattern):
            yield os.path.join(root, name)


# os.walk yields nothing for entries that are not directories,
# so files directly inside the top-level folder are skipped here.
for entry in list_dir("D:\\tuts\\pycharm"):
    for name in os_walker(entry):
        print(name)

Thanks to David Beazley

Maharaja answered 8/7, 2017 at 14:19 Comment(0)
