Quicker to os.walk or glob?
I'm messing around with file lookups in Python on a large hard disk, using os.walk and glob. I usually use os.walk, as I find it much neater and it seems to be quicker (for usual-size directories).

Has anyone got experience with both and could say which is more efficient? As I say, glob seems to be slower, but you can use wildcards etc., whereas with walk you have to filter results yourself. Here is an example of looking up core dumps:

import os
import re

core = re.compile(r"core\.\d*")
for root, dirs, files in os.walk("/path/to/dir/"):
    for file in files:
        if core.search(file):
            path = os.path.join(root, file)
            print "Deleting: " + path
            os.remove(path)

Or

import os
from glob import iglob

for file in iglob("/path/to/dir/core.*"):
    print "Deleting: " + file
    os.remove(file)
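(For later readers: on Python 3.5+ glob can recurse by itself, so the walk-and-filter loop is no longer needed for this task. A minimal sketch, with an illustrative path:)

```python
import glob
import os

# Python 3.5+: with recursive=True, "**" matches any number of
# nested directories, so this covers the whole tree.
for path in glob.iglob("/path/to/dir/**/core.*", recursive=True):
    print("Deleting: " + path)
    os.remove(path)
```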
Hussy answered 19/1, 2012 at 18:10 Comment(1)
Sounds like premature optimization to me. I glanced at the source (hg.python.org/cpython/file/d01208ba482f/Lib/glob.py and hg.python.org/cpython/file/d01208ba482f/Lib/os.py) and see that both functions rely on os.listdir and os.isdir, so my gut tells me you won't gain much one way or the other. (However, as pointed out in two of the answers below, os.walk recurses over subdirectories and glob.iglob doesn't, so it doesn't make sense to compare them.) If you do end up with a performance issue, profile a couple of approaches. Otherwise, just write clear code.Herzig

I ran a benchmark on a small cache of web pages spread over 1000 directories. The task was to count the total number of files in those directories. The output is:

os.listdir: 0.7268s, 1326786 files found
os.walk: 3.6592s, 1326787 files found
glob.glob: 2.0133s, 1326786 files found

As you can see, os.listdir is the quickest of the three, and glob.glob is still quicker than os.walk for this task.

The source:

import os, time, glob

n, t = 0, time.time()
for i in range(1000):
    n += len(os.listdir("./%d" % i))
t = time.time() - t
print "os.listdir: %.4fs, %d files found" % (t, n)

n, t = 0, time.time()
for root, dirs, files in os.walk("./"):
    for file in files:
        n += 1
t = time.time() - t
print "os.walk: %.4fs, %d files found" % (t, n)

n, t = 0, time.time()
for i in range(1000):
    n += len(glob.glob("./%d/*" % i))
t = time.time() - t
print "glob.glob: %.4fs, %d files found" % (t, n)
Horripilation answered 19/12, 2014 at 11:43 Comment(4)
Isn't os.walk lazy (generator) while glob will create a large list in-memory?Magen
This does not run through the file tree recursively.Audiophile
glob.iglob will return a generator, python 2 docs.python.org/2/library/glob.html#glob.iglob, python 3 docs.python.org/3/library/glob.html#glob.iglobKrohn
This is fixed for os.walk in Python 3.5+, as mentioned here: docs.python.org/3/library/os.html#os.walk This function now calls os.scandir() instead of os.listdir(), making it faster by reducing the number of calls to os.stat().Inbreed
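(As the last comment notes, os.walk uses os.scandir() on Python 3.5+. For comparison, a minimal sketch of counting files with os.scandir directly; the function name is illustrative:)

```python
import os

def count_files(top):
    """Recursively count regular files under top using os.scandir."""
    total = 0
    with os.scandir(top) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                total += count_files(entry.path)
            elif entry.is_file(follow_symlinks=False):
                total += 1
    return total

print(count_files("."))
```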

You can use os.walk and still use glob-style matching.

import fnmatch
import os

for root, dirs, files in os.walk(DIRECTORY):
    for file in files:
        if fnmatch.fnmatch(file, PATTERN):
            print file

Not sure about speed, but since os.walk is recursive and glob is not, they obviously do different things.

Utas answered 19/1, 2012 at 18:25 Comment(0)


I think even with glob you would still have to use os.walk, unless you know in advance how deep your subdirectory tree is.

By the way, the glob documentation says:

"*, ?, and character ranges expressed with [] will be correctly matched. This is done by using the os.listdir() and fnmatch.fnmatch() functions "

I would simply go with:

import fnmatch
import os
import shutil

for root, subdirs, files in os.walk(path):
    for name in fnmatch.filter(files, search_str):
        shutil.copy(os.path.join(root, name), dest)
Irrelevant answered 26/4, 2014 at 4:49 Comment(0)

The part that never seems to be mentioned when comparing os.walk with the other methods is that they have different functionality. os.scandir is the base implementation and has the fastest raw performance for getting at the contents of a directory. os.listdir gives a simple list of directory contents. glob is a (potentially recursive) pattern-matching operation. os.walk is a recursive traversal of the directory tree, with complete user control:

  • the user can decide the traversal order with topdown=True|False
  • the user can prune the tree during traversal (with topdown=True) by removing directories from the list of subdirectories for any given directory
  • the user can perform arbitrary actions on files and directories during traversal
  • combining control of traversal order with file removal, the user can walk with topdown=False, remove files, and then remove each directory once it is empty

The power of walk as a general-purpose tool has a cost. If all you need is listdir or glob, use those rather than walk. But when you need the power of walk, the others just will not do.
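(The pruning and bottom-up cleanup described above can be sketched as follows; the directory names are illustrative:)

```python
import os

# Top-down: prune subtrees by editing dirs in place, so os.walk
# never descends into them.
for root, dirs, files in os.walk("/path/to/dir", topdown=True):
    dirs[:] = [d for d in dirs if d != ".git"]  # skip .git subtrees
    for name in files:
        print(os.path.join(root, name))

# Bottom-up: children are visited before their parents, so empty
# directories can be removed as the walk unwinds.
for root, dirs, files in os.walk("/path/to/dir", topdown=False):
    if not dirs and not files:
        os.rmdir(root)
```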

Foghorn answered 16/6, 2023 at 19:6 Comment(0)

Don't waste your time on optimization before measuring/profiling. Focus on making your code simple and easy to maintain.

For example, in your code you precompile the regular expression, which does not give you any speed boost, because the re module keeps an internal cache (re._cache) of compiled patterns.
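(The caching referred to above is a CPython implementation detail, but it is easy to see that the two styles are equivalent in result; a small sketch:)

```python
import re

pat = r"core\.\d*"

# The module-level functions compile the pattern once and cache it
# (re._cache in CPython), so for a small, fixed set of patterns
# they behave like explicit precompilation:
m1 = re.search(pat, "core.42")          # compiled on first use, then cached
m2 = re.compile(pat).search("core.42")  # explicit precompilation
assert m1.group() == m2.group() == "core.42"
```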

  1. Keep it simple.
  2. If it's slow, profile.
  3. Once you know exactly what needs to be optimized, make the tweaks, and always document them.

Note that an optimization made several years earlier can make code run slower than the "non-optimized" version. This applies especially to modern JIT-based languages.

Cardio answered 19/1, 2012 at 18:29 Comment(4)
-1. OP mentioned a "large disk". Also, the code is obviously simple already. Moreover, OP seems to be at the stage of optimizing. It's a plague on SO to discard questions related to performance with something like "premature optimizations are root of blabla" (which are actually misquotations of Knuth).Widthwise
-1 optimization is important in the real (professional) world, where things are often at a very large scale. don't just blindly diss optimization without any rational reasonOosphere
Premature optimization IS stupid. It makes code almost always harder to maintain and sometimes even makes it to perform worse. I don't say this is the case, but it may be.Village
Made no sense here. Nonsense. Optimization here is of course important.Millisent
