How can I traverse a file system with a generator?
I'm trying to create a utility class for traversing all the files in a directory, including those within subdirectories and sub-subdirectories. I tried to use a generator because generators are cool; however, I hit a snag.


def grab_files(directory):
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isdir(full_path):
            yield grab_files(full_path)
        elif os.path.isfile(full_path):
            yield full_path
        else:
            print('Unidentified name %s. It could be a symbolic link' % full_path)

When the generator reaches a directory, it simply yields the memory location of the new generator; it doesn't give me the contents of the directory.

How can I make the generator yield the contents of the directory instead of a new generator?

If there's already a simple library function to recursively list all the files in a directory structure, tell me about it. I don't intend to replicate a library function.

Imperceptive answered 9/11, 2009 at 1:0 Comment(0)
S
65

Why reinvent the wheel when you can use os.walk?

import os

path = '.'  # directory to start from
for root, dirs, files in os.walk(path):
    for name in files:
        print(os.path.join(root, name))

os.walk is a generator that walks a directory tree either top-down or bottom-up and, for each directory it visits, yields a (dirpath, dirnames, filenames) tuple.
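
To get the flat stream of file paths the question asks for, the loop above can be wrapped in a generator of its own. This is only a minimal sketch; the name iter_files is hypothetical and not part of the os module:

```python
import os

def iter_files(top):
    """Lazily yield the full path of every file below top."""
    # os.walk produces one (dirpath, dirnames, filenames) tuple per directory.
    for root, dirs, files in os.walk(top):
        for name in files:
            yield os.path.join(root, name)
```

It can then be consumed like any other generator, for example with for path in iter_files('.'): print(path).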

Susie answered 9/11, 2009 at 1:7 Comment(11)
But then again, by reinventing the wheel we could os.cycle rather than os.walk...Mere
I think it's a joke... "reinventing the wheel"? Walking vs. cycling? Pretty good.. :)Reprove
Yes, Ned, a joke. The suggestion to os.walk() is the way-to-go, unless one is merely trying to learn about generators and uses directory traversal as a practical exercise for it.Mere
@Ned: I literally just facepalmed.Internationalist
os.walk might be a generator, but its granularity is the directory level, and the files it returns for each directory come as a list. If you have a directory with millions of files in it, good luck using os.walk. At least this is true in 2.7.Dantedanton
In addition to what woot pointed out, os.walk also sorts symbolic links into either the directory or file list based on the files they point to. This is fine much of the time, but not if you are trying to operate on the links instead of the linked-to files.Leopold
@Dantedanton - that's exactly why I'm here - I'm trying to split my million files into subdirectories (git object style), but os.listing is taking forever...Gabbi
@Gabbi Look at using scandir: github.com/benhoyt/scandir for Python 2.x; otherwise it is built into Python 3.x, I think.Dantedanton
Yes, scandir is built into 3 and returns an iterator. listdir still returns a list.Bed
walk() internally uses listdir(), so that loses the advantages of using a generator like walk() in the first place. Just use scandir(). Source: hg.python.org/cpython/file/29f0836c0456/Lib/os.py#l276Profession
To be clear, what I said above only applies to Python 2 and versions of Python 3 before 3.5. The maintained versions of Python no longer have that issue with walk(): github.com/python/cpython/blob/…Profession
R
16

As of Python 3.4, you can use the glob() method from the built-in pathlib module:

import pathlib
p = pathlib.Path('.')
list(p.glob('**/*'))    # lists all files and directories recursively
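
Wrapping the call in list() reads the whole tree up front, and the results include directories as well as files. A minimal sketch that stays lazy and keeps only regular files follows; iter_files is a hypothetical helper name, not something pathlib provides:

```python
from pathlib import Path

def iter_files(top="."):
    """Lazily yield every regular file below top as a Path object."""
    # rglob('*') is equivalent to glob('**/*').
    for p in Path(top).rglob("*"):
        if p.is_file():
            yield p
```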
Rim answered 22/4, 2017 at 18:45 Comment(1)
Just to confirm, type(p.glob('**/*')) indeed returns generator.Wraf
K
15

I agree with the os.walk solution.

For purely pedagogical purposes, try iterating over the generator object instead of yielding it directly:


import os

def grab_files(directory):
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isdir(full_path):
            # Iterate over the nested generator and re-yield each entry.
            for entry in grab_files(full_path):
                yield entry
        elif os.path.isfile(full_path):
            yield full_path
        else:
            print('Unidentified name %s. It could be a symbolic link' % full_path)
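
On Python 3.3 and later, the inner for loop can be replaced by yield from, which delegates to the nested generator. A sketch of the same function under that assumption (the else branch that prints unidentified names is left out here for brevity):

```python
import os

def grab_files(directory):
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if os.path.isdir(full_path):
            # Delegate to the recursive generator instead of yielding it.
            yield from grab_files(full_path)
        elif os.path.isfile(full_path):
            yield full_path
```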
Khasi answered 9/11, 2009 at 1:43 Comment(1)
Thanks for the example. I figured out this solution about five minutes after I had posted the question. XDImperceptive
E
11

Starting with Python 3.4, you can use the pathlib module:

In [48]: def alliter(p):
   ....:     yield p
   ....:     for sub in p.iterdir():
   ....:         if sub.is_dir():
   ....:             yield from alliter(sub)
   ....:         else:
   ....:             yield sub
   ....:             

In [49]: g = alliter(pathlib.Path("."))

In [50]: [next(g) for _ in range(10)]
Out[50]: 
[PosixPath('.'),
 PosixPath('.pypirc'),
 PosixPath('.python_history'),
 PosixPath('lshw'),
 PosixPath('.gstreamer-0.10'),
 PosixPath('.gstreamer-0.10/registry.x86_64.bin'),
 PosixPath('.gconf'),
 PosixPath('.gconf/apps'),
 PosixPath('.gconf/apps/gnome-terminal'),
 PosixPath('.gconf/apps/gnome-terminal/%gconf.xml')]

This is essentially the object-oriented version of sjthebat's answer. Note that the Path.glob pattern ** on its own returns only directories (in Python versions before 3.13)!

Executrix answered 8/4, 2014 at 21:9 Comment(3)
For people dealing with many files in directories, I believe this is the only truly iterative solution on this answer and possibly the only high-level way in the python(3) standard library. It should probably be added as an option to iterdir().Botswana
@Botswana Isn't yield from alliter(sub) within a generator alliter recursive rather than iterative?Executrix
You are right. What I mean is that it gives you results without first doing a full stat on all the files in a directory. So even when you have a large number of files it can generate results immediately.Botswana
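
If the recursion itself is a concern, it can be removed by keeping an explicit stack of directories still to visit. This is a minimal sketch with the hypothetical name alliter_stack; it yields the same entries as alliter but in a different order:

```python
from pathlib import Path

def alliter_stack(top):
    """Yield top and everything below it without recursive calls."""
    top = Path(top)
    yield top
    stack = [top]  # directories whose contents have not been listed yet
    while stack:
        current = stack.pop()
        for sub in current.iterdir():
            yield sub
            if sub.is_dir():
                stack.append(sub)
```

Like alliter, it still produces results as soon as the first directory is read.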
P
2

os.scandir() is a function that "returns directory entries along with file attribute information, giving better performance [than os.listdir()] for many common use cases." It returns an iterator and does not use os.listdir() internally.
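
A recursive generator built on os.scandir() might look like the sketch below. The name scan_files is hypothetical, and passing follow_symlinks=False is a choice made here so that symbolic links are skipped rather than followed:

```python
import os

def scan_files(top):
    """Lazily yield the path of every regular file below top."""
    # The with-statement closes the scandir iterator promptly (Python 3.6+).
    with os.scandir(top) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                yield from scan_files(entry.path)
            elif entry.is_file(follow_symlinks=False):
                yield entry.path
```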

Profession answered 3/11, 2020 at 19:4 Comment(0)
M
0

You can use path.py. Unfortunately the author's website is no longer around, but you can still download the code from PyPI. This library is a wrapper around path functions in the os module.

path.py provides a walkfiles() method which returns a generator iterating recursively over all files in the directory:

>>> from path import path
>>> print path.walkfiles.__doc__
 D.walkfiles() -> iterator over files in D, recursively.

        The optional argument, pattern, limits the results to files
        with names that match the pattern.  For example,
        mydir.walkfiles('*.tmp') yields only files with the .tmp
        extension.

>>> p = path('/tmp')
>>> p.walkfiles()
<generator object walkfiles at 0x8ca75a4>
>>> 
Mulvihill answered 9/11, 2009 at 1:10 Comment(0)
P
0

An addendum to gerrit's answer: I wanted to make something more flexible.

This lists all files in pth matching a given pattern, and can also list directories if only_file is False.

from pathlib import Path

def walk(pth=Path('.'), pattern='*', only_file=True):
    """List all files in pth matching a given pattern; can also list dirs if only_file is False."""
    if pth.match(pattern) and not (only_file and pth.is_dir()):
        yield pth
    for sub in pth.iterdir():
        if sub.is_dir():
            yield from walk(sub, pattern, only_file)
        elif sub.match(pattern):
            yield sub
Puett answered 9/5, 2016 at 13:33 Comment(0)
