Python Walk, but Thread Lightly
A

3

5

I'd like to walk a directory recursively, but I want Python to abandon any single listdir as soon as it encounters a directory with more than 100 files. Basically, I'm searching for a .TXT file, but I want to avoid directories holding large DPX image sequences (usually 10,000 files). Since DPX files live in directories by themselves, with no subdirectories, I'd like to break out of that loop as soon as possible.

Long story short: if Python encounters a file matching ".DPX$", it should stop listing that subdirectory, back out, skip it, and continue the walk in the other subdirectories.

Is it possible to break out of a directory listing loop before all the results have been returned?

Alephnull answered 4/5, 2012 at 18:51 Comment(4)
Is there anything distinct about the names of the directories containing DPX image sequences? Yellowwood
If you want to read large directories incrementally (i.e. not just stop the recursion, but not even read their individual contents), you might need to use something like the solutions described at #4404098 Concoction
Some directories have 'dpx' in the name, but not all of them :( @charles, will that example work for me? I want to break out of a listing as soon as I hit a DPX file, so that I can avoid iterating through 100,000 file names, which takes a long time. Alephnull
@Alephnull Yes, the answer in the other question that uses ctypes should point you in the right direction. Note the warnings about it not being portable. Concoction
M
1

The right way to avoid allocating the full list of names that os.listdir builds is to call the OS-level functions directly, as @Charles Duffy said.

Inspired by this other post: List files in a folder as a stream to begin process immediately

I added a solution to the OP's specific question and used the re-entrant version of the function.

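# Python 3. The struct layout below assumes glibc on 64-bit Linux; it is not portable.
import os
from ctypes import (CDLL, POINTER, Structure, byref, cast, get_errno, sizeof,
                    c_char, c_char_p, c_int, c_long, c_size_t, c_ubyte,
                    c_ulong, c_ushort, c_void_p)
from ctypes.util import find_library

class c_dir(Structure):
    """Opaque type for a directory stream, corresponds to DIR"""
    pass

class c_dirent(Structure):
    """Directory entry"""
    # FIXME not sure these are the exactly correct types!
    _fields_ = (
        ('d_ino', c_ulong), # inode number
        ('d_off', c_long), # offset to the next dirent
        ('d_reclen', c_ushort), # length of this record
        ('d_type', c_ubyte), # type of file; not supported by all file system types
        ('d_name', c_char * 4096) # filename (buffer is deliberately oversized)
        )
c_dirent_p = POINTER(c_dirent)
c_dirent_pp = POINTER(c_dirent_p)
c_dir_p = POINTER(c_dir)

# use_errno=True is required for get_errno() to report the C errno
c_lib = CDLL(find_library("c"), use_errno=True)
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

readdir_r = c_lib.readdir_r
readdir_r.argtypes = [c_dir_p, c_dirent_p, c_dirent_pp]
readdir_r.restype = c_int

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

# Without explicit types ctypes treats the malloc result as a C int,
# which truncates the pointer on 64-bit systems.
malloc = c_lib.malloc
malloc.argtypes = [c_size_t]
malloc.restype = c_void_p

free = c_lib.free
free.argtypes = [c_void_p]
free.restype = None

def listdirx(path):
    """
    A generator to return the names of files in the directory passed in
    """
    dir_p = opendir(os.fsencode(path))

    if not dir_p:
        err = get_errno()
        raise OSError(err, os.strerror(err), path)

    buf = malloc(sizeof(c_dirent))
    if not buf:
        closedir(dir_p)
        raise MemoryError()
    entry_p = cast(buf, c_dirent_p)
    # Separate result pointer: readdir_r sets it to NULL at end of directory,
    # so reusing entry_p here would lose the buffer address and leak it.
    result_p = c_dirent_p()

    try:
        while True:
            res = readdir_r(dir_p, entry_p, byref(result_p))
            if res:
                # readdir_r returns the error number directly
                raise OSError(res, os.strerror(res), path)
            if not result_p:
                break
            name = result_p.contents.d_name  # bytes in Python 3
            if name not in (b".", b".."):
                yield os.fsdecode(name)
    finally:
        closedir(dir_p)
        free(buf)

if __name__ == '__main__':
    import sys
    path = sys.argv[1]
    max_per_dir = int(sys.argv[2])
    for idx, entry in enumerate(listdirx(path)):
        if idx >= max_per_dir:
            break
        print(entry)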
Manganite answered 4/5, 2012 at 22:3 Comment(2)
So instead of "if idx >= max_per_dir:" I replace it with "if re.search('\.DPX$', entry):"? Is it that simple? Alephnull
Yes, if you find one file that ends with .DPX you can ignore that directory. But the function is not recursive; it only iterates over a single path. Manganite
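Since listdirx only covers a single directory, here is a minimal sketch of the recursive version of the same idea. It swaps in a different technique from the ctypes answer: os.scandir from the standard library (Python 3.5+, usable as a context manager from 3.6), which yields entries lazily instead of building the full list. The function name find_txt and its parameters are hypothetical, and the sketch assumes, as the question states, that DPX directories contain nothing else worth visiting.

```python
import os

def find_txt(root, skip_ext=".dpx", want_ext=".txt"):
    """Yield paths of want_ext files under root, abandoning any directory
    as soon as one entry ending in skip_ext is seen (hypothetical helper)."""
    found = []    # matches in this directory, held back until it proves clean
    subdirs = []  # subdirectories to visit once this listing is finished
    with os.scandir(root) as it:  # lazy iterator, not a full list
        for entry in it:
            name = entry.name.lower()
            if name.endswith(skip_ext):
                return  # stop reading this directory and skip it entirely
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            elif name.endswith(want_ext):
                found.append(entry.path)
    yield from found
    for sub in subdirs:
        yield from find_txt(sub, skip_ext, want_ext)
```

Calling list(find_txt("/some/root")) would collect the matches. The early return only saves time if a DPX file shows up early in the listing, which is likely but not guaranteed when a directory holds nothing but DPX files.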
S
4

If by 'directory listing loop' you mean os.listdir(), then no: it returns the complete list in one go, so there is no loop to break out of. You could, however, look at os.path.walk() (Python 2 only) or os.walk() and simply remove every directory that contains DPX files. If you use os.walk() and walk top-down, you can control which directories Python descends into by modifying the list of directories in place. os.path.walk() lets you choose where to walk through its visit callback.
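A minimal sketch of the os.walk() approach, assuming Python 3; the function name walk_skipping_dpx is hypothetical. Note that os.walk() still reads the complete listing of each DPX directory before the check runs, so this only prevents descending further and does not save the listing time itself.

```python
import os

def walk_skipping_dpx(root):
    """Yield .txt paths under root, ignoring any directory that holds a
    .dpx file along with everything beneath it (hypothetical helper)."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        if any(f.lower().endswith(".dpx") for f in filenames):
            del dirnames[:]  # in-place edit: os.walk will not descend here
            continue
        for f in filenames:
            if f.lower().endswith(".txt"):
                yield os.path.join(dirpath, f)
```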

Sweatshop answered 4/5, 2012 at 19:5 Comment(3)
Notably, there are alternatives to os.listdir() (i.e. using ctypes to invoke the underlying system call) that can read a directory incrementally. Concoction
How can I know whether a directory has a DPX file in it without reading every file in the directory? It takes 30 minutes simply to list the directories with DPXs inside. For example:
root_dir/:
-file.txt
-subdir1/
--file1.txt
--file2.txt
--file3.txt
-subdir2/
--file1.txt
--file2.dpx ***BREAK LOOP***
--subdir3/
--file1.txt
--file2.txt
--file3.txt
Alephnull
Using ctypes and re-entrant reading of the directory is probably your best bet, as @Charles said. Alternatively, you could write a specialised directory-listing function as a C extension module for Python and import it. Some form of re-entrant listing in C that raises an exception when a DPX file is found would be the fastest solution, though potentially more complex than a Python-only one. Potentially not, though. Sweatshop
G
2

According to the documentation for os.walk:

When topdown is True, the caller can modify the dirnames list in-place (e.g., via del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, or to impose a specific order of visiting. Modifying dirnames when topdown is False is ineffective, since the directories in dirnames have already been generated by the time dirnames itself is generated.

So in theory, if you empty out dirnames, os.walk will not recurse into any further directories. Note the remark about "...via del or slice assignment": you cannot simply write dirnames = [], because that rebinds the name to a new list and leaves the list that os.walk holds unchanged.
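The difference between rebinding and in-place modification can be seen without os.walk at all. A small illustration; the function names rebind and clear_in_place are made up for this example:

```python
def rebind(names):
    names = []      # rebinds the local name only; the caller's list is untouched

def clear_in_place(names):
    names[:] = []   # slice assignment empties the very list the caller holds

dirs = ["a", "b"]
rebind(dirs)
print(dirs)         # ['a', 'b']
clear_in_place(dirs)
print(dirs)         # []
```

os.walk keeps its own reference to the dirnames list it handed out, so only the second form changes what it sees.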

Guild answered 4/5, 2012 at 19:13 Comment(0)