List files in a folder as a stream to begin processing immediately
I have a folder with 1 million files in it.

I would like to begin processing immediately, while the files in this folder are still being listed, in Python or another scripting language.

The usual functions (os.listdir in Python, ...) are blocking, so my program has to wait until the whole list has been built, which can take a long time.

What's the best way to list huge folders?

Vigilance answered 9/12, 2010 at 22:2 Comment(1)
You want the POSIX functions opendir/readdir, I think, but I don't believe they're available in Python's standard library. What's the processing you plan to do on the filenames? - Mixer
M
12

If convenient, change your directory structure; but if not, you can use ctypes to call opendir and readdir.

Here is a copy of that code; all I did was indent it properly, add the try/finally block, and fix a bug. You might have to debug it, particularly the struct layout.

Note that this code is not portable. You would need to use different functions on Windows, and I think the structs vary from Unix to Unix.

#!/usr/bin/python
"""
An equivalent os.listdir but as a generator using ctypes
"""

from ctypes import CDLL, c_char_p, c_int, c_long, c_ushort, c_byte, c_char, Structure, POINTER
from ctypes.util import find_library

class c_dir(Structure):
    """Opaque type for directory entries, corresponds to struct DIR"""
    pass
c_dir_p = POINTER(c_dir)

class c_dirent(Structure):
    """Directory entry"""
    # FIXME not sure these are the exactly correct types!
    _fields_ = (
        ('d_ino', c_long), # inode number
        ('d_off', c_long), # offset to the next dirent
        ('d_reclen', c_ushort), # length of this record
        ('d_type', c_byte), # type of file; not supported by all file system types
        ('d_name', c_char * 4096) # filename
        )
c_dirent_p = POINTER(c_dirent)

c_lib = CDLL(find_library("c"))
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

# FIXME Should probably use readdir_r here
readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

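def listdir(path):
    """
    A generator to return the names of files in the directory passed in.
    On Python 3, pass the path as bytes; names are yielded as bytes.
    """
    dir_p = opendir(path)
    if not dir_p:
        # opendir returns NULL on failure; calling readdir on NULL would crash
        raise OSError("opendir failed for %r" % (path,))
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            name = p.contents.d_name
            # Bytes literals compare correctly on both Python 2.6+ and 3
            if name not in (b".", b".."):
                yield name
    finally:
        closedir(dir_p)

if __name__ == "__main__":
    for name in listdir(b"."):
        print(name)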
Maquette answered 9/12, 2010 at 22:22 Comment(3)
Pretty sure this is missing c_dir_p = POINTER(c_dir) - Calisaya
In the line dir_p = opendir("."), the argument should be path instead of the current directory. - Pleonasm
Bogolt: Yep. Fixed it. - Maquette
P
3

This feels dirty but should do the trick:

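import subprocess

def listdirx(dirname='.', cmd='ls'):
    # universal_newlines=True makes the pipe return text on Python 2 and 3
    proc = subprocess.Popen([cmd, dirname], stdout=subprocess.PIPE,
                            universal_newlines=True)
    try:
        # readline returns '' only at end of output, which ends the loop
        for line in iter(proc.stdout.readline, ''):
            yield line.rstrip('\n')
    finally:
        proc.stdout.close()
        proc.wait()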

Usage: listdirx('/something/with/lots/of/files')

Pubescent answered 9/12, 2010 at 22:22 Comment(2)
ls sorts the filenames, though, at least by default. So I don't think it can start returning them any faster than os.listdir() could. Is there a flag to make ls not sort? - Maquette
ls -f does not sort. Note that -f turns on the -a flag, so if you don't want hidden files, hidden directories, . and .., they'd need to be filtered out. - Pubescent
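As a rough sketch of what the comments above suggest, the generator below runs ls -f and filters out the entries that -f switches on. The name listdir_unsorted and the show_hidden parameter are made up for illustration, and it assumes a Unix-like system with ls on the PATH, file names without embedded newlines, and names that decode in the locale encoding.

```python
import subprocess

def listdir_unsorted(dirname=".", show_hidden=False):
    """Yield names from `ls -f` (unsorted) as ls writes them."""
    # "--" stops option parsing, so a dirname starting with "-" is safe.
    proc = subprocess.Popen(["ls", "-f", "--", dirname],
                            stdout=subprocess.PIPE,
                            universal_newlines=True)
    try:
        for line in iter(proc.stdout.readline, ""):
            name = line.rstrip("\n")
            # -f implies -a, so "." and ".." always show up; drop them.
            if name in (".", ".."):
                continue
            # Optionally drop dotfiles too (hypothetical convenience flag).
            if not show_hidden and name.startswith("."):
                continue
            yield name
    finally:
        proc.stdout.close()
        proc.wait()
```

Usage: for name in listdir_unsorted('/something/with/lots/of/files'): ...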
M
3

For people coming in off Google: PEP 471 added a proper solution to the Python 3.5 standard library, and it was backported to Python 2.6+ and 3.2+ as the scandir package on PyPI.

Source: https://mcmap.net/q/863305/-best-way-to-get-files-list-of-big-directory-on-python

Python 3.5+:

  • os.walk has been updated to use this infrastructure for better performance.
  • os.scandir returns an iterator over DirEntry objects.

Python 2.6/2.7 and 3.2/3.3/3.4:

  • scandir.walk is a more performant version of os.walk
  • scandir.scandir returns an iterator over DirEntry objects.
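A minimal sketch of how code could support both variants listed above: try the standard library first and fall back to the backport. The helper first_names is a hypothetical name used only to show that iteration can stop early without reading the whole directory.

```python
import itertools

try:
    from os import scandir       # Python 3.5+
except ImportError:
    from scandir import scandir  # backport package from PyPI

def first_names(path, limit):
    """Return at most `limit` entry names from `path`, in directory order."""
    it = scandir(path)
    try:
        # islice stops pulling entries once `limit` names have been read.
        return [entry.name for entry in itertools.islice(it, limit)]
    finally:
        # Newer iterators have close(); older ones are released by the GC.
        close = getattr(it, "close", None)
        if close is not None:
            close()
```

Usage: first_names('/something/with/lots/of/files', 100)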

The scandir() iterators wrap opendir/readdir on POSIX platforms and FindFirstFileW/FindNextFileW on Windows.

The point of returning DirEntry objects is to allow metadata to be cached, to minimize the number of system calls made. (For example, DirEntry.stat(follow_symlinks=False) never makes a system call on Windows, because the FindFirstFileW and FindNextFileW functions return stat information for free.)

Source: https://docs.python.org/3/library/os.html#os.scandir
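To illustrate the DirEntry description above, here is a small sketch of a generator that yields regular files and their sizes one at a time, so processing can start before the directory has been fully read. The name iter_files is hypothetical, and the with statement assumes Python 3.6 or later, where the scandir iterator became a context manager.

```python
import os

def iter_files(path="."):
    """Lazily yield (name, size_in_bytes) for regular files in `path`."""
    # The context manager closes the directory handle even if the
    # caller stops iterating early.
    with os.scandir(path) as entries:
        for entry in entries:
            # is_file() can usually be answered from data returned with the
            # entry; stat() may cost an extra system call on POSIX.
            if entry.is_file(follow_symlinks=False):
                yield entry.name, entry.stat(follow_symlinks=False).st_size
```

Usage: for name, size in iter_files('/something/with/lots/of/files'): ...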

Mufinella answered 13/5, 2016 at 9:10 Comment(1)
It seems its development started in 2013, but I didn't find it at the time (2015). I really don't know how. So I wrote my own solution, and was also disappointed to discover (in scandir's code) that FindFirstFile is in kernel32.dll, hiding right under my nose the whole time. For both reasons I was on the point of drowning myself in a teaspoon, but decided to edit my posts instead. :D You got here first, so +1! OK, I still get to add info about FindFirstFile(). :D - Sergio
S
0

Here is how to traverse a large directory file by file on Windows.

I searched like a maniac for a Windows DLL that would let me do what is done on Linux, but had no luck.

So I concluded that the only way was to create my own DLL exposing those functions, but then I remembered pywin32. And yay, this is already done there. Even better, an iterator function is already implemented!

A Windows DLL with FindFirstFile(), FindNextFile() and FindClose() might still be out there somewhere, but I didn't find it. So I used pywin32.

EDIT: They were hiding in plain sight in kernel32.dll. Please see ssokolow's answer and my comment on it.

Sorry for the dependency. But I think you can extract win32file.pyd from the ...\site-packages\win32 folder, along with any dependencies, and distribute it with your program independently of pywin32 if you have to.

I found this question when searching on how to do this, and some others as well.

Here:

How to copy first 100 files from a directory of thousands of files using python?

I posted the full code there, with the Linux version of listdir() from this page (by Jason Orendorff) and the Windows version that I present here.

So anyone wanting a more or less cross-platform version can go there or combine the two answers themselves.

EDIT: Or better still, use the scandir module, or os.scandir() in Python 3.5 and later. It handles errors and some other things better.

from win32file import FindFilesIterator
import errno
import os
import stat

def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    if "*" not in path and "?" not in path:
        st = os.stat(path) # Raise an error if dir doesn't exist or access is denied to us
        # Check if we got a dir or something else!
        if not stat.S_ISDIR(st.st_mode):
            raise OSError(errno.ENOTDIR, "Not a directory", path)
        path = path.rstrip("\\/") + "\\*"
    # Else: assume the caller supplied a wildcard pattern on purpose
    for file in FindFilesIterator(path):
        name = file[-2]
        # Unfortunately, only drives (eg. C:) don't include "." and ".." in the list:
        if name in (".", ".."):
            continue
        yield name
Sergio answered 6/8, 2015 at 15:1 Comment(0)