Why is os.scandir() as slow as os.listdir()?

I tried to optimize a file-browsing function written in Python, on Windows, by using os.scandir() instead of os.listdir(). However, the run time is unchanged at about two and a half minutes, and I can't tell why. Below are the original and the altered functions:

os.listdir() version:

def browse(self, path, tree):
    # for each entry in the path
    for entry in os.listdir(path):
        entity_path = os.path.join(path, entry)
        # skip entries that git ignores
        if self.git_ignore(entity_path) is False:
            # if it is a directory, create a new level in the tree
            if os.path.isdir(entity_path):
                tree[entry] = Folder(entry)
                self.browse(entity_path, tree[entry])
            # if it is a file, add it to the tree
            if os.path.isfile(entity_path):
                tree[entry] = File(entity_path)

os.scandir() version:

def browse(self, path, tree):
    # for each entry in the path
    for dirEntry in os.scandir(path):
        entry_path = dirEntry.name
        entity_path = dirEntry.path
        # skip entries that git ignores
        if self.git_ignore(entity_path) is False:
            # if it is a directory, create a new level in the tree
            if dirEntry.is_dir(follow_symlinks=True):
                tree[entry_path] = Folder(entity_path)
                self.browse(entity_path, tree[entry_path])
            # if it is a file, add it to the tree
            if dirEntry.is_file(follow_symlinks=True):
                tree[entry_path] = File(entity_path)

In addition, here are the auxiliary functions used within this one:

def git_ignore(self, filepath):
    if '.git' in filepath:
        return True
    if '.ci' in filepath:
        return True
    if '.delivery' in filepath:
        return True
    child = subprocess.Popen(['git', 'check-ignore', str(filepath)],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    output = child.communicate()[0]
    status = child.wait()
    return status == 0

============================================================

class Folder(dict):
    def __init__(self, path):
        self.path = path
        self.categories = {}

============================================================

class File(object):
    def __init__(self, path):
        self.path = path
        self.filename, self.extension = os.path.splitext(self.path)

How can I make the function run faster? My assumption is that extracting the name and path at the beginning makes it slower than it should be. Is that correct?

Interfere answered 10/12, 2019 at 13:49 Comment(1)
For every path that doesn't contain ".git", ".ci", or ".delivery", you're spawning a git child process. That's expensive, and if you have many such paths, the cumulative time spent spawning and waiting for git processes will be a bottleneck. (Maurene)
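One way to act on that comment, sketched under assumptions: collect the candidate paths first and hand them all to a single `git check-ignore --stdin -z` process instead of starting one process per path. The helper names `parse_nul_separated` and `ignored_paths` are made up for illustration, and the code assumes git is on PATH and that `repo_root` lies inside a git working tree.

```python
import os
import subprocess


def parse_nul_separated(data):
    """Split NUL-separated bytes (the `-z` output format) into a set of paths."""
    return {os.fsdecode(part) for part in data.split(b"\0") if part}


def ignored_paths(paths, repo_root):
    """Return the subset of `paths` that git ignores, using one git process.

    Assumption: git is on PATH and repo_root is inside a git working tree.
    """
    paths = list(paths)
    if not paths:
        return set()  # nothing to ask git about, so no process is started
    # With --stdin and -z, input paths are NUL-separated as well.
    payload = b"\0".join(os.fsencode(p) for p in paths) + b"\0"
    result = subprocess.run(
        ["git", "check-ignore", "--stdin", "-z"],
        input=payload,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        cwd=repo_root,
    )
    # Exit status 0: at least one path is ignored; 1: none are; other: error.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return parse_nul_separated(result.stdout)
```

The tree could then be built in two passes: one to gather paths, one to skip everything in the returned set. Whether git echoes each path exactly as it was passed in is worth checking on your own repository before relying on set membership.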

Regarding your question:

os.walk() seems to call os.stat() more often than necessary, at least in Python versions before 3.5, which seems to be why it is slower there than os.scandir().

In this case, I think the best way to improve performance would be to use parallel processing, which can speed up some loops considerably. There are multiple posts about this topic. Here is one: Parallel Processing in Python – A Practical Guide with Examples.
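As a rough sketch of what that could look like here: the slow part of the question's loop waits on an external process, so a thread pool is probably enough (no need for multiple Python processes). The function name `filter_in_parallel` and the `is_ignored` callback are hypothetical; in the question's code the callback would be something like `self.git_ignore`.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_in_parallel(paths, is_ignored, max_workers=8):
    """Return the paths for which is_ignored(path) is false.

    The checks run concurrently in a thread pool; the input order is kept.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        flags = list(pool.map(is_ignored, paths))
    return [path for path, ignored in zip(paths, flags) if not ignored]
```

Note that this only overlaps the waiting. It still starts one check per path, so reducing the number of external processes is likely to matter more than running them in parallel.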


Nevertheless, I would like to share some thoughts about it.

I have also been wondering what the best use of each of these three options (scandir, listdir, walk) is. There is not much documentation comparing their performance, so the best approach is probably to test it yourself, as you did. Here are my conclusions:

Usage of os.listdir():

It doesn't seem to have any advantage over os.scandir() except that it is easier to understand. I still use it when I only need to list the files in a directory.

PROS:

  • Fast & Simple

CONS:

  • Too simple: it only lists the names of files and dirs in a directory, so you might need to combine it with other calls to get metadata about the files. If so, it is better to use os.scandir().
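To make that difference concrete, here is a small sketch that collects file sizes both ways. The function names are made up for illustration. With os.listdir() every entry needs extra os.path / os.stat() calls, while the DirEntry objects from os.scandir() can usually answer is_file() from the directory listing itself (and, on Windows, stat() as well for non-symlinks, according to the os.scandir() documentation).

```python
import os


def sizes_with_listdir(path):
    """Map file name -> size using os.listdir() plus separate per-entry calls."""
    sizes = {}
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full):               # one system call per entry
            sizes[name] = os.stat(full).st_size  # and another one here
    return sizes


def sizes_with_scandir(path):
    """Map file name -> size using the DirEntry objects from os.scandir()."""
    sizes = {}
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():                # usually answered from the listing
                sizes[entry.name] = entry.stat().st_size
    return sizes
```

Both functions should return the same dictionary for the same directory; only the number of system calls behind them differs.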

Usage of os.walk():

This is the most used function when we need to fetch all the items in a directory (and subdirs).

PROS:

  • It's probably the easiest way to walk through the paths and names of all the items.

CONS:

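  • It seems to call os.stat() more often than necessary, which seems to be why it is slower than os.scandir(). Since Python 3.5, os.walk() is itself built on os.scandir(), so the gap is mostly a concern on older versions.
  • Although it gives you the root directory of each file, it doesn't provide the extra metadata that os.scandir() does.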
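For reference, a minimal os.walk() sketch that also skips unwanted directories. The function name `walk_files` and the default `skip_dirs` tuple are assumptions for illustration (the names are borrowed from the question); note that it matches exact directory names, which is not the same as the substring test in the question's git_ignore().

```python
import os


def walk_files(root, skip_dirs=(".git", ".ci", ".delivery")):
    """Yield the paths of all files under root, skipping the named directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # In top-down mode, editing dirnames in place stops os.walk()
        # from descending into the removed directories.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Pruning this way means the contents of a skipped directory are never listed at all.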

Usage of os.scandir():

It seems to have (almost) the best of both worlds. It gives you the speed of the simple os.listdir() plus extra features that let you simplify your loops, since you can often avoid extra os.stat() calls or external metadata tools when you only need basic information about the files (type, size, timestamps).

PROS:

  • Fast: the same speed as os.listdir().
  • Very nice extra features (file type and stat information on each entry).

CONS:

  • If you want to descend into subdirectories, you need to write another function that scans each subdir. This function is pretty simple, but it may be more pythonic (I just mean more elegant syntax) to use os.walk() in this case.
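The extra function mentioned in the last point can be quite short. A minimal sketch, with the name `scantree` chosen for illustration; it does not follow directory symlinks, which is an assumption made here to avoid loops:

```python
import os


def scantree(path):
    """Recursively yield os.DirEntry objects for everything that is not a directory."""
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                # Recurse into real subdirectories only.
                yield from scantree(entry.path)
            else:
                yield entry
```

Each yielded entry still carries .name, .path, .is_file() and .stat(), so the caller keeps the metadata benefits of os.scandir().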

So that's my view after reading a bit and using them. I'm happy to be corrected, so I can learn more about it.

Acetone answered 24/7, 2020 at 14:37 Comment(2)
Just a quick 'thank you!' for the comprehensive and well organised answer. My understanding is much clearer. (Senna)
I upvote this wonderful answer, but it would be great if someone supported that by real speed measurements. (Sideward)
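For anyone who wants to measure this on their own tree, here is a small timing harness as a starting point. The function names are made up for illustration, no results are claimed here, and the numbers will depend heavily on the OS, the file system, and whether the directory contents are already cached.

```python
import os
import timeit


def count_with_listdir(path):
    """Count files below path using os.listdir() and os.path checks."""
    total = 0
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isdir(full) and not os.path.islink(full):
            total += count_with_listdir(full)
        elif os.path.isfile(full):
            total += 1
    return total


def count_with_scandir(path):
    """Count files below path using os.scandir() entries."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                total += count_with_scandir(entry.path)
            elif entry.is_file():
                total += 1
    return total


def compare(path, repeat=5):
    """Return the best-of-`repeat` wall time in seconds for each variant."""
    return {
        "listdir": min(timeit.repeat(lambda: count_with_listdir(path),
                                     number=1, repeat=repeat)),
        "scandir": min(timeit.repeat(lambda: count_with_scandir(path),
                                     number=1, repeat=repeat)),
    }
```

Calling `compare(r"C:\some\large\tree")` (a placeholder path) prints nothing by itself; print the returned dictionary to see both timings. Run it more than once, since the first pass usually includes the cost of a cold file-system cache.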
