How to do a recursive sub-folder search and return files in a list?
I am working on a script to recursively go through subfolders in a main folder and build a list of files of a certain file type. I am having an issue with the script. It's currently set up as follows:

for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,subFolder,item))

The problem is that the subFolder variable is pulling in a list of subfolders rather than the folder that the ITEM file is located in. I was thinking of running a for loop over the subfolders beforehand and joining the first part of the path, but I figured I'd double check to see if anyone has any suggestions first.

Evocative answered 23/8, 2013 at 3:18 Comment(0)
310

You should be using the dirpath, which you call root. The dirnames are supplied so you can prune them if there are folders that you don't wish os.walk to recurse into.

import os
result = [os.path.join(dp, f) for dp, dn, filenames in os.walk(PATH) for f in filenames if os.path.splitext(f)[1] == '.txt']
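The pruning mentioned above can be sketched like this (a minimal example; the `.git` folder name is just an illustration). Assigning to dirnames in place keeps os.walk from descending into the removed directories:

```python
import os

result = []
for dirpath, dirnames, filenames in os.walk("."):
    # Prune in place: os.walk will not descend into the removed directories
    dirnames[:] = [d for d in dirnames if d != ".git"]
    result.extend(os.path.join(dirpath, f) for f in filenames if f.endswith(".txt"))
```

Note that a plain rebinding (`dirnames = [...]`) would not work; the slice assignment mutates the list that os.walk itself iterates over.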

Edit:

After the latest downvote, it occurred to me that glob is a better tool for selecting by extension.

import os
from glob import glob
result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

Also a generator version:

from itertools import chain
result = chain.from_iterable(glob(os.path.join(x[0], '*.txt')) for x in os.walk(PATH))

Edit2 for Python 3.4+

from pathlib import Path
result = list(Path(".").rglob("*.[tT][xX][tT]"))
Selfeffacement answered 23/8, 2013 at 3:24 Comment(8)
'*.[Tt][Xx][Tt]' glob pattern will make the search case-insensitive.Torquemada
@SergiyKolesnikov, Thanks, I've used that in the edit at the bottom. Note that the rglob is insensitive on Windows platforms - but it's not portably insensitive.Selfeffacement
@JohnLaRooy It works with glob too (Python 3.6 here): glob.iglob(os.path.join(real_source_path, '**', '*.[xX][mM][lL]')Torquemada
@Sergiy: Your iglob does not work for files in sub-sub folders or below. You need to add recursive=True.Keverne
I don't see how glob "is a better tool for selecting by extension". It's slow. Like half the speed of your other solution. I did a full speed analysis here: https://mcmap.net/q/41115/-how-to-do-a-recursive-sub-folder-search-and-return-files-in-a-listKeverne
@user136036, "better" does not always mean fastest. Sometimes readability and maintainability are also important.Selfeffacement
How would you do this to match jpg|jpeg|JPG|JPEG|png|PNG?Indwell
AFAIK Path defaults to current directory so '.' is not necessary: Path().rglob("foo")Flann
243

Changed in Python 3.5: Support for recursive globs using “**”.

glob.glob() got a new recursive parameter.

If you want to get every .txt file under my_path (recursively including subdirs):

import glob

files = glob.glob(my_path + '/**/*.txt', recursive=True)

# my_path/     the dir
# **/       every file and dir under my_path
# *.txt     every file that ends with '.txt'

If you need an iterator you can use iglob as an alternative:

for file in glob.iglob(my_path + '/**/*.txt', recursive=True):
    # ...
Forbes answered 23/11, 2016 at 4:0 Comment(10)
TypeError: glob() got an unexpected keyword argument 'recursive'Jovitajovitah
It should be working. Make sure you use a version >= 3.5. I added a link to the documentation in my answer for more detail.Forbes
That would be why, I'm on 2.7Jovitajovitah
Why the list comprehension and not just files = glob.glob(PATH + '/*/**/*.txt', recursive=True)?Ashleaashlee
Whoops! :) It's totally redundant. No idea what made me write it like that. Thanks for mentioning it! I'll fix it.Forbes
Note : Using my_path + '/** instead of my_path + '/**/* as stated in the answer will include the current directory too.Contemplative
'*.[Tt][Xx][Tt]' glob pattern will make the search case-insensitive.Torquemada
What are these /**/ for?Palmapalmaceous
@Palmapalmaceous From docs: If recursive is true, the pattern “**” will match any files and zero or more directories, subdirectories and symbolic links to directories. If the pattern is followed by an os.sep or os.altsep then files will not match.Rimmer
Notably, as of Python 3.10, this is the fastest solution as well, as shown by @Keverne above.Biparous
58

This seems to be the fastest solution I could come up with, and is faster than os.walk and a lot faster than any glob solution.

  • It will also give you a list of all nested subfolders at basically no cost.
  • You can search for several different extensions.
  • You can also choose to return either full paths or just the names for the files by changing f.path to f.name (do not change it for subfolders!).

Args: dir: str, ext: list.
Function returns two lists: subfolders, files.

See below for a detailed speed analysis.

import os

def run_fast_scandir(dir, ext):    # dir: str, ext: list
    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files


subfolders, files = run_fast_scandir(folder, [".jpg"])

In case you need the file size, you can also create a sizes list and add f.stat().st_size like this for a display of MiB:

sizes.append(f"{f.stat().st_size/1024/1024:.0f} MiB")

Speed analysis

for various methods to get all files with a specific file extension inside all subfolders and the main folder.

tl;dr:

  • fast_scandir clearly wins and is twice as fast as all other solutions, except os.walk.
os.walk comes second, slightly slower.
  • using glob will greatly slow down the process.
  • None of the results use natural sorting. This means results will be sorted like this: 1, 10, 2. To get natural sorting (1, 2, 10), please have a look at:
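A minimal natural-sort sketch (not taken from the linked answer; the helper name is my own) uses a key function that splits the string into digit and non-digit runs, so numeric runs compare as integers:

```python
import re

def natural_key(s):
    # Split into digit / non-digit runs; numeric runs compare as integers
    return [int(t) if t.isdigit() else t.lower() for t in re.split(r"(\d+)", s)]

names = ["file10.txt", "file2.txt", "file1.txt"]
print(sorted(names, key=natural_key))
# ['file1.txt', 'file2.txt', 'file10.txt']
```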

Results:

fast_scandir    took  499 ms. Found files: 16596. Found subfolders: 439
os.walk         took  589 ms. Found files: 16596
find_files      took  919 ms. Found files: 16596
glob.iglob      took  998 ms. Found files: 16596
glob.glob       took 1002 ms. Found files: 16596
pathlib.rglob   took 1041 ms. Found files: 16596
os.walk-glob    took 1043 ms. Found files: 16596

Updated: 2022-07-20 (Py 3.10.1 looking for *.pdf)

glob.iglob      took 132 ms. Found files: 9999
glob.glob       took 134 ms. Found files: 9999
fast_scandir    took 331 ms. Found files: 9999. Found subfolders: 9330
os.walk         took 695 ms. Found files: 9999
pathlib.rglob   took 828 ms. Found files: 9999
find_files      took 949 ms. Found files: 9999
os.walk-glob    took 1242 ms. Found files: 9999

Tests were done with W7x64, Python 3.8.1, 20 runs. 16596 files in 439 (partially nested) subfolders.
find_files is from https://mcmap.net/q/41115/-how-to-do-a-recursive-sub-folder-search-and-return-files-in-a-list and lets you search for several extensions.
fast_scandir was written by myself and will also return a list of subfolders. You can give it a list of extensions to search for (I tested a one-entry list against a simple if ... == ".jpg" and there was no significant difference).


# -*- coding: utf-8 -*-
# Python 3


import time
import os
from glob import glob, iglob
from pathlib import Path


directory = r"<folder>"
RUNS = 20


def run_os_walk():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [os.path.join(dp, f) for dp, dn, filenames in os.walk(directory) for f in filenames if
                  os.path.splitext(f)[1].lower() == '.jpg']
    print(f"os.walk\t\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_os_walk_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = [y for x in os.walk(directory) for y in glob(os.path.join(x[0], '*.jpg'))]
    print(f"os.walk-glob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_glob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = glob(os.path.join(directory, '**', '*.jpg'), recursive=True)
    print(f"glob.glob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_iglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(iglob(os.path.join(directory, '**', '*.jpg'), recursive=True))
    print(f"glob.iglob\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def run_pathlib_rglob():
    a = time.time_ns()
    for i in range(RUNS):
        fu = list(Path(directory).rglob("*.jpg"))
    print(f"pathlib.rglob\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(fu)}")


def find_files(files, dirs=[], extensions=[]):
    # https://mcmap.net/q/41115/-how-to-do-a-recursive-sub-folder-search-and-return-files-in-a-list

    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1].lower() in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return


def run_fast_scandir(dir, ext):    # dir: str, ext: list
    # https://mcmap.net/q/41115/-how-to-do-a-recursive-sub-folder-search-and-return-files-in-a-list

    subfolders, files = [], []

    for f in os.scandir(dir):
        if f.is_dir():
            subfolders.append(f.path)
        if f.is_file():
            if os.path.splitext(f.name)[1].lower() in ext:
                files.append(f.path)


    for dir in list(subfolders):
        sf, f = run_fast_scandir(dir, ext)
        subfolders.extend(sf)
        files.extend(f)
    return subfolders, files



if __name__ == '__main__':
    run_os_walk()
    run_os_walk_glob()
    run_glob()
    run_iglob()
    run_pathlib_rglob()


    a = time.time_ns()
    for i in range(RUNS):
        files = []
        find_files(files, dirs=[directory], extensions=[".jpg"])
    print(f"find_files\t\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}")


    a = time.time_ns()
    for i in range(RUNS):
        subf, files = run_fast_scandir(directory, [".jpg"])
    print(f"fast_scandir\ttook {(time.time_ns() - a) / 1000 / 1000 / RUNS:.0f} ms. Found files: {len(files)}. Found subfolders: {len(subf)}")
Keverne answered 18/1, 2020 at 18:52 Comment(6)
Great solution, but I had one issue with it that took me a bit to figure out. Using your fast_scandir code, when it hits a file path beginning with a '.' and without an extension, such as .DS_Store or .gitignore, the if os.path.splitext(f.name)[1].lower() in ext will always return true, which is literally asking if '' in '.jpg' in your example. I recommend adding a length check (i.e. if len(os.path.splitext(f.name)[1]) > 0 and os.path.splitext(f.name)[1].lower() in ext).Pekoe
@BrandonHunter, it does not return True. print( os.path.splitext(".DS_Store")[1].lower() in [".jpg"] ) -> False. Keep in mind ext is a list and not a string.Keverne
You can eliminate the recursive nature of this function by appending dir to subfolders at the beginning of the function and then adding an outer loop that iterates over subfolders. This should give a very small speed improvement, especially for very deep directory structures. It also frees up the function's output in case you need to return something other than subfolders and files. Note that depending on the way you add and access elements of subfolders, the ordering of the output could be different.Taproot
Looks like this is not true. Benchmarking your code snippet on a larger dataset, it takes more time than the code that uses glob. However, the code works as expected.Uranic
glob is now 3x faster than fast_scandir when using Py 3.10.1.Rimmer
Furthermore, fast_scandir actually does not run on all types of network shares, since the recursion kills the drives' capacities. Don't use it if you do serious stuff.Tal
32

I will translate John La Rooy's list comprehension to nested for loops, just in case anyone else has trouble understanding it.

result = [y for x in os.walk(PATH) for y in glob(os.path.join(x[0], '*.txt'))]

Should be equivalent to:

import glob
import os

result = []

for x in os.walk(PATH):
    for y in glob.glob(os.path.join(x[0], '*.txt')):
        result.append(y)

Here's the documentation for list comprehension and the functions os.walk and glob.glob.

Milda answered 10/5, 2018 at 20:6 Comment(1)
This answer worked for me in Python 3.7.3. glob.glob(..., recursive=True) and list(Path(dir).glob(...')) did not.Swede
23

The new pathlib library simplifies this to one line:

from pathlib import Path
result = list(Path(PATH).glob('**/*.txt'))

You can also use the generator version:

from pathlib import Path
for file in Path(PATH).glob('**/*.txt'):
    pass

This returns Path objects, which you can use for pretty much anything, or get the file name as a string with file.name.
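As a small sketch (searching the current directory purely for illustration), the returned Path objects carry the usual conveniences:

```python
from pathlib import Path

result = list(Path(".").glob("**/*.txt"))
names = [p.name for p in result]          # file names as plain strings
absolute = [p.resolve() for p in result]  # absolute Path objects
suffixes = {p.suffix for p in result}     # the extensions found, e.g. {".txt"}
```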

Colonize answered 22/5, 2018 at 19:3 Comment(0)
14

Your original solution was very nearly correct, but the variable "root" is dynamically updated as os.walk recursively descends the tree. os.walk() is a recursive generator. Each tuple of (root, subFolder, files) is for a specific root the way you have it set up.

i.e.

root = 'C:\\'
subFolder = ['Users', 'ProgramFiles', 'ProgramFiles (x86)', 'Windows', ...]
files = ['foo1.txt', 'foo2.txt', 'foo3.txt', ...]

root = 'C:\\Users\\'
subFolder = ['UserAccount1', 'UserAccount2', ...]
files = ['bar1.txt', 'bar2.txt', 'bar3.txt', ...]

...

I made a slight tweak to your code to print a full list.

import os
for root, subFolder, files in os.walk(PATH):
    for item in files:
        if item.endswith(".txt") :
            fileNamePath = str(os.path.join(root,item))
            print(fileNamePath)

Hope this helps!

EDIT: (based on feedback)

The OP misunderstood/mislabeled the subFolder variable: it is actually all the subfolders in "root". Because of this, OP, you're trying to do os.path.join(str, list, str), which probably doesn't work out like you expected.
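A quick demonstration (hypothetical paths, Python 3) of why joining a list into a path fails:

```python
import os

# Passing a list where a path component is expected raises TypeError in Python 3
try:
    os.path.join("C:\\", ["Users", "Windows"], "foo.txt")
except TypeError as e:
    print(e)  # the exact message varies by Python version
```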

To help add clarity, you could try this labeling scheme:

import os
for current_dir_path, current_subdirs, current_files in os.walk(RECURSIVE_ROOT):
    for aFile in current_files:
        if aFile.endswith(".txt") :
            txt_file_path = str(os.path.join(current_dir_path, aFile))
            print(txt_file_path)
Cyrillic answered 7/7, 2020 at 19:36 Comment(3)
Elegant solution - thanks for explaining walk's recursive generator!Tangelatangelo
In some sense, this should be the accepted answer, though I feel perhaps it could explain the OP's mistake in some more detail.Camphorate
@triplee : detail added. Thanks for the feedback. :)Cyrillic
9

It's not the most Pythonic answer, but I'll put it here for fun because it's a neat lesson in recursion.

def find_files( files, dirs=[], extensions=[]):
    new_dirs = []
    for d in dirs:
        try:
            new_dirs += [ os.path.join(d, f) for f in os.listdir(d) ]
        except OSError:
            if os.path.splitext(d)[1] in extensions:
                files.append(d)

    if new_dirs:
        find_files(files, new_dirs, extensions )
    else:
        return

On my machine I have two folders, root and root2

mender@multivax ]ls -R root root2
root:
temp1 temp2

root/temp1:
temp1.1 temp1.2

root/temp1/temp1.1:
f1.mid

root/temp1/temp1.2:
f.mi  f.mid

root/temp2:
tmp.mid

root2:
dummie.txt temp3

root2/temp3:
song.mid

Let's say I want to find all .txt and all .mid files in either of these directories; then I can just do

files = []
find_files( files, dirs=['root','root2'], extensions=['.mid','.txt'] )
print(files)

#['root2/dummie.txt',
# 'root/temp2/tmp.mid',
# 'root2/temp3/song.mid',
# 'root/temp1/temp1.1/f1.mid',
# 'root/temp1/temp1.2/f.mid']
Elinoreeliot answered 12/8, 2017 at 3:59 Comment(0)
9

You can do it this way to return a list of absolute file paths.

import os


def list_files_recursive(path):
    """
    Receives a directory path as a parameter.
    :return: a list of files with their absolute paths
    """
    files = []

    # r = root, d = directories, f = files
    for r, d, f in os.walk(path):
        for file in f:
            files.append(os.path.join(r, file))

    return files


if __name__ == '__main__':

    result = list_files_recursive('/tmp')
    print(result)

Otherworld answered 13/11, 2019 at 6:0 Comment(0)
4

The recursive parameter is new in Python 3.5, so it won't work on Python 2.7. Here is an example that uses raw strings, so you just need to provide the path as-is on either Windows or Linux:

import glob

mypath=r"C:\Users\dj\Desktop\nba"

files = glob.glob(mypath + r'\**\*.py', recursive=True)
# print(files) # as list
for f in files:
    print(f) # nice looking single line per file

Note: It will list all files, no matter how deeply they are nested.

Tread answered 30/5, 2019 at 16:9 Comment(0)
4

This function will recursively put only files into a list.

import os


def ls_files(dir):
    files = list()
    for item in os.listdir(dir):
        abspath = os.path.join(dir, item)
        try:
            if os.path.isdir(abspath):
                files = files + ls_files(abspath)
            else:
                files.append(abspath)
        except FileNotFoundError as err:
            print('invalid directory\n', 'Error: ', err)
    return files
Spoon answered 30/9, 2019 at 15:22 Comment(0)
4

If you don't mind installing an additional lightweight library, you can do this:

pip install plazy

Usage:

import plazy

txt_filter = lambda x: x.endswith('.txt')
files = plazy.list_files(root='data', filter_func=txt_filter, is_include_root=True)

The result should look something like this:

['data/a.txt', 'data/b.txt', 'data/sub_dir/c.txt']

It works on both Python 2.7 and Python 3.

Github: https://github.com/kyzas/plazy#list-files

Disclaimer: I'm an author of plazy.

Hume answered 12/12, 2019 at 5:39 Comment(0)
2

You can use the recursive setting in the glob module to search through subdirectories.

For example:

import glob
glob.glob('//Mypath/folder/**/*', recursive=True)

The second line returns all files within subdirectories of that folder location. (Note: you need the '**/*' string at the end of your folder string to do this.)

If you specifically wanted to find text files deep within your subdirectories, you can use

glob.glob('//Mypath/folder/**/*.txt', recursive=True)
Sander answered 22/10, 2020 at 11:53 Comment(0)
1

The simplest and most basic method:

import os
for parent_path, _, filenames in os.walk('.'):
    for f in filenames:
        print(os.path.join(parent_path, f))
Modica answered 16/6, 2021 at 10:53 Comment(0)
