How to use glob() to find files recursively?
A

28

1026

I would like to list all files recursively in a directory. I currently have a directory structure like this:

  • src/main.c
  • src/dir/file1.c
  • src/another-dir/file2.c
  • src/another-dir/nested/files/file3.c

I've tried to do the following:

import os
from glob import glob

glob(os.path.join('src','*.c'))

But this will only get me files directly in the src subfolder, e.g. I get main.c but I will not get file1.c, file2.c etc. I could add one pattern per nesting level:

import os
from glob import glob

glob(os.path.join('src','*.c'))
glob(os.path.join('src','*','*.c'))
glob(os.path.join('src','*','*','*.c'))
glob(os.path.join('src','*','*','*','*.c'))

But this is obviously limited and clunky. How can I do this properly?

Alica answered 2/2, 2010 at 18:19 Comment(1)
doesn't glob('src/**/*.c') work in this case?Zebrass
T
1804

There are a couple of ways:

pathlib.Path().rglob()

Use pathlib.Path().rglob() from the pathlib module, which was introduced in Python 3.4.

from pathlib import Path

for path in Path('src').rglob('*.c'):
    print(path.name)

glob.glob()

If you don't want to use pathlib, use glob.glob():

from glob import glob

for filename in glob('src/**/*.c', recursive=True):
    print(filename)   

Note that glob.glob() wildcards do not match names beginning with a dot (.), such as hidden files and directories on Unix-based systems. If you need to match those, use the os.walk() solution below.
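As a minimal sketch of that difference, the snippet below builds a small throwaway tree and compares the two approaches. The helper names find_with_glob and find_with_walk and the file layout are made up for illustration; they are not part of the original answer.

```python
import fnmatch
import glob
import os
import tempfile


def find_with_glob(top, pattern):
    """Recursive search with glob; wildcards skip dot-names by default."""
    return sorted(glob.glob(os.path.join(top, '**', pattern), recursive=True))


def find_with_walk(top, pattern):
    """Recursive search with os.walk; hidden files and directories are visited."""
    matches = []
    for root, dirnames, filenames in os.walk(top):
        for filename in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(root, filename))
    return sorted(matches)


# Hypothetical layout: one visible file and one file inside a hidden directory.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, '.hidden'))
    for relpath in ('main.c', os.path.join('.hidden', 'secret.c')):
        open(os.path.join(top, relpath), 'w').close()

    print(find_with_glob(top, '*.c'))  # the file under .hidden is not listed
    print(find_with_walk(top, '*.c'))  # both files are listed
```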

os.walk()

For older Python versions, use os.walk() to recursively walk a directory and fnmatch.filter() to match against a simple expression:

import fnmatch
import os

matches = []
for root, dirnames, filenames in os.walk('src'):
    for filename in fnmatch.filter(filenames, '*.c'):
        matches.append(os.path.join(root, filename))

This version should also be faster depending on how many files you have, as the pathlib module has a bit of overhead over os.walk().
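With os.walk() you can also prune the traversal by editing dirnames in place, which is one way to exclude subfolders. The following is only a sketch: the function name find_c_files and the default skip_dirs values are assumptions chosen for illustration.

```python
import fnmatch
import os


def find_c_files(top, pattern='*.c', skip_dirs=('.git', 'build')):
    """Walk top and collect matching files, skipping directories named in skip_dirs."""
    matches = []
    for root, dirnames, filenames in os.walk(top):
        # Editing dirnames in place tells os.walk() not to descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for filename in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(root, filename))
    return matches
```

This relies on the default top-down traversal; with topdown=False the directory list is built before you can modify it.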

Tobit answered 2/2, 2010 at 18:26 Comment(14)
For Python older than 2.2 there is os.path.walk() which is a little more fiddly to use than os.walk()Levant
@gnibbler I know that is an old comment, but my comment is just to let people know that os.path.walk() is deprecated and has been removed in Python 3.Paradisiacal
why not .endswith('.c'), I think that will be faster than fnmatch in this scenario?Diskson
@Diskson that might work in the specific case asked in this question, but it's easy to imagine someone who wants to use it with queries such as 'a*.c' etc, so I think it's worth keeping the current somewhat slow answer.Tobit
I concur that this is gold standard. For some reason, I wasn't able to import glob as root. In a normal user prompt, it worked but not on a root prompt, fedora 20, python 2.7. So fnmatch and this answer is a gift here.Laureen
For the Python 3 solution, glob skips directories beginning with . unless you code for them specifically. I think traversing the directory tree using os.walk is a more robust and ultimately simpler solution.Datestamp
For what it's worth, in my case finding 10,000+ files with glob was much slower than with os.walk, so I went with the latter solution for that reason.Trivial
For python 3.4, pathlib.Path('src').glob('**/*.c') should work.Ananthous
How could I exclude files and/or subfolders from that Path?Doenitz
@JohanDahlin I noticed you've updated your original post (suggesting os.walk and glob.glob) with answers already posted: fnmatch.filter (@Alex Martelli), glob.glob (@chris-piekarski), pathlib.Path.glob (@taleinat) and now pathlib.Path.rglob. I assume it is unintentional. May I recommend citing posts that precede your updates as best you can? Otherwise the good work by others that motivated your update may be overlooked. Thanks in keeping your posts updated.Jacobean
This will also match directories ending with .c. If you really want only files you must test for that.Gummous
For those reading this (it caught my teammate): pathlib will include and traverse into "hidden" files and directories (names that start with .), but glob will not. If this is undesired behavior, use os.walk() and filter out those items, or use glob.Nutritionist
NOTE: It doesn't follow symlinks, but glob.glob with recursive=True does! Too bad there is no flag like follow_symlinks=TrueSardou
Note that Path.rglob is like calling Path.glob with “**/” added in front of the given relative pattern.Jelly
G
216

For python >= 3.5 you can use **, recursive=True, i.e.:

import glob
for f in glob.glob('/path/**/*.c', recursive=True):
    print(f)

If recursive is True (default False), the pattern ** will match any files and zero or more directories and subdirectories. If the pattern is followed by an os.sep, only directories and subdirectories match.
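To illustrate the last sentence of that quote, here is a small sketch that lists only directories by ending the pattern with a path separator. The helper name list_subdirectories is hypothetical.

```python
import glob
import os


def list_subdirectories(top):
    """Return directories found under top by ending the '**' pattern with os.sep."""
    # os.path.join(top, '**', '') produces a pattern with a trailing separator,
    # so regular files are never matched.
    pattern = os.path.join(top, '**', '')
    return sorted(glob.glob(pattern, recursive=True))


for directory in list_subdirectories('src'):
    print(directory)
```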



Gough answered 25/8, 2019 at 9:45 Comment(3)
This works better than pathlib.Path('./path/').glob('*/') because it also so in folder with size of 0Personify
In Python 3.9.1, recursive is set to False by default.Heart
recursive is also set to False by default in Python 3.8.*.Maladminister
W
123

Similar to other solutions, but using fnmatch.fnmatch instead of glob, since os.walk already listed the filenames:

import os, fnmatch


def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename


for filename in find_files('src', '*.c'):
    print('Found C source:', filename)

Also, using a generator allows you to process each file as it is found, instead of finding all the files and then processing them.
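One consequence of the lazy approach is that you can stop early. The sketch below repeats the generator so it runs on its own, then takes only the first match or the first few; the names first_match and first_three are illustrative.

```python
import fnmatch
import os
from itertools import islice


def find_files(directory, pattern):
    """Lazily yield paths under directory whose basename matches pattern."""
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                yield os.path.join(root, basename)


# Take the first match only; the walk stops as soon as it is found.
# None is returned when nothing matches.
first_match = next(find_files('src', '*.c'), None)

# Or take at most three matches without scanning the rest of the tree.
first_three = list(islice(find_files('src', '*.c'), 3))
```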

Ward answered 2/2, 2010 at 18:44 Comment(0)
D
93

I've modified the glob module to support ** for recursive globbing, e.g.:

>>> import glob2
>>> all_header_files = glob2.glob('src/**/*.c')

https://github.com/miracle2k/python-glob2/

Useful when you want to provide your users with the ability to use the ** syntax, and thus os.walk() alone is not good enough.

Decontrol answered 26/6, 2011 at 14:14 Comment(4)
Can we make this stop after it finds the first match? Maybe make it possible to use it as a generator rather than having it return a list of every possible result? Also, is this a DFS or a BFS? I'd much prefer a BFS, I think, so that files which are near the root are found first. +1 for making this module and providing it on GitHub/pip.Robb
The ** syntax was added to the official glob module in Python 3.5.Robb
@Robb Alright, fine. This is still useful for < 3.5.Sequel
To activate recursive globbing using ** with the official glob module, do: glob(path, recursive=True)Rhizo
E
79

Starting with Python 3.4, one can use the glob() method of one of the Path classes in the new pathlib module, which supports ** wildcards. For example:

from pathlib import Path

for file_path in Path('src').glob('**/*.c'):
    print(file_path) # do whatever you need with these files

Update: Starting with Python 3.5, the same syntax is also supported by glob.glob().

Erdmann answered 11/11, 2014 at 16:8 Comment(3)
Indeed, and it will be in Python 3.5. It was supposed to already be so in Python 3.4, but was omitted by mistake.Erdmann
This syntax is now supported by glob.glob() as of Python 3.5.Erdmann
Note that you can also use pathlib.PurePath.relative_to in combination to get relative paths. See my answer here for more context.Incontrollable
F
42
import os
import fnmatch


def recursive_glob(treeroot, pattern):
    results = []
    for base, dirs, files in os.walk(treeroot):
        goodfiles = fnmatch.filter(files, pattern)
        results.extend(os.path.join(base, f) for f in goodfiles)
    return results

fnmatch gives you exactly the same patterns as glob, so this is really an excellent replacement for glob.glob with very close semantics. An iterative version (e.g. a generator), IOW a replacement for glob.iglob, is a trivial adaptation (just yield the intermediate results as you go, instead of extending a single results list to return at the end).
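The generator adaptation described above could look roughly like this. The name recursive_iglob is an assumption, picked to mirror glob.iglob.

```python
import fnmatch
import os


def recursive_iglob(treeroot, pattern):
    """Generator counterpart of recursive_glob: yield matches one at a time."""
    for base, dirs, files in os.walk(treeroot):
        for f in fnmatch.filter(files, pattern):
            yield os.path.join(base, f)


for path in recursive_iglob('src', '*.c'):
    print(path)
```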

Frierson answered 2/2, 2010 at 18:39 Comment(2)
What do you think about using recursive_glob(pattern, treeroot='.') as I suggested in my edit? This way, it can be called for example as recursive_glob('*.txt') and intuitively match the syntax of glob.Contemn
@ChrisRedford, I see it as a pretty minor issue either way. As it stands now, it matches the "files then pattern" argument order of fnmatch.filter, which is roughly as useful as the possibility of matching single-argument glob.glob.Frierson
A
22

You'll want to use os.walk to collect filenames that match your criteria. For example:

import os
cfiles = []
for root, dirs, files in os.walk('src'):
  for file in files:
    if file.endswith('.c'):
      cfiles.append(os.path.join(root, file))
Ashkhabad answered 2/2, 2010 at 18:24 Comment(0)
G
17

Here's a solution with nested list comprehensions, os.walk and simple suffix matching instead of glob:

import os
cfiles = [os.path.join(root, filename)
          for root, dirnames, filenames in os.walk('src')
          for filename in filenames if filename.endswith('.c')]

It can be compressed to a one-liner:

import os;cfiles=[os.path.join(r,f) for r,d,fs in os.walk('src') for f in fs if f.endswith('.c')]

or generalized as a function:

import os

def recursive_glob(rootdir='.', suffix=''):
    return [os.path.join(looproot, filename)
            for looproot, _, filenames in os.walk(rootdir)
            for filename in filenames if filename.endswith(suffix)]

cfiles = recursive_glob('src', '.c')

If you do need full glob style patterns, you can follow Alex's and Bruno's example and use fnmatch:

import fnmatch
import os

def recursive_glob(rootdir='.', pattern='*'):
    return [os.path.join(looproot, filename)
            for looproot, _, filenames in os.walk(rootdir)
            for filename in filenames
            if fnmatch.fnmatch(filename, pattern)]

cfiles = recursive_glob('src', '*.c')
Greasewood answered 2/11, 2011 at 8:10 Comment(0)
J
11

Consider pathlib.Path.rglob().

This is like calling Path.glob() with "**/" added in front of the given relative pattern:

import pathlib


for p in pathlib.Path("src").rglob("*.c"):
    print(p)
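A quick sketch to check that equivalence on your own tree, plus a filter that keeps only regular files (a directory named something like foo.c would otherwise match too). The helper names same_matches and c_source_files are hypothetical.

```python
from pathlib import Path


def same_matches(top, pattern):
    """Check that rglob(pattern) and glob('**/' + pattern) find the same paths."""
    top = Path(top)
    return sorted(top.rglob(pattern)) == sorted(top.glob('**/' + pattern))


def c_source_files(top):
    """Yield only regular files matching *.c, skipping directories with that suffix."""
    return (p for p in Path(top).rglob('*.c') if p.is_file())


print(same_matches('src', '*.c'))
for p in c_source_files('src'):
    print(p)
```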

See also @taleinat's related post here and a similar post elsewhere.

Jacobean answered 23/5, 2019 at 12:11 Comment(0)
C
11
import os, glob

for each in glob.glob('path/**/*.c', recursive=True):
    print(f'Name with path: {each} \nName without path: {os.path.basename(each)}')
  1. glob.glob('*.c'): matches all files ending in .c in the current directory
  2. glob.glob('*/*.c'): matches all files ending in .c in the immediate subdirectories only, but not in the current directory
  3. glob.glob('**/*.c'): same as 2 (without recursive=True, ** behaves like *)
  4. glob.glob('*.c', recursive=True): same as 1
  5. glob.glob('*/*.c', recursive=True): same as 2
  6. glob.glob('**/*.c', recursive=True): matches all files ending in .c in the current directory and in all subdirectories
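If you want to check what each pattern matches on your own system, one option is to run them all against a small throwaway tree. This is only a sketch; the helper name matches and the file names are made up for the demonstration.

```python
import glob
import os
import tempfile


def matches(top, pattern, recursive=False):
    """Return matches for pattern under top, as sorted paths relative to top."""
    found = glob.glob(os.path.join(top, pattern), recursive=recursive)
    return sorted(os.path.relpath(p, top) for p in found)


with tempfile.TemporaryDirectory() as top:
    # Hypothetical tree: top.c, sub/one.c, sub/deep/two.c
    os.makedirs(os.path.join(top, 'sub', 'deep'))
    for rel in ('top.c',
                os.path.join('sub', 'one.c'),
                os.path.join('sub', 'deep', 'two.c')):
        open(os.path.join(top, rel), 'w').close()

    # Print the result of every pattern with and without recursive=True.
    for pattern in ('*.c', '*/*.c', '**/*.c'):
        for recursive in (False, True):
            print(pattern, recursive, matches(top, pattern, recursive))
```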
Collop answered 3/8, 2020 at 5:10 Comment(0)
O
9

In case this may interest anyone, I've profiled the top three proposed methods. I have about 500K files in the globbed folder (in total), and 2K files that match the desired pattern.

Here's the (very basic) code:

import glob
import json
import fnmatch
import os
from pathlib import Path
from time import time


def find_files_iglob():
    return glob.iglob("./data/**/data.json", recursive=True)


def find_files_oswalk():
    for root, dirnames, filenames in os.walk('data'):
        for filename in fnmatch.filter(filenames, 'data.json'):
            yield os.path.join(root, filename)

def find_files_rglob():
    return Path('data').rglob('data.json')

t0 = time()
for f in find_files_oswalk(): pass    
t1 = time()
for f in find_files_rglob(): pass
t2 = time()
for f in find_files_iglob(): pass 
t3 = time()
print(t1-t0, t2-t1, t3-t2)

And the results I got were:

  • os.walk: ~3.6 sec
  • rglob: ~14.5 sec
  • iglob: ~16.9 sec

Platform: Ubuntu 16.04, x86_64 (Core i7).

Orly answered 13/6, 2020 at 17:39 Comment(1)
Thank you for the benchmark. I ran this on 10k files with Python 3.9.12 and the rankings are the same as in this benchmark (os.walk is fastest), although the difference is not as extreme as it is in your example.Preoccupied
Z
7

Recently I had to recover my pictures with the extension .jpg. I ran photorec and recovered 4579 directories containing 2.2 million files with a tremendous variety of extensions. With the script below I was able to select the 50133 files with a .jpg extension within minutes:

#!/usr/bin/env python2.7

import glob
import shutil
import os

src_dir = "/home/mustafa/Masaüstü/yedek"
dst_dir = "/home/mustafa/Genel/media"
for mediafile in glob.iglob(os.path.join(src_dir, "*", "*.jpg")): #"*" is for subdirectory
    shutil.copy(mediafile, dst_dir)
Zincograph answered 5/1, 2013 at 10:36 Comment(0)
U
6

Based on other answers, this is my current working implementation, which retrieves nested XML files in a root directory:

import glob
import os

files = []
for root, dirnames, filenames in os.walk(myDir):  # myDir: the root directory to search
    files.extend(glob.glob(os.path.join(root, "*.xml")))

I'm really having fun with python :)

Ungrounded answered 28/7, 2012 at 22:9 Comment(0)
A
6

For python 3.5 and later

import glob

#file_names_array = glob.glob('path/*.c', recursive=True)
#above works for files directly at path/ as guided by NeStack

#updated version
file_names_array = glob.glob('path/**/*.c', recursive=True)

Further, you might need:

for full_path_in_src in file_names_array:
    print(full_path_in_src)  # e.g. 'abc/xyz.c'
    # The full system path would be like 'path till src/abc/xyz.c'
Achromatism answered 21/6, 2019 at 21:8 Comment(1)
Your first line of code doesn't work for looking into subdirectories. But if you just expand it by /** it works for me, like that: file_names_array = glob.glob('src/**/*.c', recursive=True)Fable
R
5

Johan and Bruno provide excellent solutions on the minimal requirement as stated. I have just released Formic which implements Ant FileSet and Globs which can handle this and more complicated scenarios. An implementation of your requirement is:

import formic
fileset = formic.FileSet(include="/src/**/*.c")
for file_name in fileset.qualified_files():
    print file_name
Raillery answered 15/5, 2012 at 8:53 Comment(1)
Formic appears to be abandoned?! And it does not support Python 3 (bitbucket.org/aviser/formic/issue/12/support-python-3)Hibben
T
3

Another way to do it using just the glob module. Just seed the rglob method with a starting base directory and a pattern to match and it will return a list of matching file names.

import glob
import os

def _getDirs(base):
    return [x for x in glob.iglob(os.path.join( base, '*')) if os.path.isdir(x) ]

def rglob(base, pattern):
    results = []
    results.extend(glob.glob(os.path.join(base, pattern)))
    # _getDirs() already returns paths prefixed with base, so recurse on them directly
    for d in _getDirs(base):
        results.extend(rglob(d, pattern))
    return results
Tylertylosis answered 13/9, 2011 at 22:59 Comment(0)
T
3

Or with a list comprehension:

 >>> import os
 >>> root = r"c:\User\xtofl"
 >>> binfiles = [os.path.join(base, f)
 ...             for base, _, files in os.walk(root)
 ...             for f in files if f.endswith(".jpg")]
Twoup answered 24/6, 2013 at 10:41 Comment(0)
S
3

If the files are on a remote file system or inside an archive, you can use an implementation of the fsspec AbstractFileSystem class. For example, to list all the files in a zipfile:

from fsspec.implementations.zip import ZipFileSystem
fs = ZipFileSystem("/tmp/test.zip")
fs.glob("/**")  # equivalent: fs.find("/")

or to list all the files in a publicly available S3 bucket:

from s3fs import S3FileSystem
fs_s3 = S3FileSystem(anon=True)
fs_s3.glob("noaa-goes16/ABI-L1b-RadF/2020/045/**")  # or use fs_s3.find

you can also use it for a local filesystem, which may be interesting if your implementation should be filesystem-agnostic:

from fsspec.implementations.local import LocalFileSystem
fs = LocalFileSystem()
fs.glob("/tmp/test/**")

Other implementations include Google Cloud, Github, SFTP/SSH, Dropbox, and Azure. For details, see the fsspec API documentation.

Seften answered 8/10, 2020 at 13:49 Comment(0)
P
2

I just made this. It prints files and directories in a hierarchical way.

I didn't use fnmatch or walk, though.

#!/usr/bin/python

import os,glob,sys

def dirlist(path, c=1):
    for i in glob.glob(os.path.join(path, "*")):
        if os.path.isfile(i):
            filepath, filename = os.path.split(i)
            print('----' * c + filename)
        elif os.path.isdir(i):
            dirname = os.path.basename(i)
            print('----' * c + dirname)
            dirlist(i, c + 1)  # indent children one level deeper


path = os.path.normpath(sys.argv[1])
print(os.path.basename(path))
dirlist(path)
Pascoe answered 27/7, 2013 at 18:12 Comment(0)
A
2

This one uses fnmatch or a regular expression:

import fnmatch, os

def filepaths(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in files:
            try:
                matched = pattern.match(basename)
            except AttributeError:
                matched = fnmatch.fnmatch(basename, pattern)
            if matched:
                yield os.path.join(root, basename)

# usage
if __name__ == '__main__':
    from pprint import pprint as pp
    import re
    path = r'/Users/hipertracker/app/myapp'
    pp([x for x in filepaths(path, re.compile(r'.*\.py$'))])
    pp([x for x in filepaths(path, '*.py')])
Antiknock answered 2/8, 2013 at 16:1 Comment(0)
C
2

In addition to the suggested answers, you can do this with some lazy generation and list comprehension magic:

import os, glob, itertools

results = itertools.chain.from_iterable(glob.iglob(os.path.join(root,'*.c'))
                                               for root, dirs, files in os.walk('src'))

for f in results: print(f)

Besides fitting in one line and avoiding unnecessary lists in memory, this also has the nice side effect, that you can use it in a way similar to the ** operator, e.g., you could use os.path.join(root, 'some/path/*.c') in order to get all .c files in all sub directories of src that have this structure.
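As a sketch of that last idea, the expression can be wrapped in a helper that takes the sub-pattern as a parameter. The name walk_glob is hypothetical, and the directory names in the usage line are taken from the question's example tree.

```python
import glob
import itertools
import os


def walk_glob(top, subpattern):
    """Lazily apply a glob pattern relative to every directory at or below top."""
    return itertools.chain.from_iterable(
        glob.iglob(os.path.join(root, subpattern))
        for root, dirs, files in os.walk(top))


# Match .c files that sit in a nested/files directory at any depth below src.
for f in walk_glob('src', os.path.join('nested', 'files', '*.c')):
    print(f)
```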

Canaletto answered 5/12, 2015 at 17:42 Comment(0)
I
2

This is working code for Python 2.7. As part of my devops work, I was required to write a script which would move the config files marked with live-appName.properties to appName.properties. There could be files with other extensions as well, like live-appName.xml.

Below is the working code, which finds the files in the given directory (at any nesting level) and then renames (moves) each one to the required filename:

import fnmatch
import os
import shutil

def flipProperties(searchDir):
    print("Flipping properties to point to live DB")
    for root, dirnames, filenames in os.walk(searchDir):
        for filename in fnmatch.filter(filenames, 'live-*.*'):
            targetFileName = os.path.join(root, filename.split("live-")[1])
            print("File " + os.path.join(root, filename) + " will be moved to " + targetFileName)
            shutil.move(os.path.join(root, filename), targetFileName)

This function is called from a main script

flipProperties(searchDir)

Hope this helps someone struggling with similar issues.

Ireneirenic answered 3/4, 2020 at 10:3 Comment(0)
P
1

Simplified version of Johan Dahlin's answer, without fnmatch.

import os

matches = []
for root, dirnames, filenames in os.walk('src'):
  matches += [os.path.join(root, f) for f in filenames if f[-2:] == '.c']
Phraseology answered 3/6, 2013 at 1:29 Comment(0)
N
1

Here is my solution using list comprehension to search for multiple file extensions recursively in a directory and all subdirectories:

import os, glob

def _globrec(path, *exts):
    """ Glob recursively a directory and all subdirectories for multiple file extensions
    Note: On Windows, glob is case-insensitive, i.e. for '*.jpg' you will get files
    ending with .jpg and .JPG

    Parameters
    ----------
    path : str
        A directory name
    exts : tuple
        Glob patterns for the file extensions, e.g. '*.jpg'

    Returns
    -------
    files : list
        list of files matching extensions in exts in path and subfolders

    """
    dirs = [a[0] for a in os.walk(path)]
    f_filter = [os.path.join(d, e) for d in dirs for e in exts]
    return [f for files in [glob.iglob(files) for files in f_filter] for f in files]

my_pictures = _globrec(r'C:\Temp', '*.jpg', '*.bmp', '*.png', '*.gif')
for f in my_pictures:
    print(f)
Netty answered 18/8, 2014 at 17:50 Comment(0)
R
0
import sys, os, glob

dir_list = ["c:\\books\\heap"]

while len(dir_list) > 0:
    cur_dir = dir_list[0]
    del dir_list[0]
    list_of_files = glob.glob(cur_dir+'\\*')
    for book in list_of_files:
        if os.path.isfile(book):
            print(book)
        else:
            dir_list.append(book)
Rosinarosinante answered 27/1, 2014 at 19:3 Comment(0)
P
0

I modified the top answer in this posting.. and recently created this script which will loop through all files in a given directory (searchdir) and the sub-directories under it... and prints filename, rootdir, modified/creation date, and size.

Hope this helps someone... and they can walk the directory and get fileinfo.

import time
import fnmatch
import os

def fileinfo(file):
    filename = os.path.basename(file)
    rootdir = os.path.dirname(file)
    lastmod = time.ctime(os.path.getmtime(file))
    creation = time.ctime(os.path.getctime(file))
    filesize = os.path.getsize(file)

    print("%s**\t%s\t%s\t%s\t%s" % (rootdir, filename, lastmod, creation, filesize))

searchdir = r'D:\Your\Directory\Root'
matches = []

for root, dirnames, filenames in os.walk(searchdir):
    ##  for filename in fnmatch.filter(filenames, '*.c'):
    for filename in filenames:
        ##      matches.append(os.path.join(root, filename))
        ##print matches
        fileinfo(os.path.join(root, filename))
Paranoiac answered 15/11, 2014 at 13:39 Comment(0)
H
0

Here is a solution that will match the pattern against the full path and not just the base filename.

It uses fnmatch.translate to convert a glob-style pattern into a regular expression, which is then matched against the full path of each file found while walking the directory.

re.IGNORECASE is optional, but desirable on Windows since the file system itself is not case-sensitive. (I didn't bother compiling the regex because docs indicate it should be cached internally.)

import fnmatch
import os
import re

def findfiles(dir, pattern):
    patternregex = fnmatch.translate(pattern)
    for root, dirs, files in os.walk(dir):
        for basename in files:
            filename = os.path.join(root, basename)
            if re.search(patternregex, filename, re.IGNORECASE):
                yield filename
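A variant sketch of the same idea compiles the translated pattern once and anchors it at the start of the path with match(). The name find_by_path and the ignore_case parameter are assumptions for illustration. Keep in mind that in fnmatch patterns, * also matches path separators.

```python
import fnmatch
import os
import re


def find_by_path(top, pattern, ignore_case=False):
    """Yield files under top whose whole path (not just the basename) matches pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    # Translate the glob-style pattern once, outside the loop.
    regex = re.compile(fnmatch.translate(pattern), flags)
    for root, dirs, files in os.walk(top):
        for basename in files:
            path = os.path.join(root, basename)
            if regex.match(path):
                yield path


# Only .c files that have a directory called "nested" somewhere in their path.
for path in find_by_path('src', os.path.join('*', 'nested', '*.c')):
    print(path)
```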
Hoofer answered 30/6, 2015 at 15:39 Comment(0)
R
-1

I needed a solution for python 2.x that works fast on large directories.
I ended up with this:

import subprocess
foundfiles= subprocess.check_output("ls src/*.c src/**/*.c", shell=True)
for foundfile in foundfiles.splitlines():
    print foundfile

Note that you might need some exception handling in case ls doesn't find any matching file.

Ragtime answered 23/6, 2017 at 10:20 Comment(3)
I just realized that ls src/**/*.c only works if globstar option is enabled (shopt -s globstar) - see this answer for details.Ragtime
A subprocess is never a good solution if you want to go fast, and ls in scripts is definitely something to avoid.Revenuer
Ok, I didn't know about this. It works for me - and takes less than a second (instead of more than 30 seconds...)Ragtime
