Get a filtered list of files in a directory
A

14

434

I am trying to get a list of files in a directory using Python, but I do not want a list of ALL the files.

What I essentially want is the ability to do something like the following but using Python and not executing ls.

ls 145592*.jpg

If there is no built-in method for this, I am currently thinking of writing a for loop to iterate through the results of an os.listdir() and to append all the matching files to a new list.

However, there are a lot of files in that directory and therefore I am hoping there is a more efficient method (or a built-in method).

Absinthism answered 8/2, 2010 at 23:2 Comment(2)
[This link might help you :) Get a filtered list of files in a directory ](codereview.stackexchange.com/a/33642)Casabonne
Note that you might take special care about sorting order if this is important for your application.Theoretical
S
609
import glob

jpgFilenamesList = glob.glob('145592*.jpg')

See glob in the Python documentation.

Sym answered 8/2, 2010 at 23:5 Comment(14)
Oh, I just noticed that the Python docs say glob() "is done by using the os.listdir() and fnmatch.fnmatch() functions in concert, and not by actually invoking a subshell". In other words, glob() doesn't have the efficiency improvements one might expect.Profane
There is one main difference: glob.glob('145592*.jpg') prints the whole absolute path of files while ls 145592*.jpg prints only the list of files.Rebak
@Ben Why would invoking a subshell (subprocess) have any efficiency improvements?Decor
@PauloNeves: true, my comment above doesn't make sense to me 7 years later either. :-) I'm guessing I was referring to the fact that glob() just uses listdir+fnmatch, rather than special operating system calls to do the wildcard filtering. For example, on Windows the FindFirstFile API allows you to specify wildcards so the OS does the filtering directly, and presumably more efficiently (I don't think there's an equivalent on Linux).Profane
@IgnacioVazquez-Abrams Oh, duh, thanks! And that's probably what I was referring to in the first place...Profane
What directory does this search?Crapulous
@marsh: As always, the process's current working directory.Sym
@ÉbeIsaac And how can you filter not just by files named 145592*.jpg, but all files that fulfil (145592*.jpg OR 145592*.png OR 145592*.gif)?Minnaminnaminnie
Be aware that the result is not sorted by alphabetical order or anything like that.Theoretical
Don't forget to use import globTrochilus
where do you specify the directory?Ovotestis
Does glob work in Windows too?Finite
How does it know which directory to search?Endurant
@Endurant if you do not want to look in the current working directory, then provide the path as part of the filename. You can use either a relative or an absolute path, and even have wildcards in the path itself if needed.Rosenberry
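Several of the comments above ask where the directory goes, how to match more than one extension, and whether the result is sorted. A minimal sketch that covers all three, assuming a hypothetical helper named list_matching (it is not part of glob) and Python 3.4+ for glob.escape():

```python
import glob
import os


def list_matching(directory, patterns):
    """Return the sorted base names in `directory` that match any of `patterns`."""
    matches = set()
    for pattern in patterns:
        # glob.escape() keeps wildcard characters in the directory name literal
        full_pattern = os.path.join(glob.escape(directory), pattern)
        # glob.glob() returns each path with the directory prefix; keep the name only
        matches.update(os.path.basename(p) for p in glob.glob(full_pattern))
    # glob makes no ordering promise, so sort explicitly
    return sorted(matches)
```

For example, list_matching('/path/to/dir', ['145592*.jpg', '145592*.png', '145592*.gif']) returns bare file names. Drop the os.path.basename() call to keep the directory prefix that glob.glob() includes whenever the pattern contains one.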
P
183

glob.glob() is definitely the way to do it (as per Ignacio). However, if you do need more complicated matching, you can do it with a list comprehension and re.match(), something like so:

import os, re
files = [f for f in os.listdir('.') if re.match(r'[0-9]+.*\.jpg', f)]

More flexible, but as you note, less efficient.

Profane answered 9/2, 2010 at 0:27 Comment(2)
This definitely seems to be more powerful. For example, having to do something like [0-9]+Storekeeper
Yes, definitely more powerful -- however fnmatch does support [0123456789] sequences (see docs), and it also has the fnmatch.filter() function which makes this loop slightly more efficient.Profane
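One detail worth knowing about the regex approach: re.match() only anchors at the start of the name, so a file such as 1.jpg.bak would also pass. A small sketch that anchors both ends with fullmatch(), assuming a hypothetical helper named regex_filter and the same pattern as in the answer:

```python
import os
import re

# One or more digits, then anything, then a literal ".jpg" at the very end
JPG_WITH_NUMERIC_PREFIX = re.compile(r'[0-9]+.*\.jpg')


def regex_filter(names, pattern=JPG_WITH_NUMERIC_PREFIX):
    # fullmatch() requires the whole name to match, not just a prefix of it
    return [name for name in names if pattern.fullmatch(name)]


files = regex_filter(os.listdir('.'))
```

Compiling the pattern once is a convenience here; the filtering itself is still a plain pass over the os.listdir() result.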
C
88

Keep it simple:

import os
relevant_path = "[path to folder]"
included_extensions = ['jpg','jpeg', 'bmp', 'png', 'gif']
file_names = [fn for fn in os.listdir(relevant_path)
              if any(fn.endswith(ext) for ext in included_extensions)]

I prefer this form of list comprehensions because it reads well in English.

I read the fourth line as: For each fn in os.listdir for my path, give me only the ones that match any one of my included extensions.

It may be hard for novice Python programmers to get used to list comprehensions for filtering, and they can have some memory overhead for very large data sets, but for listing a directory and other simple string-filtering tasks, list comprehensions lead to cleaner, more documentable code.

The only thing about this design is that it doesn't protect you against making the mistake of passing a string instead of a list. For example if you accidentally convert a string to a list and end up checking against all the characters of a string, you could end up getting a slew of false positives.

But it's better to have a problem that's easy to fix than a solution that's hard to understand.

Coh answered 13/1, 2014 at 16:27 Comment(3)
Not that there is any need for any() here, because str.endswith() accepts a tuple of endings. if fn.endswith(tuple(included_extensions)) is more than enough.Subcontract
Apart from the inefficiency of not using str.endswith(seq) that Martijn pointed out, this is not correct, because a file has to end with .ext for it to have that extension. This code will also find (for example) a file called "myjpg" or a directory named just "png". To fix, just prefix each extension in included_extensions with a dot (.).Profane
I'm always a bit wary of code in answers which obviously hasn't been run or can't run. The variable included_extensions vs included_extentsions? A pity because otherwise this is my preferred answer.Vernavernacular
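Putting the two corrections from the comments together, a sketch might look like this. The helper names has_included_extension and list_images are hypothetical, and the lower() call is an added assumption (case-insensitive extensions) that the original answer does not make:

```python
import os

# Leading dots stop names like "myjpg" or a directory called "png" from matching
INCLUDED_EXTENSIONS = ('.jpg', '.jpeg', '.bmp', '.png', '.gif')


def has_included_extension(name, extensions=INCLUDED_EXTENSIONS):
    # str.endswith() accepts a tuple of suffixes, so any() is not needed;
    # lower() is an assumption so that "PHOTO.JPG" is accepted as well
    return name.lower().endswith(extensions)


def list_images(folder):
    return [fn for fn in os.listdir(folder) if has_included_extension(fn)]
```

Note that str.endswith() needs a tuple; passing a list raises TypeError.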
C
62

Another option:

>>> import os, fnmatch
>>> fnmatch.filter(os.listdir('.'), '*.py')
['manage.py']

https://docs.python.org/3/library/fnmatch.html

Coracorabel answered 28/1, 2016 at 11:55 Comment(3)
This is exactly what glob does on a single line.Laoag
Only difference is glob returns the full path as opposed to os.listdir just returning the file name. At least this is what is happening in Python 2.Rhoades
A very nice solution, especially for those who are already using fnmatch and os in their script and don't want to import another module, i.e. glob.Castellated
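As the comments point out, fnmatch.filter() works on bare names, so the directory has to be joined back on if full paths are wanted. A sketch of that, assuming hypothetical helpers filter_names and list_paths, and using fnmatchcase() so that matching is case-sensitive on every platform (fnmatch.filter() normalizes case on Windows):

```python
import fnmatch
import os


def filter_names(names, pattern):
    # fnmatchcase() compares case-sensitively regardless of the operating system
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def list_paths(directory, pattern):
    # os.listdir() returns names only, so rebuild the path for each match
    return [os.path.join(directory, n)
            for n in filter_names(os.listdir(directory), pattern)]
```

Character classes work too, for example filter_names(names, '[0-9]*.py') keeps only names that start with a digit.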
A
39

Filter with glob module:

Import glob

import glob

Wild Cards:

files=glob.glob("data/*")
print(files)

Out:

['data/ks_10000_0', 'data/ks_1000_0', 'data/ks_100_0', 'data/ks_100_1',
'data/ks_100_2', 'data/ks_106_0', 'data/ks_19_0', 'data/ks_200_0', 'data/ks_200_1', 
'data/ks_300_0', 'data/ks_30_0', 'data/ks_400_0', 'data/ks_40_0', 'data/ks_45_0', 
'data/ks_4_0', 'data/ks_500_0', 'data/ks_50_0', 'data/ks_50_1', 'data/ks_60_0', 
'data/ks_82_0', 'data/ks_lecture_dp_1', 'data/ks_lecture_dp_2']

Filter extension .txt:

files = glob.glob("/home/ach/*/*.txt")

A single character

glob.glob("/home/ach/file?.txt")

Number Ranges

glob.glob("/home/ach/*[0-9]*")

Alphabet Ranges

glob.glob("/home/ach/[a-c]*")
Appropriate answered 31/3, 2019 at 9:0 Comment(0)
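To see what each wildcard accepts without touching the file system, the same rules can be tried on a plain list of names. glob applies fnmatch-style matching to each path component (apart from not matching names that start with a dot by default), so this sketch uses fnmatch.fnmatchcase(); the sample names are made up for illustration:

```python
import fnmatch

# Hypothetical file names, chosen only to exercise each wildcard
names = ['file1.txt', 'file12.txt', 'fileA.txt', 'apple', 'cherry', 'date', 'ks_19_0']


def matching(pattern):
    # fnmatchcase() applies the *, ? and [] rules to bare names, case-sensitively
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


single_char = matching('file?.txt')  # exactly one character after "file"
with_digit = matching('*[0-9]*')     # at least one digit anywhere in the name
a_to_c = matching('[a-c]*')          # first character is a, b or c
```

Here single_char excludes file12.txt because ? stands for exactly one character, and a_to_c excludes date because d is outside the range.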
B
21

Preliminary code

import glob
import fnmatch
import pathlib
import os

pattern = '*.py'
path = '.'

Solution 1 - use "glob"

# lookup in current dir
glob.glob(pattern)

In [2]: glob.glob(pattern)
Out[2]: ['wsgi.py', 'manage.py', 'tasks.py']

Solution 2 - use "os" + "fnmatch"

Variant 2.1 - Lookup in current dir

# lookup in current dir
fnmatch.filter(os.listdir(path), pattern)

In [3]: fnmatch.filter(os.listdir(path), pattern)
Out[3]: ['wsgi.py', 'manage.py', 'tasks.py']

Variant 2.2 - Lookup recursive

# lookup recursive
for dirpath, dirnames, filenames in os.walk(path):

    if not filenames:
        continue

    pythonic_files = fnmatch.filter(filenames, pattern)
    if pythonic_files:
        for file in pythonic_files:
            print('{}/{}'.format(dirpath, file))

Result

./wsgi.py
./manage.py
./tasks.py
./temp/temp.py
./apps/diaries/urls.py
./apps/diaries/signals.py
./apps/diaries/actions.py
./apps/diaries/querysets.py
./apps/library/tests/test_forms.py
./apps/library/migrations/0001_initial.py
./apps/polls/views.py
./apps/polls/formsets.py
./apps/polls/reports.py
./apps/polls/admin.py

Solution 3 - use "pathlib"

# lookup in current dir
path_ = pathlib.Path('.')
tuple(path_.glob(pattern))

# lookup recursive
tuple(path_.rglob(pattern))

Notes:

  1. Tested on Python 3.4.
  2. The "pathlib" module was only added in Python 3.4.
  3. Python 3.5 added recursive lookup to glob.glob: https://docs.python.org/3.5/library/glob.html#glob.glob. Since my machine has Python 3.4 installed, I have not tested that.
Bengurion answered 12/11, 2016 at 19:32 Comment(0)
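For completeness, the recursive glob.glob mentioned in note 3 can be sketched as follows. This assumes Python 3.5 or later, and find_recursive is a hypothetical helper name:

```python
import glob
import os


def find_recursive(root, pattern):
    # "**" matches zero or more directory levels, but only when recursive=True
    full_pattern = os.path.join(glob.escape(root), '**', pattern)
    # glob gives no ordering guarantee, so sort for a stable result
    return sorted(glob.glob(full_pattern, recursive=True))
```

Calling find_recursive('.', '*.py') should find the same files as the os.walk variant above, including the ones directly in the starting directory.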
A
14

You can use pathlib, which is available in the standard library in Python 3.4 and above.

from pathlib import Path

files = [f for f in Path.cwd().iterdir() if f.match("145592*.jpg")]
Achorn answered 17/9, 2019 at 19:55 Comment(1)
Alternatively, just use Path.cwd().glob("145592*.jpg")... Anyway this should definitely be higher on this page. pathlib is the way to go.Farfamed
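A sketch of the Path.glob() variant from the comment, with two additions that are assumptions rather than part of the answer: directories are skipped with is_file(), and the result is sorted. matching_files is a hypothetical helper name:

```python
from pathlib import Path


def matching_files(directory, pattern):
    # Path.glob() yields Path objects lazily; is_file() skips directories
    # whose names happen to match the pattern
    return sorted(p.name for p in Path(directory).glob(pattern) if p.is_file())
```

For the question's case this would be matching_files('.', '145592*.jpg'). Use p instead of p.name to keep Path objects.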
D
10

Use os.walk to recursively list your files:

import os
root = "/home"
pattern = "145992"
alist_filter = ['jpg','bmp','png','gif'] 
path=os.path.join(root,"mydir_to_scan")
for r,d,f in os.walk(path):
    for file in f:
        if file[-3:] in alist_filter and pattern in file:
            print(os.path.join(r, file))
Dehydrate answered 9/2, 2010 at 1:46 Comment(2)
No need to slice; file.endswith(alist_filter) is enough.Subcontract
We have to use any(file.endswith(filter) for filter in alist_filter) as endswith() does not allow list as a parameter.Fuegian
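The two comments above can be reconciled: str.endswith() rejects a list but accepts a tuple. A sketch of the walk as a generator, assuming a hypothetical helper named walk_matching and dotted extensions so that only real suffixes match:

```python
import os


def walk_matching(top, substring, extensions):
    """Yield paths below `top` whose name contains `substring` and ends with one of `extensions`."""
    extensions = tuple(extensions)  # str.endswith() accepts a tuple, not a list
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if substring in name and name.endswith(extensions):
                # Join with dirpath (not top) so files in subdirectories get the right path
                yield os.path.join(dirpath, name)
```

Usage would be along the lines of list(walk_matching('/home/mydir_to_scan', '145992', ['.jpg', '.bmp', '.png', '.gif'])).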
I
5
import os

dir="/path/to/dir"
[x[0]+"/"+f for x in os.walk(dir) for f in x[2] if f.endswith(".jpg")]

This will give you a list of jpg files with their full path. You can replace x[0]+"/"+f with f for just filenames. You can also replace f.endswith(".jpg") with whatever string condition you wish.

Impeachment answered 19/11, 2016 at 13:47 Comment(0)
M
4

You might also like a more high-level approach (which I have implemented and packaged as findtools):

from findtools.find_files import (find_files, Match)


# Recursively find all *.txt files in **/home/**
txt_files_pattern = Match(filetype='f', name='*.txt')
found_files = find_files(path='/home', match=txt_files_pattern)

for found_file in found_files:
    print(found_file)

It can be installed with

pip install findtools
Midway answered 29/5, 2014 at 22:13 Comment(0)
R
3

Filenames with "jpg" and "png" extensions in "path/to/images":

import os
accepted_extensions = ["jpg", "png"]
filenames = [fn for fn in os.listdir("path/to/images") if fn.split(".")[-1] in accepted_extensions]
Ruffi answered 22/3, 2018 at 18:38 Comment(1)
This is very similar to the answer given by @ramsey0Wernick
A
2

You can simplify it using List Comprehensions and a regex checker inside it to include image files with the specified postfix.

import re
import os

dir_name = "."
files = [os.path.join(dir_name, f) for f in os.listdir(dir_name) if re.match(r'.*\.(jpg|jpeg|png)$', f)]
Aggrade answered 20/7, 2022 at 19:22 Comment(1)
Please, add a brief explanation of how/why it solves the problem.Membership
Z
1

You can define a pattern and check for it. Here I take both a start and an end pattern and look for them in the filename. FILES contains the list of all the files in a directory.

import os
PATTERN_START = "145592"
PATTERN_END = ".jpg"
CURRENT_DIR = os.path.dirname(os.path.realpath(__file__))
for r,d,FILES in os.walk(CURRENT_DIR):
    for FILE in FILES:
        if FILE.startswith(PATTERN_START) and FILE.endswith(PATTERN_END):
            print(FILE)
Zeuxis answered 14/11, 2019 at 5:42 Comment(3)
PATTERN_START should be used as FILE.startswith(PATTERN_START) and PATTERN_END should be used as FILE.endswith(PATTERN_END) to avoid any other file name combination. For example, the above code will also allow a file named jpg_sample_145592, which is not correct.Fuegian
I think it should be if FILE.startswith(PATTERN_START) and FILE.endswith(PATTERN_END):Fuegian
A Python solution using 2 for-loops is just not OK, not even in 2019. There are multiple ways of doing it without the nested for-loops. See e.g. solution by @ramsey0.Marcenemarcescent
A
-3

You can use subprocess.check_output() as follows:

import subprocess

list_files = subprocess.check_output("ls 145992*.jpg", shell=True) 

Of course, the string between quotes can be anything you want to execute in the shell, and store the output.

Abalone answered 28/9, 2016 at 13:8 Comment(1)
Only one problem. ls's output should not be parsed.Camporee
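If the aim is simply to avoid both the shell and parsing ls output while staying cheap on large directories, os.scandir() is one option. This is a sketch under assumptions: Python 3.6+ (for using scandir as a context manager), and scan_matching is a hypothetical helper name:

```python
import fnmatch
import os


def scan_matching(directory, pattern):
    # os.scandir() yields DirEntry objects, whose is_file() can often be
    # answered without an extra system call per entry
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries
                      if entry.is_file() and fnmatch.fnmatch(entry.name, pattern))
```

For the question's case: scan_matching('.', '145592*.jpg').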
