os.walk to crawl through folder structure
I have some code that looks at a single folder and pulls out files, but now the folder structure has changed and I need to trawl through the folders looking for files that match.

Here is what the old code looks like:

import os

GSB_FOLDER = r'D:\Games\Gratuitous Space Battles Beta'

def get_module_data():
    module_folder = os.path.join(GSB_FOLDER, 'data', 'modules')

    filenames = [os.path.join(module_folder, f) for f in
                 os.listdir(module_folder)]

    data = [parse_file(f) for f in filenames]

    return data

But now the folder structure has changed to be like this:

  • GSB_FOLDER\data\modules
    • \folder1\data\modules
    • \folder2\data\modules
    • \folder3\data\modules

where folder1, folder2, or folder3 could be any text string.

How do I rewrite the code above to do this? I have been told about os.walk, but I'm just learning Python, so any help is appreciated.

Chorion answered 30/10, 2012 at 0:13 Comment(0)
10

Nothing much changes: you just call os.walk and it will recursively go through the directory tree and return the files in each folder, e.g.

import os

for root, dirs, files in os.walk('/tmp'):
    # 'root' is the folder currently being visited; skip it unless it is named 'modules'
    if os.path.basename(root) != 'modules':
        continue
    data = [parse_file(os.path.join(root, f)) for f in files]

Here I am checking files only in folders named 'modules'; you can change that check to do something else, e.g. match paths which have 'modules' somewhere in them with root.find('/modules') >= 0.
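Applied to the layout in the question, a minimal sketch of a rewritten get_module_data along these lines (reusing the parse_file helper and GSB_FOLDER constant from the question) might be:

import os

GSB_FOLDER = r'D:\Games\Gratuitous Space Battles Beta'

def get_module_data():
    data = []
    # visit every folder under the game directory, however deeply nested
    for root, dirs, files in os.walk(GSB_FOLDER):
        # only gather files from folders that are themselves named 'modules'
        if os.path.basename(root) != 'modules':
            continue
        data.extend(parse_file(os.path.join(root, f)) for f in files)
    return data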

Bortman answered 30/10, 2012 at 0:19 Comment(2)
Is there a way to only include folders that are named modules?Chorion
@MatGritt see the change, you can do filtering based on the root path.Bortman
2

I created a function that serves the general purpose of crawling through a directory structure and returning files and/or paths that match a pattern.

import os
import re

def directory_spider(input_dir, path_pattern="", file_pattern="", maxResults=500):
    """Walk input_dir and return up to maxResults file paths whose directory
    matches path_pattern and whose file name matches file_pattern (both regexes)."""
    file_paths = []
    if not os.path.exists(input_dir):
        raise FileNotFoundError("Could not find path: %s" % (input_dir))
    for dirpath, dirnames, filenames in os.walk(input_dir):
        if re.search(path_pattern, dirpath):
            file_list = [item for item in filenames if re.search(file_pattern, item)]
            file_path_list = [os.path.join(dirpath, item) for item in file_list]
            file_paths += file_path_list
            if len(file_paths) > maxResults:
                break
    return file_paths[0:maxResults]

Example usages:

  • directory_spider('/path/to/find') --> returns up to the first 500 files found under the path, if it exists
  • directory_spider('/path/to/find', path_pattern="", file_pattern=".py$", maxResults=10) --> returns at most 10 files whose names end in .py
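For the layout in the question, a sketch of how this could be called (the patterns here are only illustrative, and parse_file / GSB_FOLDER come from the question):

# match any ...\data\modules directory, whichever folder it sits under;
# [\\/] accepts both Windows and POSIX path separators
paths = directory_spider(GSB_FOLDER, path_pattern=r"data[\\/]modules$", file_pattern="")
data = [parse_file(p) for p in paths]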
Soutache answered 17/4, 2019 at 21:48 Comment(1)
thanks! fyi sys is imported but unusedExtremadura
1

os.walk is a nice, easy way to get the directory structure of everything inside a directory you pass it;

in your example, you could do something like this:

import os

for dirpath, dirnames, filenames in os.walk("...GSB_FOLDER"):
    # do whatever you want with these folders; here, only report ones under data\modules
    if os.path.join("data", "modules") in dirpath:
        print(dirpath, dirnames, filenames)

Try that out; it should be fairly self-explanatory how it works.
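To feed the matched files into parse_file the way the original code did, a minimal sketch extending that loop (again assuming the parse_file and GSB_FOLDER from the question) could look like this:

import os

data = []
for dirpath, dirnames, filenames in os.walk(GSB_FOLDER):
    if os.path.join("data", "modules") in dirpath:
        # dirpath is the folder, filenames are bare names, so join them back together
        data.extend(parse_file(os.path.join(dirpath, f)) for f in filenames)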

Melendez answered 30/10, 2012 at 0:34 Comment(0)
0

You can use os.walk as @Anurag has detailed, or you can try my small pathfinder library:

data = [parse_file(f) for f in pathfinder.find(GSB_FOLDER, just_files=True)]
Breather answered 30/10, 2012 at 0:26 Comment(0)
