How to get the progress of os.walk in Python?
I have a piece of code that I'm using to search for the executables of game files and return their directories. I would really like some sort of progress indicator showing how far along os.walk is. How would I accomplish such a thing?

I tried doing startpt = root.count(os.sep) and gauging off of that, but that only tells me how deep os.walk is in the directory tree.

import os

def locate(filelist, root=os.curdir): # Find a list of files, return their directories.
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in returnMatches(filelist, [k.lower() for k in files]): # returnMatches gives the names common to both lists
            yield path + "\\"
Physicalism answered 29/1, 2010 at 18:58 Comment(3)
The real question is why your os.walk is taking so long. How many files are you muddling through? What is the performance of returnMatches?Claycomb
def returnMatches(a, b): return list(set(a) & set(b)) # Returns a list of matches between the given lists. That's all returnMatches is... this only takes a couple of seconds to complete, but I'm adding polish to the program so that to people it doesn't look like my program is just doing nothing for a couple of seconds. On MY machine the entire thing takes about 10 seconds to complete, but this is going to be packaged up and run on any number of Windows machines/environmentsPhysicalism
Note about my machine: Still running a very very slow IDE drive. ;)Physicalism
6

I figured this out.

I used os.listdir to get a list of top-level directories, and then used .split on the path that os.walk returned to get the top-level directory it was currently in.

That left me with a list of top-level directories in which I could look up the index of os.walk's current directory, then compare that index with the length of the list, giving me a % complete. ;)

This doesn't give me smooth progress, because the amount of work done in each directory can vary, but a smooth progress indicator is of no concern to me. It could easily be achieved, though, by extending the path check deeper into the directory structure.

Here is the final code for getting my progress:

import os

def locateGameDirs(filelist, root=os.curdir): # Find a list of files, return directories.
    toplevel = [folder for folder in os.listdir(root) if os.path.isdir(os.path.join(root, folder))] # List of top-level directories
    fileset = set(filelist) # Build the set once instead of on every call

    progress = 0 # Default, so the root directory itself can still yield

    for path, dirs, files in os.walk(os.path.abspath(root)):

        curdir = path.split('\\')[1] # The top-level directory os.walk is currently in.

        try: # Guarded because the first entry is the root itself, which isn't in toplevel.
            youarehere = toplevel.index(curdir)
            progress = int((youarehere / float(len(toplevel))) * 100)
        except ValueError:
            pass

        for filename in returnMatches(fileset, [k.lower() for k in files]):
            yield filename, path + "\\", progress

And right now for debugging purposes I'm doing this further in the code:

for wow in locateGameDirs(["wow.exe", "firefox.exe", "vlc.exe"], "C:\\"):
    print wow

Is there a nice little way to get rid of that try/except? It seems the first iteration of path gives me nothing...

Physicalism answered 29/1, 2010 at 20:49 Comment(1)
The first iteration gives you the root. Try adding "print path" to see what I mean.Sisyphean
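A minimal sketch of what that comment points at: the first path that os.walk yields is the root itself, so you can detect it with os.path.relpath (Python 2.6+) and keep a default progress value instead of catching an exception. The names below are illustrative, not from the original post.

import os

def locateGameDirs(filelist, root=os.curdir):
    root = os.path.abspath(root)
    toplevel = [d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))]
    wanted = set(name.lower() for name in filelist)
    progress = 0 # sensible default for the root entry itself

    for path, dirs, files in os.walk(root):
        rel = os.path.relpath(path, root) # '.' for the root, 'subdir\\...' below it
        if rel != '.':
            curdir = rel.split(os.sep)[0] # top-level directory we are currently under
            if curdir in toplevel:
                progress = int(toplevel.index(curdir) * 100.0 / len(toplevel))
        for filename in files:
            if filename.lower() in wanted:
                yield filename, path + os.sep, progress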
5

It depends!

If the files and directories are distributed more or less evenly, you could show rough progress by assuming every toplevel directory is going to take the same amount of time. But if they are not distributed evenly, you cannot find out about it cheaply. You either have to know roughly how populated every directory is in advance, or you have to os.walk the entire thing twice (but that is only useful if your actual processing takes much longer than the os.walk itself does).

That is: say you have 4 toplevel directories, and each one contains 4 files. If you assume every toplevel dir takes 25% of the progress, and each file takes another 25% of the progress for that dir, you can show a nice progress indicator. But if the last subdir turns out to contain many more files than the first few, your progress indicator will have hit 75% before you find out about it. You cannot really fix that if the os.walk itself is the bottleneck (not your processing) and it's an arbitrary directory tree (not one where you know in advance roughly how long every subtree is going to take).

And of course that's assuming the cost here is about the same for every file...

Forester answered 29/1, 2010 at 19:29 Comment(0)
4

Just show an indeterminate progress bar (i.e. one of those that show a blob bouncing back and forth, or the barber-pole effect). That way users know that the program is doing something useful, but it doesn't mislead them about the time to completion.
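For example, a minimal console sketch of that idea (a simple text spinner that ticks every few directories; the names and tick interval are just illustrative):

import itertools, os, sys

def walk_with_spinner(root, tick_every=50):
    spinner = itertools.cycle('|/-\\') # "barber pole" characters
    for count, (path, dirs, files) in enumerate(os.walk(root)):
        if count % tick_every == 0: # redraw the spinner every few directories
            sys.stdout.write('\rScanning... ' + next(spinner))
            sys.stdout.flush()
        # ... do the real work with files here ...
    sys.stdout.write('\rDone.          \n')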

Thereinto answered 29/1, 2010 at 19:45 Comment(1)
Even though I figured out my problem, since the operation is so short you're probably right on this. Thanks ;)Physicalism
2

Do it in two passes: first count how many total files/folders are in the tree, and then during the second pass do actual processing.
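A rough sketch of that two-pass approach (the counting pass only tallies walk entries, i.e. directories, so it is comparatively cheap; the names are illustrative):

import os

def walk_with_progress(root):
    root = os.path.abspath(root)
    total = sum(1 for _ in os.walk(root)) # pass 1: count the directories
    for done, (path, dirs, files) in enumerate(os.walk(root), 1):
        yield path, files, int(done * 100.0 / total) # pass 2: real work, with % complete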

Combative answered 29/1, 2010 at 19:4 Comment(5)
Wouldn't this take twice as long?Easterly
That's only helpful if the processing takes significantly more time than walking the tree does. If the OP is opening each file, then it probably does. If the OP is just looking at some detail of the name then it almost certainly doesn't.Homozygote
@Omnifarious: then it's not clear why he would want to know the progress, since it'll cost more than the actual processing.Claycomb
Maybe I can get a list of the top level directories, and compare that to the top level of what os.walk is looking at? Would that require some sort of string parsing though? Or could I just use .split?Physicalism
@ThantiK, of course you could do that... but it's very specific to your own situation then. In some cases, the "top level" dir might have only one subfolder in it. In other cases, one subfolder could have 99% of the work in it, but be one of only 50 subfolders in the top one. In most cases this idea will not give useful results, though if it does for you then certainly you can do it.Hadley
0

You need to know the total number of files to do a meaningful progress indicator.
You can get the number of files like this

sum(len(files) for _, _, files in os.walk(os.path.abspath(root)))

but that is going to take some time and you probably need a progress indicator for that...

To find the number of files really quickly you'd need a filesystem which keeps track of the number of files for you.

Perhaps you can save the total from a previous run and use that as an estimate.
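A small sketch of that caching idea (the cache file name and JSON layout here are just assumptions for illustration):

import json, os

COUNT_CACHE = 'walk_count.json' # hypothetical cache file

def estimated_total(root):
    # Use the count saved by a previous run as the estimate, if there is one.
    if os.path.exists(COUNT_CACHE):
        with open(COUNT_CACHE) as f:
            return json.load(f).get(root)
    return None

def save_total(root, total):
    data = {}
    if os.path.exists(COUNT_CACHE):
        with open(COUNT_CACHE) as f:
            data = json.load(f)
    data[root] = total
    with open(COUNT_CACHE, 'w') as f:
        json.dump(data, f)

You would count entries while walking, call save_total at the end, and fall back to an indeterminate indicator whenever there is no saved estimate yet.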

Urge answered 29/1, 2010 at 19:19 Comment(2)
I don't care about the number of files. Honestly I would be happy just knowing which top directory it is in out of all the top directories. For example, I have top directories named C:\\1, C:\\2, and so on... Just saying 'You're on top-level directory x out of y' would be fine, I just don't know how to pull it off.Physicalism
I worked this out for at least getting my top-level dirs: [folder for folder in os.listdir('C:\\') if os.path.isdir(os.path.join('C:\\', folder))] Now how would I go about figuring out where os.walk is?Physicalism
0

I suggest you avoid walking the directory tree. Instead use an index-based app for quickly finding files. You can use the app's command-line interface via subprocess and find the files almost instantaneously.

On Windows, see Everything. On UNIX, check out locate. Not sure about Mac, but I'm sure there's an option there too.
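For the UNIX case, a minimal sketch of calling locate through subprocess (assumes Python 2.7+ for subprocess.check_output and an up-to-date locate database):

import subprocess

def find_with_locate(pattern):
    try:
        # locate exits with a non-zero status when nothing matches
        return subprocess.check_output(['locate', pattern]).splitlines()
    except subprocess.CalledProcessError:
        return []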

Sisyphean answered 29/1, 2010 at 20:22 Comment(5)
This is going to be a packaged executable that is being passed out to people. Not for personal use. I cannot be using things like this.Physicalism
Couldn't you just ship the search app along with your program? Possibly aided by an installer? If you really want to walk, the only options I see have already been suggested: doing two walks (one for counting, one for the actual operation), or an indeterminate progress bar that you tick after every x number of iterations.Sisyphean
No. The program is basically just a 4mb executable packed with py2exe, no reason to install a program that just searches for a list of installed games and uploads the save game files to a server.Physicalism
All it seems Everything does is... well, the same thing I'm doing. But even that program takes approximately the same time as mine does to index the files. The thing is, once this is all done I'll save the settings, so it'll only have to walk the directories on first startup and if the user chooses to scan for games again. Mainly I just wanted to know how to do it, not whether it was economical, dirty, or how I could speed up the operation.Physicalism
Well Everything also keeps its index up to date by monitoring changes to the filesystem so you only have to build the index once and you don't have to worry about the settings getting out of sync. But yeah, that's probably overkill in this case.Sisyphean
0

As I said in the comment, the performance bottleneck likely lies outside of the locate function. Your returnMatches is a fairly expensive function. I think you'd be better off replacing it with the following code:

import os

def locate(filelist, root=os.curdir):
    fileset = set(filelist)            # if possible, pass the set instead of the list as the first argument
    for path, dirs, files in os.walk(os.path.abspath(root)):
        if any(name.lower() in fileset for name in files):
            yield path + '\\'

This way you reduce the number of wasteful operations, yield at most once per matching directory (which I think is what you actually intended to do), and you can forget about the progress at the same time. I don't think progress would be an expected feature of the interface anyway.

Claycomb answered 29/1, 2010 at 20:30 Comment(2)
def returnMatches(a,b): return list(set(a) & set(b)) And I tried your method posted here. It didn't work any faster.Physicalism
@ThantiK: it only means that the bulk of the time is spent by os.walk itself. It doesn't make your approach any more efficient.Claycomb
0

Thinking out of the box here... what if you did it based on size (rough sketch after the list):

  • Use subprocess to run 'du -sb' and get the total_size of your root directory
  • As you walk, check the size of each file and decrement from your total_size (giving you remaining_size)
  • pct_complete = (total_size - remaining_size)/total_size
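A rough sketch of that idea (UNIX-only as written, since it relies on GNU du -sb to get the total size in bytes; assumes Python 2.7+ for subprocess.check_output):

import os, subprocess

def walk_with_size_progress(root):
    # GNU du -sb prints "<bytes>\t<path>"; take the first field
    total_size = max(1, int(subprocess.check_output(['du', '-sb', root]).split()[0]))
    remaining_size = total_size

    for path, dirs, files in os.walk(root):
        for name in files:
            try:
                remaining_size -= os.path.getsize(os.path.join(path, name))
            except OSError: # file vanished or is unreadable
                pass
        pct_complete = (total_size - remaining_size) * 100.0 / total_size
        yield path, files, pct_complete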

Thoughts?

-aj

Easterly answered 29/1, 2010 at 20:31 Comment(0)
0

One optimisation you could do: you are converting filelist into a set on every call to returnMatches, even though it never changes. Move the conversion to the start of the locate function and pass the set in on every iteration.
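In other words, something along these lines (a sketch against the locate function from the question):

import os

def returnMatches(a, b):
    return list(set(a) & set(b)) # intersection, as defined in the question's comments

def locate(filelist, root=os.curdir):
    fileset = set(filelist) # build the set once, outside the walk loop
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in returnMatches(fileset, [k.lower() for k in files]):
            yield path + "\\"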

Routinize answered 29/1, 2010 at 20:54 Comment(1)
Thanks - I actually took this from SilentGhost's post, even though he was worried more about performance than the task at hand ;)Physicalism
0

Well, this was fun. Here is another silly way of doing it but, as with everything else, it only calculates the right progress for uniform directory trees.

import os

def calc_progress(progress, root, dirs):
    # progress maps a parent path to (start, end, subdirs): the slice of the
    # overall progress assigned to that parent and the subdirs it will visit.
    prog_start, prog_end, prog_slice = 0.0, 1.0, 1.0

    current_progress = 0.0
    parent_path, current_name = os.path.split(root)
    data = progress.get(parent_path)
    if data:
        prog_start, prog_end, subdirs = data
        i = subdirs.index(current_name)
        prog_slice = (prog_end - prog_start) / len(subdirs)
        current_progress = prog_slice * i + prog_start

        if i == (len(subdirs) - 1):
            del progress[parent_path]   # last sibling reached; the parent entry is no longer needed

    if dirs:
        # reserve this directory's slice of the progress for its own subdirectories
        progress[root] = (current_progress, current_progress + prog_slice, dirs)

    return current_progress

def walk(start_root):
    progress = {}
    print 'Starting with {start_root}'.format(**locals())

    for root, dirs, files in os.walk(start_root):
        print '{0}: {1:%}'.format(root[len(start_root) + 1:], calc_progress(progress, root, dirs))
Hyozo answered 29/1, 2010 at 21:36 Comment(0)
