Why does Python's os.walk() not reflect directory deletion?

I'm attempting to write a Python function that will recursively delete all empty directories. This means that if directory "a" contains only "b", "b" should be deleted, then "a" should be deleted (since it now contains nothing). If a directory contains anything, it is skipped. Illustrated:

top/a/b/
top/c/d.txt
top/c/foo/

Given this, the three directories "b", "a", and "foo" should be deleted, as "foo" and "b" are empty now, and "a" will become empty after the deletion of "b".

I'm attempting to do this via os.walk and shutil.rmtree. Unfortunately, my code only deletes directories that are already empty when the walk starts, not ones that become empty in the process.

I'm using the topdown=False parameter of os.walk. The documentation for os.walk says that "If topdown is False, the triple for a directory is generated after the triples for all of its subdirectories (directories are generated bottom-up)." That's not what I'm seeing.

Here's my code:

# Python 2 syntax (print statement), as used by the asker
import os
import shutil

for root, dirs, files in os.walk(".", topdown=False):
    contents = dirs + files
    print root, "contains:", contents
    if len(contents) == 0:
        print 'Removing "%s"' % root
        shutil.rmtree(root)
    else:
        print 'Not removing "%s". It has:' % root, contents

If I have the directory structure described above, here's what I get:

./c/foo contains: []
Removing "./c/foo"
./c contains: ['foo', 'd.txt']
Not removing "./c". It has: ['foo', 'd.txt']
./a/b contains: []
Removing "./a/b"
./a contains: ['b']
Not removing "./a". It has: ['b']
. contains: ['c', 'a']
Not removing ".". It has: ['c', 'a']

Note that, even though I've removed "b", "a" is not removed, because its triple still lists "b". What confuses me is that the documentation for os.walk says it generates the triple for "./a" after generating the triple for "./a/b". My output suggests otherwise. It's a similar story for "./c": its triple still lists "foo", even though I deleted it right at the start.

What am I doing wrong? (I'm using Python 2.6.6.)

Mesquite answered 9/2, 2015 at 20:25 Comment(2)
I wouldn't expect os.walk to be updated on every iteration of the for loop. - Chemoprophylaxis
I guess that's the key. The "before" and "after" in the documentation refer to the order of the triples that os.walk() yields, not to when each directory is read from the filesystem. The fact that the caller, in topdown=True mode, can modify the dirnames argument led me to think that changes made during the loop could affect the iteration. - Mesquite

The documentation for os.walk has this to say:

No matter the value of topdown, the list of subdirectories is retrieved before the tuples for the directory and its subdirectories are generated.
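Since dirs can be out of date by the time a triple arrives, one option is to ask the filesystem again instead of trusting the list. The following is a minimal sketch of that idea, not something from the original answers: it is written for Python 3 (the asker is on 2.6.6), the names remove_empty_dirs and build_example_tree are made up for illustration, and it assumes the starting directory itself should be kept.

```python
import os
import tempfile


def remove_empty_dirs(top):
    """Delete empty directories below `top`, deepest first; return removed paths."""
    removed = []
    for root, dirs, files in os.walk(top, topdown=False):
        if root == top:
            continue  # assumption: keep the starting directory itself
        # `dirs` was captured before the children were visited, so it may be
        # stale; os.listdir() asks the filesystem for the current contents.
        if not os.listdir(root):
            os.rmdir(root)  # os.rmdir only succeeds on an empty directory
            removed.append(root)
    return removed


def build_example_tree(base):
    """Recreate the layout from the question under `base`."""
    os.makedirs(os.path.join(base, "a", "b"))
    os.makedirs(os.path.join(base, "c", "foo"))
    with open(os.path.join(base, "c", "d.txt"), "w") as handle:
        handle.write("placeholder\n")


with tempfile.TemporaryDirectory() as base:
    build_example_tree(base)
    removed = remove_empty_dirs(base)
    removed_names = sorted(os.path.relpath(path, base) for path in removed)
    remaining = sorted(os.listdir(base))
    print(removed_names)
    print(remaining)
```

On the example tree this removes "a/b", "a" and "c/foo" and leaves "c" with its d.txt. Using os.rmdir rather than shutil.rmtree is a deliberate choice here: if the emptiness check is ever wrong, os.rmdir raises an error instead of deleting files.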

Chemoprophylaxis answered 9/2, 2015 at 20:46 Comment(1)
This is the best answer so far. It says that topdown=False becomes primarily a question of data ordering in the output of os.walk(), not temporal ordering of the underlying filesystem exploration. - Mesquite

jcfollower's answer is absolutely correct about the cause of the issue you're encountering: the file system is always read top-down, with each directory listed before os.walk descends into it, even if the results are yielded in a bottom-up manner. This means that the filesystem modifications you perform while handling a subdirectory won't be reflected in the lists yielded later for its parent.
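To make that ordering concrete, here is a rough model of what a bottom-up walk does. It is only a sketch under assumptions: simple_walk is a hypothetical stand-in, not CPython's actual implementation, it ignores error handling and symlinks, and it uses Python 3 syntax.

```python
import os
import tempfile


def simple_walk(top, topdown=True):
    """Simplified, hypothetical model of os.walk (no error or symlink handling)."""
    dirs, files = [], []
    # The directory is listed here, before any recursion, in both modes.
    for name in sorted(os.listdir(top)):
        if os.path.isdir(os.path.join(top, name)):
            dirs.append(name)
        else:
            files.append(name)
    if topdown:
        yield top, dirs, files
    for name in dirs:
        for triple in simple_walk(os.path.join(top, name), topdown):
            yield triple
    if not topdown:
        # These are the lists that were filled in before the children were visited.
        yield top, dirs, files


with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "a", "b"))
    seen = {}
    for root, dirs, files in simple_walk(base, topdown=False):
        seen[os.path.relpath(root, base)] = list(dirs)
        if not dirs and not files:
            os.rmdir(root)  # removes "a/b", but the list for "a" already exists
    print(seen)
```

In this model the triple for "a" still reports "b" after "a/b" has been removed, which mirrors the output in the question.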

A solution to this issue is to maintain a set of the deleted directories, so that you can filter them out of their parent's list of subdirectories:

removed = set()                                               # first new line
for root, dirs, files in os.walk(".", topdown=False):
    dirs = [d for d in dirs if os.path.join(root, d) not in removed]  # second new line
    contents = dirs + files
    print root, "contains:", contents
    if len(contents) == 0:
        print 'Removing "%s"' % root
        shutil.rmtree(root)
        removed.add(root)                                     # third new line
    else:
        print 'Not removing "%s". It has:' % root, contents

There are three new lines. The first, at the top, creates an empty removed set to hold the removed directories. The second replaces the dirs list with a new list that leaves out any subdirectories found in the removed set, since they were deleted in a previous step. The last new line adds the current directory to the set once it has been removed.
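For anyone who wants to try this outside the asker's directory, the same set-based idea can be wrapped in a function and run against a throwaway tree. This is a sketch with a few assumptions of its own: it uses Python 3 syntax, the name prune_empty_dirs is invented for illustration, it swaps shutil.rmtree for os.rmdir (which refuses to delete a non-empty directory), and it never removes the starting directory.

```python
import os
import tempfile


def prune_empty_dirs(top):
    """Remove empty directories under `top`, tracking deletions in a set."""
    removed = set()
    for root, dirs, files in os.walk(top, topdown=False):
        # Ignore subdirectories that were already deleted earlier in this walk.
        remaining = [d for d in dirs if os.path.join(root, d) not in removed]
        # Assumption: the starting directory is kept even if it ends up empty.
        if not remaining and not files and root != top:
            os.rmdir(root)
            removed.add(root)
    return removed


with tempfile.TemporaryDirectory() as base:
    # Recreate the layout from the question.
    os.makedirs(os.path.join(base, "a", "b"))
    os.makedirs(os.path.join(base, "c", "foo"))
    with open(os.path.join(base, "c", "d.txt"), "w") as handle:
        handle.write("placeholder\n")
    pruned = sorted(os.path.relpath(p, base) for p in prune_empty_dirs(base))
    left = sorted(os.listdir(base))
    print(pruned)
    print(left)
```

The lookup works because os.walk builds each child's root by joining the parent's root with the subdirectory name, so os.path.join(root, d) produces the same string that was stored in the set.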

Central answered 9/2, 2015 at 22:28 Comment(1)
That's a neat trick! Very clever. It acknowledges that os.walk() is going to give you information that has possibly been invalidated by the deletions and explicitly modifies what it returns. - Mesquite
