How to determine if a path is a subdirectory of another?
S

7

6

I am given a list of paths whose files I need to check. If I am given both a root and one of its subdirectories, there is no need to process the subdirectory separately. For example:

c:\test  // process this
c:\test\pics // do not process this
c:\test2 // process this

How can I tell, in a cross-platform way, that a path is not a subdirectory of another? I am not worried about symlinks as long as they are not cyclical (the worst case is that I end up processing the data twice).

Shornick answered 13/1, 2012 at 17:8 Comment(0)
B
7

I would maintain a set of directories you have already processed, and then for each new path check to see if any of its parent directories already exist in that set before processing:

import os.path

visited = set()
for path in path_list:
    head, tail = os.path.split(path)
    while head and tail:
        if head in visited:
            break
        head, tail = os.path.split(head)
    else:
        process(path)
        visited.add(path)

Note that path_list should be sorted so that subdirectories are always after their parent directories if they exist.
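
The note about sorting can be made concrete with a small, self-contained sketch. The names `parents` and `top_level` are invented for this illustration, and `ntpath` is used only so that the question's Windows paths split the same way on any platform; it assumes the input paths are already normalized.

```python
import ntpath  # Windows path rules on any platform; pass os.path for native paths


def parents(path, pathmod=ntpath):
    """Yield each ancestor of *path*, nearest first (hypothetical helper)."""
    head, tail = pathmod.split(path)
    while head and tail:
        yield head
        head, tail = pathmod.split(head)


def top_level(path_list, pathmod=ntpath):
    """Return the paths that are not below any other path in the list."""
    visited = set()
    # A parent is a string prefix of its children, so plain sorting puts it first
    for path in sorted(path_list):
        if not any(p in visited for p in parents(path, pathmod)):
            visited.add(path)
    return sorted(visited)


print(top_level([r"c:\test\pics", r"c:\test2", r"c:\test"]))
```

Returning the surviving paths instead of calling `process` keeps the sketch testable; the membership test against the set is the same idea as in the answer.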

Basilica answered 13/1, 2012 at 17:18 Comment(7)
This will be faster than my suggestion because it does set membership tests rather than scanning a list. I like it. - Saccharose
@F.J seems to be an infinite loop: head reduces to c:\ at its very base and never gets set to None. - Shornick
@Shornick - Sorry about that, I had thought that in the base case it would put everything into tail, not head. See my edit, which should fix the problem. - Basilica
@F.J still an infinite loop: if head is in the visited list, head is never updated, so it will be stuck in continue forever. - Shornick
@Shornick - Right you are, sorry about the sloppy code! It should actually be fixed now. - Basilica
@F.J no worries, I already got it working with a boolean if, but you just taught me something new with the else, pretty cool! - Shornick
Using visited seems rather risky, especially for long-running processes, where it may grow beyond the system's memory. - Genuflection
F
8
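import os

def is_subdir(path, directory):
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)

    relative = os.path.relpath(path, directory)

    # Compare whole components: a child named "..foo" also starts with os.pardir.
    if relative == os.pardir or relative.startswith(os.pardir + os.sep):
        return False
    else:
        return True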
Fuzzy answered 7/8, 2013 at 23:51 Comment(1)
Using relpath can fail on MS Windows: it raises ValueError when the paths are on different drives, for example when finding C:\foo relative to D:\bar. - Genuflection
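
A minimal sketch of one way around the cross-drive failure. `ntpath` (the Windows flavour of `os.path`, importable on any platform) is used so the behaviour can be reproduced anywhere. The `pathmod` parameter, the omission of `realpath`, and the choice to treat a path as being inside itself are assumptions of this sketch, not part of the answer.

```python
import ntpath  # Windows path semantics, importable on every platform


def is_subdir(path, directory, pathmod=ntpath):
    """Return True if *path* equals or lies below *directory* (no symlink resolution)."""
    try:
        relative = pathmod.relpath(path, directory)
    except ValueError:
        # ntpath.relpath raises ValueError when the two paths are on different drives
        return False
    # ".." must be a whole leading component; a name like "..foo" is a real child
    return relative != pathmod.pardir and not relative.startswith(
        pathmod.pardir + pathmod.sep
    )


print(is_subdir(r"C:\test\pics", r"C:\test"))  # True
print(is_subdir(r"C:\foo", r"D:\bar"))         # False, instead of an exception
```

Passing `pathmod=os.path` applies the rules of the platform the code is running on.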
S
2

Track the directories you've already processed (in a normalized form) and skip any new path that falls under one of them. Something like this should work:

from os.path import realpath, normcase, sep

dirs = [r"C:\test", r"C:\test\pics", r"C:\test2"]

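processed = []

# Parents must appear in dirs before their children for this to work.
for d in dirs:
    # The trailing separator keeps C:\test from matching C:\test2.
    d = normcase(realpath(d)) + sep
    if not any(d.startswith(p) for p in processed):
        processed.append(d)
        process(d)              # your code here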
Saccharose answered 13/1, 2012 at 17:23 Comment(4)
commonprefix([r'C:\test2', r'C:\test']) --> 'C:\\test' - Basilica
Yeah, sigh. That doesn't really do what it should, IMHO. Changed it to just do a simple startswith() -- that'll be fine since it's normalized. - Saccharose
Yeah, I really think the behavior of commonprefix is kind of weird. It seems like it should only check at directory breaks, since it comes from the os.path module. Oh well. - Basilica
@F.J. Just as a point of interest: commonprefix accepts a list of any kind of sequences (not just strings). So a list of lists containing path components will work as expected (although this is probably not by design). - Taboret
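
The difference discussed in these comments can be seen in a short sketch. It uses `ntpath` so the Windows paths behave the same on any platform, and it also shows `commonpath` (in `os.path` since Python 3.5), which the thread does not mention but which compares whole components.

```python
import ntpath  # Windows path semantics, importable on every platform

paths = [r"C:\test2", r"C:\test"]

# commonprefix compares character by character, so it can stop mid-name
char_prefix = ntpath.commonprefix(paths)   # 'C:\\test'

# commonpath compares whole components and never splits a directory name
dir_prefix = ntpath.commonpath(paths)      # 'C:\\'

# commonprefix on lists of components, the trick from the last comment
part_prefix = ntpath.commonprefix([p.split(ntpath.sep) for p in paths])  # ['C:']

print(repr(char_prefix), repr(dir_prefix), part_prefix)
```

With `commonpath`, a subdirectory test could be written as `commonpath([parent, child]) == parent` on normalized paths, bearing in mind that it raises `ValueError` for paths on different drives.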
C
0

Fixed and simplified jgoeders's version:

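import os

def is_subdir(suspect_child, suspect_parent):
    suspect_child = os.path.realpath(suspect_child)
    suspect_parent = os.path.realpath(suspect_parent)

    relative = os.path.relpath(suspect_child, start=suspect_parent)

    # Match ".." only as a whole path component, not as a name prefix such as "..foo".
    return relative != os.pardir and not relative.startswith(os.pardir + os.sep)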
Cryptogam answered 29/4, 2014 at 5:11 Comment(3)
Why not just edit the post instead of adding another duplicate answer? - Fuzzy
Sorry, I post on SO once in a blue moon, and missed this feature. - Cryptogam
realpath: eliminating symbolic links here is problematic (in many cases it is not what you want), since following links can change the path layout completely. - Genuflection
G
0

Here is an is_subdir utility function I came up with.

  • Python 3.x compatible (works with bytes and str, matching os.path, which also supports both).
  • Normalizes paths for comparison (parent hierarchy and case, so that it works on MS Windows).
  • Avoids using os.path.relpath, which will raise an exception on MS Windows if the paths are on different drives (C:\foo -> D:\bar).

Code:

def is_subdir(path, directory):
    """
    Returns true if *path* is in a subdirectory of *directory*.
    """
    from os.path import normpath, normcase, sep
    path = normpath(normcase(path))
    directory = normpath(normcase(directory))
    if len(path) > len(directory):
        # os.path.sep is a str; encode it when the inputs are bytes
        sep = sep.encode('ascii') if isinstance(directory, bytes) else sep
        if path.startswith(directory.rstrip(sep) + sep):
            return True
    return False
Genuflection answered 26/3, 2015 at 5:38 Comment(0)
S
0

Here is the solution I used, based on Andrew Clark's answer. It makes sure the list is sorted so that children come after their parents, and uses normpath and normcase to unify paths that refer to the same location, such as c:\users and c:/users on Windows.

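import locale
import os
from functools import cmp_to_key
from os.path import normcase, normpath, realpath

def unique_path_roots(paths):
    visited = set()
    # Normalize first, so that de-duplication and sorting operate on the
    # same form that the parent check below compares.
    paths = {normcase(normpath(realpath(path))) for path in paths}

    for path in sorted(paths, key=cmp_to_key(locale.strcoll)):
        head, tail = os.path.split(path)
        while head and tail:
            if head in visited:
                break
            head, tail = os.path.split(head)
        else:
            yield path
            visited.add(path)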
Shornick answered 12/4, 2023 at 17:14 Comment(0)
F
0

I suspect you could build something like this using pathlib's Path("x").is_relative_to (available since Python 3.9).
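
A rough sketch of what that could look like, assuming Python 3.9 or later. `unique_roots` and its `path_cls` parameter are hypothetical names for this illustration. `PureWindowsPath` is used so the question's paths parse identically on any platform, and no symlinks are resolved.

```python
from pathlib import PureWindowsPath


def unique_roots(paths, path_cls=PureWindowsPath):
    """Return the paths that are not inside (or equal to) an earlier root."""
    roots = []
    # Fewer components first, so a parent is always examined before its children
    for p in sorted((path_cls(p) for p in paths), key=lambda p: len(p.parts)):
        if not any(p.is_relative_to(r) for r in roots):
            roots.append(p)
    return roots


for root in unique_roots([r"c:\test\pics", r"c:\test", r"c:\test2"]):
    print(root)
```

Because `is_relative_to` compares whole path components, `c:\test2` is not mistaken for a child of `c:\test`. Passing `path_cls=pathlib.Path` would apply the native path rules instead.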

Fount answered 11/1 at 21:16 Comment(0)
