How to circumvent the fallacy of Python's os.path.commonprefix?
M

5

26

My problem is to find the common path prefix of a given set of files.

Naively, I expected os.path.commonprefix to do just that. Unfortunately, the fact that commonprefix lives in os.path is rather misleading, since it actually compares the inputs as plain strings, character by character, rather than as paths.

The question for me is how this can actually be solved for paths. The issue was briefly mentioned in this (fairly highly rated) answer, but only as a side note, and the proposed solution (appending slashes to the input of commonprefix) has issues IMHO, since it fails, for instance, for:

os.path.commonprefix(['/usr/var1/log/', '/usr/var2/log/'])
# returns /usr/var but it should be /usr

To prevent others from falling into the same trap, it might be worthwhile to discuss this issue in a separate question: is there a simple, portable solution for this problem that does not rely on nasty checks against the file system (i.e., taking the result of commonprefix, checking whether it is a directory, and if not, returning os.path.dirname of the result)?

Meteoric answered 1/2, 2014 at 14:0 Comment(2)
Related issue: bugs.python.org/issue10395. A patch is in the pipeline.Rhodic
@bluenote10, Maybe we could change the accepted answer to the one from "cjac"?Eutherian
T
24

It seems that this issue has been corrected in recent versions of Python. New in version 3.5 is the function os.path.commonpath(), which returns the common path instead of the common string prefix.
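
As a small illustration of the difference, here is the example from the question run through both functions. This is only a sketch: it uses posixpath (the module behind os.path on Unix-like systems) so that the results look the same on any platform; with os.path on Windows the separators in the output would differ.

```python
import posixpath  # the POSIX flavour of os.path, chosen so results do not depend on the platform

paths = ['/usr/var1/log/', '/usr/var2/log/']

# String-based: compares the inputs character by character
string_prefix = posixpath.commonprefix(paths)

# Path-based (Python 3.5+): compares whole path components
path_prefix = posixpath.commonpath(paths)

print(string_prefix)  # /usr/var
print(path_prefix)    # /usr
```

Note that commonpath raises ValueError if the sequence is empty or mixes absolute and relative paths, so it may need a try/except where commonprefix would silently return a string.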

Turnbow answered 23/3, 2016 at 20:2 Comment(1)
Thanks! This is Yet Another Reason to switch to python 3 -- glad I finally did it last year.Millinery
A
16

A while ago I ran into this: os.path.commonprefix returns a string prefix, not a path prefix as one would expect. So I wrote the following:

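def commonprefix(paths):
    # Unlike os.path.commonprefix, this always returns a path prefix,
    # because it compares the paths component by component.
    common = []
    split_paths = [p.split('/') for p in paths]
    shortest = min(len(parts) for parts in split_paths)

    for i in range(shortest):
        # All paths agree at position i only if the set has one element
        candidates = set(parts[i] for parts in split_paths)
        if len(candidates) != 1:
            break

        common.append(candidates.pop())

    return '/'.join(common)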

It could be made more portable by replacing '/' with os.path.sep.
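
One way the os.path.sep idea might look is sketched below. The function name common_path_prefix and the sep parameter are made up for this illustration; the parameter defaults to the local separator and is only exposed so the behaviour can be tried with either separator on any machine.

```python
import os

def common_path_prefix(paths, sep=os.path.sep):
    """Component-wise common prefix; `sep` (hypothetical parameter)
    defaults to the separator of the local operating system."""
    split_paths = [p.split(sep) for p in paths]
    common = []
    # zip(*...) yields the i-th component of every path and stops
    # at the shortest path
    for column in zip(*split_paths):
        if len(set(column)) != 1:
            break
        common.append(column[0])
    return sep.join(common)

print(common_path_prefix(['/usr/var1/log/', '/usr/var2/log/'], sep='/'))  # /usr
```

On Windows, '/' is accepted as a separator as well, so it is probably wise to run the inputs through os.path.normpath first if they may contain a mixture of both.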

Aloise answered 1/2, 2014 at 15:5 Comment(5)
I accepted this answer since it is fairly robust while remaining brief. It is interesting to see that, in contrast to a solution based on os.path.commonprefix (see Dan Getz's answer), the component-wise comparison does not depend on whether the input paths are file or directory names (simply because filenames are unique components). For a more robust approach I recommend EOL's answer, and to learn more about the problems of os.path.commonprefix, Dan Getz's answer is very instructive. Thank you all!Meteoric
This gives a reasonable answer with both file or directory names, but in the general case you can't know for certain if the result of the function is a file or directory name without doing more work (such as by controlling the input). Also, I believe this returns the exact same result as os.path.dirname(os.path.commonprefix([p + '/' for p in l]))?Amaze
@DanGetz It apparently does not do the same. I just tried it on Windows by replacing it with os.path.sep, and the code provided by @DanD results in the correct common path while your snippet returns NoneThimbu
@Thimbu what inputs cause that? I was unaware dirname could return None.Amaze
@DanGetz I must have messed it up somewhere. I just tried it again in a fresh environment and it works. My bad.Thimbu
A
7

Assuming you want the common directory path, one way is to:

  1. Use only directory paths as input. If your input value is a file name, call os.path.dirname(filename) to get its directory path.
  2. "Normalize" all the paths so that they are relative to the same thing and don't include double separators. The easiest way to do this is by calling os.path.abspath() to get the path relative to the root. (You might also want to use os.path.realpath() to remove symbolic links.)
  3. Add a final separator (found portably with os.path.sep or os.sep) to the end of all the normalized directory paths.
  4. Call os.path.dirname() on the result of os.path.commonprefix().
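
To see why the final separator in step 3 matters, consider one directory nested inside another. The snippet below is only a sketch: common_dir is a throwaway helper invented for this comparison, and it uses posixpath with a hard-coded '/' so the output is the same on every platform.

```python
import posixpath

def common_dir(directories, add_sep):
    # Hypothetical helper: steps 3 and 4 only, with step 3 optional
    if add_sep:
        directories = [d + '/' for d in directories]
    return posixpath.dirname(posixpath.commonprefix(directories))

nested = ['/usr/var', '/usr/var/log']

# Without the separator, dirname() strips one component too many
print(common_dir(nested, add_sep=False))  # /usr

# With the separator, the shared directory itself survives
print(common_dir(nested, add_sep=True))   # /usr/var
```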

In code (without removing symbolic links):

import os

def common_path(directories):
    # abspath() normalizes each path; the trailing separator makes
    # commonprefix() stop at a component boundary
    norm_paths = [os.path.abspath(p) + os.path.sep for p in directories]
    return os.path.dirname(os.path.commonprefix(norm_paths))

def common_path_of_filenames(filenames):
    return common_path([os.path.dirname(f) for f in filenames])
Amaze answered 1/2, 2014 at 14:56 Comment(10)
When I was thinking about this idea I rejected it initially, because I thought appending a slash would be a problem when working with relative paths, like a blank file name in the current directory. However, when wrapping all paths in os.path.abspath, even a mixture of relative and absolute paths should be no problem, right?Meteoric
@Meteoric Right, abspath instead of normpath will handle the relative paths. Not sure about blank file names, because that's ambiguous with duplicated path separators. Does anyone allow zero-length file names?Amaze
Oh, I didn't mean empty file names, just "aSimpleFileName".Meteoric
One should probably mention that it all depends on whether paths are file paths or directory paths (this must be a convention in case we want to avoid file system access). In my problem paths would be in fact files not directories. I think in this case the proper order is: (a) convert to abspath to deal with mixture of relative/absolute paths; (b) apply dirname to convert to a proper directory. From that point I think we can safely apply your solution.Meteoric
Oh, now I see it, good point. You need to know in advance if your string is a file or directory path. I thought I was getting around that, but instead I was just assuming directories.Amaze
I think it is important to mention that one cannot simply take the dirname on the input. I tried to edit the question to make that clear. Feel free to revert if I screwed things up :).Meteoric
Hey, looks like someone already reverted it. I looked at what you wrote, and tried to rewrite my answer to make it clearer about how you need to be careful to not use file paths (that is, to get the directory path first).Amaze
In my edit I tried to explain why what you suggested does not work for files. I think it is not possible to call dirname on the input to common_path. This would discard information on relative paths, because abspath(dirname("somefile.txt")) != dirname(abspath("somefile.txt")). Imho it is necessary to have a separate version of common_path, which internally does first abspath then dirname (the other way around compared to applying dirname to the argument). Don't know why my edit was discarded :(.Meteoric
For a relative path, I see no problem. Are you talking about file links? Do you have an example where it really is true that abspath(dirname(x)) != dirname(abspath(x))? They're equal for the example you gave.Amaze
You're right! I was wrongly assuming that abspath of an empty string would be evaluated to "the absolute path that does not even contain the root slash", i.e., another empty string. I'm glad I finally see the cause of my confusion. Thanks for making that clear!Meteoric
P
2

A robust approach is to split the path into individual components and then find the longest common prefix of the component lists.

Here is an implementation which is cross-platform and can be generalized easily to more than two paths:

import os.path

def components(path):
    '''
    Returns the individual components of the given file path
    string (for the local operating system).

    The returned components, when joined with os.path.join(), point to
    the same location as the original path.
    '''
    parts = []
    # The loop guarantees that the returned components can be
    # os.path.joined with the path separator and point to the same
    # location:
    while True:
        (new_path, tail) = os.path.split(path)  # Works on any platform
        parts.append(tail)
        if new_path == path:  # Root (including drive, on Windows) reached
            break
        path = new_path
    parts.append(new_path)

    parts.reverse()  # First component first
    return parts

def longest_prefix(iter0, iter1):
    '''
    Returns the longest common prefix of the given two iterables.
    '''
    prefix = []
    # zip() works on both Python 2 and 3 (itertools.izip is Python 2 only)
    for (elmt0, elmt1) in zip(iter0, iter1):
        if elmt0 != elmt1:
            break
        prefix.append(elmt0)
    return prefix

def common_prefix_path(path0, path1):
    return os.path.join(*longest_prefix(components(path0), components(path1)))

# For Unix:
assert common_prefix_path('/', '/usr') == '/'
assert common_prefix_path('/usr/var1/log/', '/usr/var2/log/') == '/usr'
assert common_prefix_path('/usr/var/log1/', '/usr/var/log2/') == '/usr/var'
assert common_prefix_path('/usr/var/log', '/usr/var/log2') == '/usr/var'
assert common_prefix_path('/usr/var/log', '/usr/var/log') == '/usr/var/log'
# Only for Windows:
# assert common_prefix_path(r'C:\Programs\Me', r'C:\Programs') == r'C:\Programs'
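
A possible way to generalize this to more than two paths is to fold the pairwise prefix function over all component lists with functools.reduce. The sketch below is self-contained and makes two substitutions that are not part of the code above: it splits paths with pathlib.PurePosixPath(...).parts instead of the components() function, and the name common_prefix_path_many is made up. PurePosixPath is used so that the example gives the same result on every platform; pathlib.PurePath would follow the local conventions instead.

```python
from functools import reduce
from pathlib import PurePosixPath

def longest_prefix(seq0, seq1):
    """Longest common prefix of two sequences, as a list."""
    prefix = []
    for a, b in zip(seq0, seq1):
        if a != b:
            break
        prefix.append(a)
    return prefix

def common_prefix_path_many(paths):
    """Hypothetical name: common leading path of one or more paths."""
    # .parts gives e.g. ('/', 'usr', 'var1', 'log'); a trailing slash is ignored
    split_paths = [PurePosixPath(p).parts for p in paths]
    # Fold the pairwise function over all component sequences
    shared = reduce(longest_prefix, split_paths)
    return str(PurePosixPath(*shared))

print(common_prefix_path_many(['/usr/var1/log/', '/usr/var2/log/', '/usr/lib']))  # /usr
```

reduce() needs at least one element here, so an empty list of paths raises TypeError and would have to be handled separately.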
Pemphigus answered 3/2, 2014 at 11:30 Comment(0)
V
2

I've made a small python package commonpath to find common paths from a list. Comes with a few nice options.

https://github.com/faph/Common-Path

Valve answered 30/9, 2015 at 9:4 Comment(1)
Cool little package! :)Eutherian
