In python on OSX with HFS+ how can I get the correct case of an existing filename?

Asked 25/1, 2013 at 3:55 Answered 14/8, 2015 at 2:2

I am storing data about files that exist on a OSX HFS+ filesystem. I later want to iterate over the stored data and figure out if each file still exists. For my purposes, I care about filename case sensitivity, so if the case of a filename has changed I would consider the file to no longer exist.

I started out by trying

os.path.isfile(filename)

but on a normal install of OSX on HFS+, this returns True even if the filename case does not match. I am looking for a way to write an isfile() function that cares about case even when the filesystem does not.

os.path.normcase() and os.path.realpath() both return the filename in whatever case I pass into them.

Edit:

I now have two functions that seem to work on filenames limited to ASCII. I don't know how unicode or other characters might affect this.

The first is based off answers given here by omz and Alex L.

import os

def does_file_exist_case_sensitive1a(fname):
    if not os.path.isfile(fname): return False
    path, filename = os.path.split(fname)
    search_path = '.' if path == '' else path
    # os.listdir() reports names in the case stored on disk
    for name in os.listdir(search_path):
        if name == filename: return True
    return False

The second is probably even less efficient.

import glob, os, re

def does_file_exist_case_sensitive2(fname):
    if not os.path.isfile(fname): return False
    m = re.search(r'[a-zA-Z][^a-zA-Z]*\Z', fname)
    if m:
        # replace only the last letter with a wildcard so glob has to list the directory
        test = fname[:m.start()] + '?' + fname[m.start() + 1:]
        # the wildcard may match other files too, so test for membership
        return fname in glob.glob(test)
    else:
        return True  # no letters in file, case sensitivity doesn't matter

Here is a third based off DSM's answer.

def does_file_exist_case_sensitive3(fname):
    if not os.path.isfile(fname): return False
    path, filename = os.path.split(fname)
    search_path = '.' if path == '' else path
    # stat each entry relative to search_path, not the current directory
    inodes = {os.stat(os.path.join(search_path, x)).st_ino: x
              for x in os.listdir(search_path)}
    return inodes.get(os.stat(fname).st_ino) == filename

I don't expect that these will perform well if I have thousands of files in a single directory. I'm still hoping for something that feels more efficient.

Another shortcoming I noticed while testing these is that they only check the filename for a case match. If I pass them a path that includes directory names, none of these functions checks the case of the directory names.

Pilkington answered 25/1, 2013 at 3:55 Comment(2)

I don't know if it is an option for your application, but it is possible to convert HFS+ to a case sensitive file system. Or you can use UFS. – Procryptic 25/1, 2013 at 4:59

In my situation changing the filesystem type is not an option. – Pilkington 25/1, 2013 at 5:9

This answer complements the existing ones by providing functions, adapted from Alex L's answer, that:

also work with non-ASCII characters
process all path components (not just the last)
work with both Python 2.x and 3.x
as a bonus, also work on Windows (there are better Windows-specific solutions - see https://mcmap.net/q/454340/-python-getting-filename-case-as-stored-in-windows - but the functions here are cross-platform and require no additional packages)

import os, sys, unicodedata

def gettruecasepath(path): # IMPORTANT: <path> must be a Unicode string
  if not os.path.lexists(path): # use lexists to also find broken symlinks
    raise OSError(2, u'No such file or directory', path)
  isosx = sys.platform == u'darwin'
  if isosx: # convert to NFD for comparison with os.listdir() results
    path = unicodedata.normalize('NFD', path)
  parentpath, leaf = os.path.split(path)
  # find true case of leaf component
  if leaf not in [ u'.', u'..' ]: # skip . and .. components
    leaf_lower = leaf.lower() # if you use Py3.3+: change .lower() to .casefold()
    found = False
    for leaf in os.listdir(u'.' if parentpath == u'' else parentpath):
      if leaf_lower == leaf.lower(): # see .casefold() comment above
          found = True
          if isosx:
            leaf = unicodedata.normalize('NFC', leaf) # convert to NFC for return value
          break
    if not found:
      # should only happen if the path was just deleted
      raise OSError(2, u'Unexpectedly not found in ' + parentpath, leaf_lower)
  # recurse on parent path
  if parentpath not in [ u'', u'.', u'..', u'/', u'\\' ] and \
                not (sys.platform == u'win32' and 
                     os.path.splitdrive(parentpath)[1] in [ u'\\', u'/' ]):
      parentpath = gettruecasepath(parentpath) # recurse
  return os.path.join(parentpath, leaf)


def istruecasepath(path): # IMPORTANT: <path> must be a Unicode string
  return gettruecasepath(path) == unicodedata.normalize('NFC', path)

gettruecasepath() gets the case-exact representation, as stored in the filesystem, of the specified path (absolute or relative), if it exists:
- The input path must be a Unicode string:
  - Python 3.x: strings are natively Unicode - no extra action needed.
  - Python 2.x: literals: prefix with u; e.g., u'Motörhead'; str variables: convert with, e.g., strVar.decode('utf8')
- The string returned is a Unicode string in NFC (composed normal form). NFC is returned even on OSX, where the filesystem (HFS+) stores names in NFD (decomposed normal form).
  NFC is returned, because it is far more common than NFD, and Python doesn't recognize equivalent NFC and NFD strings as (conceptually) identical. See below for background information.
- The path returned retains the structure of the input path (relative vs. absolute, components such as . and ..), except that multiple path separators are collapsed, and, on Windows, the returned path always uses \ as the path separator.
- On Windows, a drive / UNC-share component, if present, is retained as-is.
- An OSError exception is thrown if the path does not exist, or if you do not have permission to access it.
- If you use this function on a case-sensitive filesystem, e.g., on Linux with ext4, it effectively degrades to indicating whether the input path exists in the exact case specified or not.
istruecasepath() uses gettruecasepath() to compare the input path to the path as stored in the filesystem.

Caveat: Since these functions need to examine all directory entries at every level of the input path (as specified), they will be slow - unpredictably so, as performance will correspond to how many items the directories examined contain. Read on for background information.
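If many stored paths are checked in one pass, one way to soften this cost is to list each directory only once and reuse the result. The following is a minimal sketch of that idea and not part of the functions above. The class name ExactCaseChecker and its methods are hypothetical. It assumes Python 3 with str paths, and it assumes the filesystem does not change while the cache is alive. It only answers yes or no; it does not return the true-case path.

```python
import os
import unicodedata


class ExactCaseChecker(object):
    """Hypothetical helper: exact-case existence checks with cached listings."""

    def __init__(self):
        self._listings = {}  # directory path -> frozenset of NFC entry names

    def _names(self, directory):
        # List each directory at most once; unreadable or missing
        # directories are treated as empty.
        if directory not in self._listings:
            try:
                names = os.listdir(directory or '.')
            except OSError:
                names = []
            self._listings[directory] = frozenset(
                unicodedata.normalize('NFC', name) for name in names)
        return self._listings[directory]

    def exists(self, path):
        """Return True if every component of path exists in exactly this case."""
        if not path:
            return False
        path = unicodedata.normalize('NFC', path)
        while True:
            parent, leaf = os.path.split(path)
            if parent == path:  # reached the root or the empty relative head
                return True
            if leaf and leaf not in ('.', '..') and leaf not in self._names(parent):
                return False
            path = parent
```

After the first lookup in a directory, each further check of a name in that directory is a set lookup rather than a new enumeration. The cache has to be discarded (create a new checker) whenever files may have been renamed.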

Background

Native API support (lack thereof)

It is curious that neither OSX nor Windows provides a native API method that directly solves this problem.

While on Windows you can cleverly combine two API methods to solve the problem, on OSX there is no alternative that I'm aware of to the - unpredictably - slow enumeration of directory contents on each level of the path examined, as employed above.

Unicode normal forms: NFC vs. NFD

HFS+ (OSX's filesystem) stores filenames in decomposed Unicode form (NFD), which causes problems when comparing such names to in-memory Unicode strings in most programming languages, which are usually in composed Unicode form (NFC).

For instance, a path with the non-ASCII character ü that you specify as a literal in your source code will be represented as a single Unicode codepoint, U+00FC; this is an example of NFC: the 'C' stands for composed, because the base letter u and its diacritic ¨ (a combining diaeresis) form a single letter.

By contrast, if you use ü as a part of an HFS+ filename, it is translated to NFD form, which results in 2 Unicode codepoints: the base letter u (U+0075), followed by the combining diaeresis (̈, U+0308) as a separate codepoint; the 'D' stands for decomposed, because the character is decomposed into the base letter and its associated diacritic.

Even though the Unicode standard deems these 2 representations (canonically) equivalent, most programming languages, including Python, do not recognize such equivalence.
In the case of Python, you must use unicodedata.normalize() to convert both strings to the same form before comparing.

(Side note: Unicode normal forms are separate from Unicode encodings, though the differing numbers of Unicode code points typically also impact the number of bytes needed to encode each form. In the example above, the single-codepoint ü (NFC) requires 2 bytes to encode in UTF-8 (U+00FC -> 0xC3 0xBC), whereas the two-codepoint ü (NFD) requires 3 bytes (U+0075 -> 0x75, and U+0308 -> 0xCC 0x88)).
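The ü example can be reproduced directly with unicodedata. This is a small illustrative sketch assuming Python 3; the helper name same_name is hypothetical and is not used by the functions above.

```python
import unicodedata

composed = u'\u00fc'                                  # ü as one code point (NFC)
decomposed = unicodedata.normalize('NFD', composed)   # u + combining diaeresis (NFD)

print(len(composed), len(decomposed))                 # 1 2
print(composed == decomposed)                         # False
print(len(composed.encode('utf-8')),
      len(decomposed.encode('utf-8')))                # 2 3


def same_name(a, b):
    """Compare two names after converting both to the same normal form."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


print(same_name(composed, decomposed))                # True
```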

Dippold answered 14/8, 2015 at 2:2 Comment(1)

Holy crap, not exactly the same problem, but I traced my issue to os.listdir. I was writing queries based on file paths, some of which included Unicode characters. MySQL had the file names stored correctly (from a Linux system). My code needs to run on multiple platforms, so I took excerpts of your code to fit my needs and voilà! Thanks for your answer. – Smearcase 19/8, 2019 at 3:18

Following on from omz's post - something like this might work:

import os

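def getcase(filepath):
    path, filename = os.path.split(filepath)
    # os.listdir('') fails, so fall back to the current directory
    for fname in os.listdir(path or '.'):
        if filename.lower() == fname.lower():
            return os.path.join(path, fname)
    return None  # no entry matches, whatever the case

print(getcase('/usr/myfile.txt'))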

Ofelia answered 25/1, 2013 at 4:24 Comment(0)

Here's a crazy thought I had. Disclaimer: I don't know nearly enough about filesystems to consider edge cases, so take this merely as something which happened to work. Once.

>>> !ls
A.txt   b.txt
>>> inodes = {os.stat(x).st_ino: x for x in os.listdir(".")}
>>> inodes
{80827580: 'A.txt', 80827581: 'b.txt'}
>>> inodes[os.stat("A.txt").st_ino]
'A.txt'
>>> inodes[os.stat("a.txt").st_ino]
'A.txt'
>>> inodes[os.stat("B.txt").st_ino]
'b.txt'
>>> inodes[os.stat("b.txt").st_ino]
'b.txt'

Washtub answered 25/1, 2013 at 4:50 Comment(2)

+1 for a clever and concise solution. As for the edge cases: if you use os.lstat() rather than os.stat(), you'll also handle (broken) symlinks correctly. Another problem is that using os.[l]stat() on all directory entries requires more permissions than just enumerating them with os.listdir(), which could break if you happen to encounter an (unrelated) entry you're not permitted to stat. – Dippold 14/8, 2015 at 2:40

On the other hand, your solution has one distinct advantage: By delegating the case-insensitive lookup of the filename to os.stat(), you let the system deal with any potential NFC/NFD Unicode form discrepancy, which it knows how to handle. (HFS+ (on OSX) stores filenames in decomposed Unicode form (NFD), which causes problems when comparing to in-memory strings in most programming languages, including Python, which typically use composed Unicode form (NFC). However, note that the filenames reported (via os.listdir()) will still be in NFD.) – Dippold 14/8, 2015 at 2:41
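The points raised in these two comments could be folded into the inode idea roughly as follows. This is only a sketch under stated assumptions: the function name has_exact_case_by_inode is hypothetical, the path is assumed not to end in a separator, and the code assumes Python 3 str paths. It uses os.lstat() so that broken symlinks are handled, skips entries that cannot be stat'ed, compares the device number along with the inode number, and normalizes names to NFC before comparing them.

```python
import os
import unicodedata


def has_exact_case_by_inode(path):
    """Return True if the entry that `path` resolves to is stored under exactly this name."""
    parent, leaf = os.path.split(path)
    parent = parent or '.'
    try:
        target = os.lstat(path)  # the filesystem does the (case-insensitive) lookup
    except OSError:
        return False
    wanted = unicodedata.normalize('NFC', leaf)
    for name in os.listdir(parent):
        try:
            st = os.lstat(os.path.join(parent, name))
        except OSError:
            continue  # skip entries we are not permitted to stat
        same_file = (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino)
        if same_file and unicodedata.normalize('NFC', name) == wanted:
            return True
    return False
```

Like the original, this only checks the last path component, and it still has to stat every entry in the directory.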

You could use something like os.listdir and check if the list contains the file name you're looking for.

Depart answered 25/1, 2013 at 4:14 Comment(5)

This will break: (1) you will also have to normalize your Unicode (2) you will have to use a specific version of the Unicode database to perform the normalization. – Eutrophic 25/1, 2013 at 4:18

@DietrichEpp I don't understand what this has to do with normalizing Unicode. – Depart 25/1, 2013 at 5:0

If you want to check if the file 'é' exists, then you have to check to see if 'e\u0301' is returned by os.listdir(), not if '\u00e9' is returned by os.listdir(). If you create a file with the name '\u00e9', os.listdir() will return 'e\u0301'. – Eutrophic 25/1, 2013 at 18:9

Okay, but the way I understand the question, this wouldn't be a problem here. Say you call listdir once and save the result somewhere. Later, you call listdir again and check if a specific file in your previous result still exists with exactly the same name. Regardless of how listdir handles unicode file names, I would expect the results to be consistent if the file name doesn't change. – Depart 25/1, 2013 at 19:42

The problem is that we don't know for sure that the results came from one machine's os.listdir() to begin with. For all I know, the asker is writing a program to synchronize directories between OS X and Windows. The OS X filenames will be normalized to NFD and the Windows filenames will be normalized to NFC, so they will compare unequal in many cases. – Eutrophic 25/1, 2013 at 22:50

This answer is just a proof of concept because it doesn't attempt to escape special characters, handle non-ASCII characters, or deal with file system encoding issues.

On the plus side, the answer doesn't involve looping through files in Python, and it properly handles checking the directory names leading up to the final path segment.

This suggestion is based on the observation that (at least when using bash), the following command finds the path /my/path without error if and only if /my/path exists with that exact casing.

$ ls /[m]y/[p]ath

(If brackets are left out of any path part, then that part will not be sensitive to changes in casing.)

Here is a sample function based on this idea:

import os.path
import subprocess

def does_exist(path):
    """Return whether the given path exists with the given casing.

    The given path should begin with a slash and not end with a trailing
    slash.  This function does not attempt to escape special characters
    and does not attempt to handle non-ASCII characters, file system
    encodings, etc.
    """
    parts = []
    while True:
        head, tail = os.path.split(path)
        if tail:
            parts.append(tail)
            path = head
        else:
            assert head == '/'
            break
    parts.reverse()
    # For example, for path "/my/path", pattern is "/[m]y/[p]ath".
    pattern = "/" + "/".join(["[%s]%s" % (p[0], p[1:]) for p in parts])
    cmd = "ls %s" % pattern
    return_code = subprocess.call(cmd, shell=True)
    return not return_code

Habakkuk answered 21/12, 2014 at 12:36 Comment(1)

Neat idea, and presumably faster than the os.listdir()-based approaches. To speed things up further, you could just use glob.glob() instead of creating a shell subprocess to call ls. Just to provide a hint of what would be needed to deal with non-ASCII characters: you'd have to deal with the fact that Python's Unicode strings are NFC (composed Unicode normal form), whereas HFS+ stores names in NFD (decomposed Unicode normal form). Sadly, Python doesn't recognize NFC and NFD strings that are conceptually identical as such. – Dippold 14/8, 2015 at 3:14


You can also try to open that file.

    try:
        open('test', 'r').close()
    except IOError:
        print('File does not exist')

Daffie answered 25/1, 2013 at 5:43 Comment(2)

Because the file system is not case sensitive, the file would open successfully regardless of the case of the file name. This would not tell me what I am trying to learn. – Pilkington 25/1, 2013 at 15:35

if you are tracking file changes, maybe it's better to use inotify? – Daffie 26/1, 2013 at 0:44
