This answer complements the existing ones by providing functions, adapted from Alex L's answer, that:
import os, unicodedata
def gettruecasepath(path): # IMPORTANT: <path> must be a Unicode string
if not os.path.lexists(path): # use lexists to also find broken symlinks
raise OSError(2, u'No such file or directory', path)
isosx = sys.platform == u'darwin'
if isosx: # convert to NFD for comparison with os.listdir() results
path = unicodedata.normalize('NFD', path)
parentpath, leaf = os.path.split(path)
# find true case of leaf component
if leaf not in [ u'.', u'..' ]: # skip . and .. components
leaf_lower = leaf.lower() # if you use Py3.3+: change .lower() to .casefold()
found = False
for leaf in os.listdir(u'.' if parentpath == u'' else parentpath):
if leaf_lower == leaf.lower(): # see .casefold() comment above
found = True
if isosx:
leaf = unicodedata.normalize('NFC', leaf) # convert to NFC for return value
break
if not found:
# should only happen if the path was just deleted
raise OSError(2, u'Unexpectedly not found in ' + parentpath, leaf_lower)
# recurse on parent path
if parentpath not in [ u'', u'.', u'..', u'/', u'\\' ] and \
not (sys.platform == u'win32' and
os.path.splitdrive(parentpath)[1] in [ u'\\', u'/' ]):
parentpath = gettruecasepath(parentpath) # recurse
return os.path.join(parentpath, leaf)
def istruecasepath(path): # IMPORTANT: <path> must be a Unicode string
return gettruecasepath(path) == unicodedata.normalize('NFC', path)
gettruecasepath()
gets the case-exact representation as stored in the filesystem of the specified path (absolute or relative) path, if it exists:
- The input path must be a Unicode string:
- Python 3.x: strings are natively Unicode - no extra action needed.
- Python 2.x: literals: prefix with
u
; e.g., u'Motörhead'
; str variables: convert with, e.g., strVar.decode('utf8')
- The string returned is a Unicode string in NFC (composed normal form). NFC is returned even on OSX, where the filesystem (HFS+) stores names in NFD (decomposed normal form).
NFC is returned, because it is far more common than NFD, and Python
doesn't recognize equivalent NFC and NFD strings as (conceptually) identical. See below for background information.
- The path returned retains the structure of the input path (relative vs. absolute, components such as
.
and ..
), except that multiple path separators are collapsed, and, on Windows, the returned path always uses \
as the path separator.
- On Windows, a drive / UNC-share component, if present, is retained as-is.
- An
OSError
exception is thrown if the path does not exist, or if you do not have permission to access it.
- If you use this function on a case-sensitive filesystem, e.g., on Linux with ext4, it effectively degrades to indicating whether the input path exists in the exact case specified or not.
istruecasepath()
uses gettruecasepath()
to compare the input path to the path as stored in the filesystem.
Caveat: Since these functions need to examine all directory entries at every level of the input path (as specified), they will be slow - unpredictably so, as performance will correspond to how many items the directories examined contain. Read on for background information.
Background
Native API support (lack thereof)
It is curious that neither OSX nor Windows provide a native API method that directly solves this problem.
While on Windows you can cleverly combine two API methods to solve the problem, on OSX there is no alternative that I'm aware of to the - unpredictably - slow enumeration of directory contents on each level of the path examined, as employed above.
Unicode normal forms: NFC vs. NFD
HFS+ (OSX' filesystem) stores filenames in decomposed Unicode form (NFD), which causes problems when comparing such names to in-memory Unicode strings in most programming languages, which are usually in composed Unicode form (NFC).
For instance, a path with non-ASCII character ü
that you specify as a literal in your source code will be represented as single Unicode codepoint, U+00FC
; this is an example of NFC: the 'C' stands for composed, because the letter base letter u
and its diacritic ¨
(a combining diaeresis) form a single letter.
By contrast, if you use ü
as a part of an HFS+ filename, it is translated to NFD form, which results in 2 Unicode codepoints: the base letter u
(U+0075
), followed by the combining diaeresis (̈
, U+0308
) as a separate codepoint; the 'D' stands for decomposed, because the character is decomposed into the base letter and its associated diacritic.
Even though the Unicode standard deems these 2 representations (canonically) equivalent, most programming languages, including Python, do not recognize such equivalence.
In the case of Python, you must use unicodedata.normalize()
to convert both strings to the same form before comparing.
(Side note: Unicode normal forms are separate from Unicode encodings, though the differing numbers of Unicode code points typically also impact the number of bytes needed to encode each form. In the example above, the single-codepoint ü
(NFC) requires 2 bytes to encode in UTF-8 (U+00FC
-> 0xC3 0xBC
), whereas the two-codepoint ü
(NFD) requires 3 bytes (U+0075
-> 0x75
, and U+0308
-> 0xCC 0x88
)).