How to remove bad path characters in Python?
M

6

48

What is the most cross-platform way of removing bad path characters (e.g. "\" or ":" on Windows) in Python?

Solution

Because there seems to be no ideal solution, I decided to be relatively restrictive and used the following code:

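def remove(value, deletechars):
    """Return value with every character in deletechars removed."""
    for c in deletechars:
        value = value.replace(c, '')
    return value


filename = 'exa:mple?.txt'  # hypothetical input, for illustration only
# Raw string so the backslash is taken literally.
print(remove(filename, r'\/:*?"<>|'))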
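
As a possible alternative, the same deletion can be done in a single pass with str.translate. This is only a sketch, assuming Python 3; the names BAD_CHARS and remove_bad_chars are made up for illustration and are not part of the original answer.

```python
# Characters rejected by Windows; the same set the answer above deletes.
BAD_CHARS = r'\/:*?"<>|'

# With three arguments, str.maketrans maps every character in the third
# argument to None, so translate() drops them all in one pass.
_DELETE_TABLE = str.maketrans('', '', BAD_CHARS)


def remove_bad_chars(value):
    """Return value with every character in BAD_CHARS removed."""
    return value.translate(_DELETE_TABLE)


print(remove_bad_chars('re:port*2009?.txt'))  # report2009.txt
```

The translation table is built once, so repeated calls do not loop over the bad characters again.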
Manuel answered 23/6, 2009 at 15:45 Comment(2)
maybe a little faster, if the path is long: "".join(i for i in value if i not in r'\/:*?"<>|')Boni
@fortran, this should be an answer, not a comment - it's very 'pythonic' in my personal opinion. Thanks.Dimer
S
27

Unfortunately, the set of acceptable characters varies by OS and by filesystem.

  • Windows:

    • Use almost any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
      • The following reserved characters are not allowed:
        < > : " / \ | ? *
      • Characters whose integer representations are in the range from zero through 31 are not allowed.
      • Any other character that the target file system does not allow.

    The list of accepted characters can vary depending on the OS and locale of the machine that first formatted the filesystem.

    .NET has GetInvalidFileNameChars and GetInvalidPathChars, but I don't know how to call those from Python.

  • Mac OS: NUL is always excluded, "/" is excluded from POSIX layer, ":" excluded from Apple APIs
    • HFS+: any sequence of non-excluded characters that is representable by UTF-16 in the Unicode 2.0 spec
    • HFS: any sequence of non-excluded characters representable in MacRoman (default) or other encodings, depending on the machine that created the filesystem
    • UFS: same as HFS+
  • Linux:
    • native (UNIX-like) filesystems: any byte sequence excluding NUL and "/"
    • FAT, NTFS, other non-native filesystems: varies

Your best bet is probably to either be overly-conservative on all platforms, or to just try creating the file name and handle errors.
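
Both options can be sketched roughly as follows, assuming Python 3. The helper names strip_reserved and can_create are hypothetical, and the reserved set is simply the Windows list quoted above plus control characters, used here as a conservative choice for every platform.

```python
import os
import tempfile

# Reserved on Windows per the list above; assumed here to be a
# conservative superset for the other platforms mentioned.
RESERVED = set('<>:"/\\|?*')


def strip_reserved(name):
    """Drop reserved characters and control characters (code points 0-31)."""
    return ''.join(c for c in name if c not in RESERVED and ord(c) > 31)


def can_create(directory, name):
    """Try to create `name` inside `directory`; report whether it worked."""
    path = os.path.join(directory, name)
    try:
        # 'x' mode fails instead of overwriting, so an existing file
        # also makes this return False.
        with open(path, 'x'):
            pass
    except (OSError, ValueError):
        # OSError: rejected by the OS or filesystem.
        # ValueError: raised by Python for an embedded NUL character.
        return False
    os.remove(path)
    return True


with tempfile.TemporaryDirectory() as tmp:
    print(strip_reserved('a<b>:c\x00.txt'))  # abc.txt
    print(can_create(tmp, 'plain.txt'))      # True
    print(can_create(tmp, 'bad\x00name'))    # False
```

The probe only tells you about the filesystem that holds the given directory, so it should be run against the real target location.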

Skald answered 23/6, 2009 at 16:21 Comment(3)
Note that on Windows, you'll also have issues if you try to use filenames like CON.*. And spaces at the end of a filename tend to cause problems too.Baseman
@Baseman Yes, the legacy DOS device names cannot be used as filenames in Win32. But the filesystem supports them just fine, and using the NT APIs to get around Win32 works fine. (At least, as far as I recall; I haven't got a Windows machine to test on anymore.)Skald
You may be able to do it using NT APIs, but Python can't. Python on windows is unfortunately restricted in filename handling. The worst part is that often times the bad filenames will fail silently or give you a different file than what you asked for (try opening CON in a script run from the console).Baseman
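
The reserved-name problem raised in these comments could be checked up front with something like the sketch below. The function name is_risky_on_windows is hypothetical. The device-name list (CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9) and the trailing-dot rule come from Microsoft's file naming guidance rather than from this thread.

```python
# Legacy DOS device names reserved by Win32. They are case-insensitive
# and stay reserved whatever the extension (CON.txt is still CON).
DEVICE_NAMES = {'CON', 'PRN', 'AUX', 'NUL'}
DEVICE_NAMES |= {'COM%d' % i for i in range(1, 10)}
DEVICE_NAMES |= {'LPT%d' % i for i in range(1, 10)}


def is_risky_on_windows(name):
    """Return True for a device name or a name ending in a space or dot."""
    if name != name.rstrip(' .'):
        return True
    # Only the part before the first dot decides whether it is a device.
    stem = name.split('.', 1)[0]
    return stem.upper() in DEVICE_NAMES


print(is_risky_on_windows('con.txt'))      # True
print(is_risky_on_windows('report.txt '))  # True
print(is_risky_on_windows('console.txt'))  # False
```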
P
48

I think the safest approach here is to just replace any suspicious characters. So, I think you can just replace (or get rid of) anything that isn't alphanumeric, -, _, a space, or a period. And here's how you do that:

import re
re.sub(r'[^\w_. -]', '_', filename)

The above replaces every character that is not a letter, digit, '_', '-', '.' or space with an '_'. So, if you're looking at an entire path, you'll want to add os.sep to the list of approved characters as well.

Here's some sample output:

In [27]: re.sub(r'[^\w\-_\. ]', '_', r'some\*-file._n\\ame')
Out[27]: 'some__-file._n__ame'
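
For whole paths, the whitelist can be extended with the separator as described above. The following is a sketch only, assuming Python 3; sanitize_path and its sep and collapse parameters are hypothetical names chosen for this illustration.

```python
import os
import re


def sanitize_path(path, sep=os.sep, collapse=False):
    """Replace characters outside the whitelist with '_', keeping `sep`."""
    # re.escape is needed because sep is a backslash on Windows.
    pattern = r'[^\w. \-' + re.escape(sep) + r']'
    if collapse:
        # Treat a run of rejected characters as one match.
        pattern += '+'
    return re.sub(pattern, '_', path)


print(sanitize_path('docs/dr*ft?.txt', sep='/'))               # docs/dr_ft_.txt
print(sanitize_path('docs/a<>b.txt', sep='/', collapse=True))  # docs/a_b.txt
```

In Python 3, \w matches Unicode word characters by default; passing flags=re.ASCII to re.sub would restrict it to ASCII letters, digits and '_'.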
Pasley answered 27/11, 2012 at 22:1 Comment(5)
+1, helpful answer. Do these backslashes need to be escaped though?Helianthus
Better to use a r'raw string'.Stipendiary
Yeah ... I think if you don't use r'...', you'll still need a backslash in front of each of those backslashes. Thus a total of 10 backslashes.Helianthus
Looks like I got carried away with the last edit. It was right exactly as it was. Keep in mind that it's only allowing specific characters (not excluding a set of characters). Raw string was unnecessary. See my clarification and the sample output in the updated answer.Pasley
If you want to combine multiple escaped characters into a single _ add a + to the regex search string: '[^\w\-_\. ]+'Gook
S
2

You can use the sanitize_filepath() function from the cross-platform pathvalidate package, which removes characters that are invalid in a path:

from pathvalidate import sanitize_filepath
filename = sanitize_filepath(filename)
Safeguard answered 10/2, 2023 at 6:47 Comment(4)
Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, can you edit your answer to include an explanation of what you're doing and why you believe it is the best approach?Dysphemism
Heads up: This solution works fine but requires installing a 3rd party package: pathvalidateFurore
still not working using the packageAllyn
The only case where this solution does not help is when the file name itself contains slashes, e.g. 'hello\world' or 'hello/world'. Such a value is accepted as valid, but it then refers to a file 'world' in the folder 'hello'. Those characters should be replaced beforehand, for example with .replace('\\','_'), or removed with .replace('\\',''). In other cases it helped me a lot.Safeguard
V
-1

That character is in os.sep; it'll be "\" or ":", depending on which system you're on.

Virg answered 23/6, 2009 at 15:48 Comment(1)
That doesn't include :"%/<>^|?, which are also illegal file characters in Windows.Skald
F
-1

If you are using Python, try os.path to avoid cross-platform issues with paths.

Fatigue answered 23/6, 2009 at 16:8 Comment(1)
Which part of os.path helps with determining legal filenames? .supports_unicode_filenames maybe a little, but that's not enough.Skald
P
-1

I spent a lot of time on this issue while writing code that reads many CSV and Excel files without knowing which file was causing the problem. I tried several approaches and finally settled on a simple trick to get past the error.

Just try this:

import pandas as pd

try:
    df = pd.read_excel(file_path)
except Exception:
    # Retry with the first character of the path dropped.
    df = pd.read_excel(file_path[1:])

It first tries to read the file. If that raises an error, it drops the first character of the path (the bad character in my case, caused by a Windows or other OS issue), reads the file again, and lets the code continue.

It works best when the error is caused by a bad leading character.

Phthalocyanine answered 1/12, 2023 at 21:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.