What is the most correct regular expression for a UNIX file path?
What is the most correct regular expression (regex) for a UNIX file path?

For example, to detect something like this:

/usr/lib/libgccpp.so.1.0.2

It's pretty easy to write a regular expression that will match most file paths, but what's the best one, including one that can handle escaped whitespace sequences and unusual characters you don't usually find in UNIX file paths?

Also, are there library functions in several different programming languages that provide a file path regex?

Referent answered 11/2, 2009 at 16:57 Comment(1)
"escaped whitespace sequences"? Using what escape syntax? UNIX paths have no such escapes. sh/ksh/bash have a mostly common escape syntax, URL's have another, Perl yet another.Cozmo
F
14

If you don't mind false positives for identifying paths, then you really just need to ensure the path doesn't contain a NUL character; everything else is permitted (in particular, / is the name-separator character). The better approach would be to resolve the given path using the appropriate file IO function (e.g. File.exists(), File.getCanonicalFile() in Java).
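
As a rough sketch of the "ask the file system instead" approach, here is a Python analogue of the Java calls mentioned above. The function name resolve_existing_path is made up for illustration, and os.path.exists / os.path.realpath stand in for File.exists() / File.getCanonicalFile(); treat it as one possible shape, not the definitive check.

```python
import os


def resolve_existing_path(candidate: str):
    """Return the canonical path if `candidate` exists, otherwise None.

    Hypothetical helper: the name and the None-on-failure convention are
    assumptions made for this example.
    """
    # NUL can never appear in a UNIX path, and the empty string names nothing.
    if candidate == "" or "\x00" in candidate:
        return None
    # Let the operating system decide whether the path resolves to something.
    if not os.path.exists(candidate):
        return None
    # realpath collapses "..", repeated slashes, and symlinks.
    return os.path.realpath(candidate)
```

Note that this only answers "does this path exist right now?", which is a different question from "is this string a syntactically legal path?".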

Long answer:

This is both operating system and file system dependent. For example, the Wikipedia comparison of file systems notes that besides the limits imposed by the file system,

MS-DOS, Microsoft Windows, and OS/2 disallow the characters \ / : ? * " > < | and NUL in file and directory names across all filesystems. Unices and Linux disallow the characters / and NUL in file and directory names across all filesystems.

In Windows, the following reserved device names are also not permitted as filenames:

CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5,
COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, 
LPT5, LPT6, LPT7, LPT8, LPT9
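
If you need to screen for those names, a minimal sketch might look like the following. The helper name is hypothetical, and two behaviours are assumptions taken from the con.h anecdote in the comments on this answer: the comparison is case-insensitive and anything after the first dot is ignored.

```python
# Reserved device names exactly as listed above.
RESERVED_WINDOWS_NAMES = frozenset(
    ["CON", "PRN", "AUX", "NUL"]
    + [f"COM{i}" for i in range(1, 10)]
    + [f"LPT{i}" for i in range(1, 10)]
)


def is_reserved_windows_name(filename: str) -> bool:
    """Hypothetical check for a single file name (not a full path)."""
    # Assumption: the extension does not matter, so "con.h" counts as "CON".
    stem = filename.split(".", 1)[0]
    # Assumption: the comparison ignores case.
    return stem.upper() in RESERVED_WINDOWS_NAMES
```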
Forwent answered 11/2, 2009 at 17:13 Comment(3)
Additional: because of the variety between file systems, there are methods that get you the information you need.Firstclass
Those Win special devices are even worse than you think. I once renamed a C header from const.h to con.h and the compiler seemed to hang. Took a while to figure out it was reading the header file from the console because Win ignored the extension. Caveat: this may have been DOS, it was a long time ago.Glarum
Useful information, but I don't understand why this non-answer is accepted...?Osteopathy
C
15

The proper regular expression to match all UNIX paths is: [^\0]+

That is, one or more characters that are not a NUL.
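
For illustration, the same pattern in Python's re syntax, with \x00 spelling the NUL character explicitly. The function name is invented for this sketch, and fullmatch is used so that the whole string has to qualify.

```python
import re

# One or more characters, none of which is NUL.
# A negated character class also matches newlines, which are legal in paths.
UNIX_PATH = re.compile(r"[^\x00]+")


def is_plausible_unix_path(candidate: str) -> bool:
    """Hypothetical helper: True if the whole string could be a UNIX path."""
    return UNIX_PATH.fullmatch(candidate) is not None
```

This accepts things like "//" and names containing spaces or newlines, and rejects only the empty string and anything containing NUL.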

Cozmo answered 11/2, 2009 at 17:29 Comment(1)
and '//' is a valid path, with or without the ''sLatrishalatry
E
11

To others who have answered this question, it's important to note that some applications would require a slightly different regex, depending on how escape characters work in the program you're writing. If you were writing a shell, for example, and wanted to have commands separated by spaces and other special characters, you would have to modify your regex to only include words with special characters if those characters are escaped.

So, for example, a valid path would be

  /usr/bin/program\ with\ space 

as opposed to

  /usr/bin/program with space 

which would refer to "/usr/bin/program" with arguments "with" and "space".

A regex for the above example could be "([^\0 ]\|\\ )*"

The regex that I've been working on is (newline separated for 'readability'):

  "\(                    # Either
       [^\0 !$`&*()+]    # A normal (non-special) character
     \|                  # Or
       \\\(\ |\!|\$|\`|\&|\*|\(|\)|\+\)   # An escaped special character
   \)\+"                   # Repeated >= 1 times

Which translates to

  "\([^\0 !$`&*()+]\|\\\(\ |\!|\$|\`|\&|\*|\(|\)|\+\)\)\+"

Creating your own specific regex should be relatively simple, as well.
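
As a sketch of the same idea in Python's re syntax (where grouping and alternation are written without backslashes), something like the following could work. The names are invented for the example. One deliberate difference, which is my assumption and not part of the regex above: the backslash is excluded from the "normal" class, so a backslash only matches when it escapes one of the listed special characters.

```python
import re

# A word is one or more of:
#   - a normal character: not NUL, not a backslash, not one of  !$`&*()+ or space
#   - a backslash followed by one of those special characters
SHELL_WORD = re.compile(r"(?:[^\x00\\ !$`&*()+]|\\[ !$`&*()+])+")


def is_single_shell_word(text: str) -> bool:
    """Hypothetical helper: True if `text` is one escaped word, e.g. one path."""
    return SHELL_WORD.fullmatch(text) is not None


def first_shell_word(text: str) -> str:
    """Hypothetical helper: the leading word of `text`, or "" if there is none."""
    found = SHELL_WORD.match(text)
    return found.group(0) if found else ""
```

With the two example inputs above, the escaped form is accepted as a single word, while the unescaped form stops at the first space and yields only /usr/bin/program.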

Embalm answered 6/4, 2012 at 18:18 Comment(1)
As an alternative to enumerating all of the escaped characters, you can simply make a group that consists of the escape followed by the class of escaped characters ([^ !$`&*()+]|(\\[ !$`&*()+]))+Feldspar
S
7
^(/)?([^/\0]+(/)?)+$

This will accept every path that is legal in filesystems such as extX, reiserfs.

It discards only the path names containing the NUL or double (or more) slashes. Everything else according to Unix spec should be legal (I'm surprised with this outcome too).

Strapless answered 7/12, 2012 at 12:7 Comment(4)
double slashes are perfectly fine in unix paths, both in POSIX and in practice, so your regex is incorrect. the only character (or rather, octet) not allowed in unix pathnames is \0Latrishalatry
@RememberMonica are you saying a path like /var///test/file.txt is valid?Osteopathy
@Osteopathy Yes that's a perfectly valid file path. /var///test/file.txt and /var/test/file.txt are equivalent. This convention makes some file path operations simpler. E.g. userProvidedPath + "/filename.txt" works whether userProvidedPath contains a trailing slash or not.Meadors
Note that a variant of this regexp has proven to be susceptible to catastrophic backtracking for us, at least on Ruby-embedded Oniguruma, if the input string contains multiple forward slashes following each other. Something to keep in mind.Drivel
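
To make the points from the comments concrete, here is a small sketch that runs this answer's regex in Python's re syntax (\x00 is NUL) next to posixpath.normpath. The helper name is invented. The inputs are kept short on purpose, since the nested quantifiers are the kind of construct the backtracking comment warns about.

```python
import posixpath
import re

# The regex from this answer, transcribed to Python syntax.
STRICT_PATH = re.compile(r"^(/)?([^/\x00]+(/)?)+$")


def strict_accepts(path: str) -> bool:
    """Hypothetical helper: does the strict regex accept `path`?"""
    return STRICT_PATH.match(path) is not None


print(strict_accepts("/var/test/file.txt"))          # True
print(strict_accepts("/var///test/file.txt"))        # False, although the path is valid
print(posixpath.normpath("/var///test/file.txt"))    # /var/test/file.txt
```

The regex also rejects the root directory "/" on its own, because it requires at least one non-slash character.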
K
4

I'm not sure how common a regex check for this is across systems, but most programming languages (especially the cross-platform ones) provide a "file exists" check, which will take this kind of thing into account.

Out of curiosity, where are these paths being input? Could you control that to a greater degree, to the point where you won't have to check the individual pieces of the path? For example, by using a file chooser dialog?

Kurdish answered 11/2, 2009 at 17:19 Comment(0)
N
1

Question already answered here: https://mcmap.net/q/478197/-c-regex-for-file-paths-e-g-c-test-test-exe

Nitwit answered 4/2, 2017 at 3:4 Comment(0)
