How to remove illegal characters from path and filenames?

E

30

602

I need a robust and simple way to remove illegal path and file characters from a simple string. I've used the code below, but it doesn't seem to do anything. What am I missing?

using System;
using System.IO;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string illegal = "\"M<>\"\\a/ry/ h**ad:>> a\\/:*?\"<>| li*tt|le|| la\"mb.?";

            illegal = illegal.Trim(Path.GetInvalidFileNameChars());
            illegal = illegal.Trim(Path.GetInvalidPathChars());

            Console.WriteLine(illegal);
            Console.ReadLine();
        }
    }
}

Envision answered 28/9, 2008 at 15:52 Comment(6)

Trim removes characters from the beginning and end of strings. However, you probably should ask why the data is invalid, and rather than try and sanitize/fix the data, reject the data. – Rumen 28/9, 2008 at 15:54

Unix-style names are not valid on Windows, and I don't want to deal with 8.3 short names. – Envision 16/10, 2009 at 12:4

GetInvalidFileNameChars() will strip things like : \ etc from folder paths. – Addington 20/5, 2016 at 3:18

Path.GetInvalidPathChars() doesn't seem to strip * or ? – Addington 20/5, 2016 at 3:24

I tested five answers from this question (timed loop of 100,000) and the following method is the fastest. The regular expression took 2nd place, and was 25% slower : public string GetSafeFilename(string filename) { return string.Join("_", filename.Split(Path.GetInvalidFileNameChars())); } – Inkwell 15/7, 2016 at 15:20

I added a new fast alternative, and some benchmarks in this answer – Coleridge 29/9, 2020 at 14:7

C

555

Try something like this instead:

string illegal = "\"M\"\\a/ry/ h**ad:>> a\\/:*?\"| li*tt|le|| la\"mb.?";
string invalid = new string(Path.GetInvalidFileNameChars()) + new string(Path.GetInvalidPathChars());

foreach (char c in invalid)
{
    illegal = illegal.Replace(c.ToString(), ""); 
}

But I have to agree with the comments: I'd probably try to deal with the source of the illegal paths, rather than try to mangle an illegal path into a legitimate but probably unintended one.

Edit: Or a potentially 'better' solution, using regular expressions.

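// Requires: using System.IO; using System.Text.RegularExpressions;
string illegal = "\"M\"\\a/ry/ h**ad:>> a\\/:*?\"| li*tt|le|| la\"mb.?";
// Build a character class from every invalid character, escaped for regex use.
string regexSearch = new string(Path.GetInvalidFileNameChars()) + new string(Path.GetInvalidPathChars());
Regex r = new Regex(string.Format("[{0}]", Regex.Escape(regexSearch)));
illegal = r.Replace(illegal, "");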

Still, it's worth asking why you're doing this in the first place.
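The foreach loop above creates a new string for every invalid character it checks. As a minimal sketch of a single-pass alternative (not part of the original answer), the following filters the input once through a StringBuilder. The class name PathSanitizer and both method names are hypothetical, chosen only for illustration. The HashSet also removes the duplicate entries that come from combining the two lists.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class PathSanitizer
{
    // Hypothetical helper: returns 'input' without any character found in 'invalidChars'.
    public static string RemoveChars(string input, IEnumerable<char> invalidChars)
    {
        // A set gives constant-time lookups and silently drops duplicate characters.
        var invalid = new HashSet<char>(invalidChars);
        var sb = new StringBuilder(input.Length);

        foreach (char c in input)
        {
            if (!invalid.Contains(c))
            {
                sb.Append(c);
            }
        }

        return sb.ToString();
    }

    // Hypothetical convenience wrapper that uses the current platform's lists.
    // The result depends on the operating system the code runs on.
    public static string RemoveInvalidPathAndFileNameChars(string input)
    {
        var invalid = new HashSet<char>(Path.GetInvalidFileNameChars());
        invalid.UnionWith(Path.GetInvalidPathChars());
        return RemoveChars(input, invalid);
    }
}
```

For a string the length of a typical path the difference is unlikely to matter much; the main benefit is that the input is scanned once regardless of how many invalid characters there are.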

Creigh answered 28/9, 2008 at 16:3 Comment(22)

I don't know if I should +1 your answer for having such an ill-performing solution that will push the user away from that path, or if I should +1 your answer for it actually answering his question! :) – Rumen 28/9, 2008 at 16:5

@Michael Stum: they get 'compiled' and should be some sort of state machine, but it would be naive to assume they are guaranteed to be any more efficient under the hood than a loop. – Rumen 28/9, 2008 at 16:10

On something the length of a path, it probably wouldn't make that much of a difference. On a longer string, I imagine the regex would be faster though. – Creigh 28/9, 2008 at 16:15

I'd stick to the non-regex solution: it's likely to be more efficient most of the time. If using the regex solution, change string.Format() to just "["+"...". If you're going to treat illegal as a file name without path after replacing special chars then you'd only need Path.GetInvalidFileNameChars(). – Slosberg 19/8, 2010 at 17:58

It's not necessary to append the two lists together. The illegal file name char list contains the illegal path char list and has a few more. Here are both lists cast to int:

File name chars: 34,60,62,124,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,58,42,63,92,47

Path chars: 34,60,62,124,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31

– Dill 11/4, 2011 at 18:12

@sjbotha this may be true on Windows and Microsoft's implementation of .NET, but I'm not willing to make the same assumption for, say, Mono running on Linux. – Creigh 17/4, 2011 at 1:24

Regarding the first solution. Shouldn't a StringBuilder be more efficient than the string assignments? – Lappet 30/12, 2011 at 15:53

If the string contains Chinese characters, the solution could fail. – Nd 2/1, 2012 at 5:13

@PerlDev: Have you actually tested that? Characters should be multi-byte compatible (sizeof(char) == 2), so it shouldn't be an issue. The regex solution should be fine also. – Creigh 17/1, 2012 at 8:47

What's the problem with sanitization, Bob Tables? – Kacikacie 8/11, 2013 at 21:2

Correct me if I'm wrong, but calling both Path.GetInvalidFileNameChars() and Path.GetInvalidPathChars() is superfluous. Path.GetInvalidFileNameChars() alone should be sufficient. – Asti 13/11, 2013 at 18:34

@JoeyAdams: see my reply to Sarel Botha. In short, one is a superset of the other on Windows. Personally, I'm not willing to make the same bet cross platform and C# and .NET in general is getting a wider and wider audience via Mono all the time. – Creigh 15/11, 2013 at 8:18

For what it's worth, @MatthewScharley, the Mono implementation of GetInvalidPathChars() returns only 0x00 and GetInvalidFileNameChars() returns only 0x00 and '/' when running on non-Windows platforms. On Windows, the list of invalid characters is much longer, and GetInvalidPathChars() is entirely duplicated inside GetInvalidFileNameChars(). This isn't going to change in the foreseeable future, so all you're really doing is doubling the amount of time this function takes to run because you're worried that the definition of a valid path will change sometime soon. Which it won't. – Orthotropic 27/1, 2014 at 19:9

And let's be super-clear about this: This part of the Mono source code hasn't changed in EIGHT YEARS except for a minor perf improvement in 2007. – Orthotropic 27/1, 2014 at 19:11

@Warren: Feel free to dedupe the resultant string if you really are worried, but let's be perfectly honest here: The difference between 20 and 40 iterations against a string the length of your average path (let's say 100 characters to be generous) will make exactly no difference to the runtime of your function. For all practical purposes, there's no need to worry about it. On the other hand, these two functions do serve different purposes and (in my mind at least), it would be perfectly reasonable for one function to not return a superset of the other for some given file system. – Creigh 29/1, 2014 at 5:40

How can doing double the work (whether it's deduplicating the array, or running through almost precisely the same array values twice) take "exactly no difference"? You know as well as I do that this is incorrect, so -don't- -say- -it-. We're trying to be an educational resource at Stackoverflow, not a place for rhetorical flourishes prompted by being told you're wrong. Let's be clear: What you're recommending here is effectively the same as the old Daily WTF canard about providing your own definition of TRUE and FALSE because you can't trust the compiler or libraries to always get it right. – Orthotropic 29/1, 2014 at 16:43

GetInvalidFileNameChars() is always -- ALWAYS, you hear me -- going to include everything in GetInvalidPathChars() because it isn't possible for a file name to contain a character that isn't valid in a path name. No file system allows this today, no file system ever will. And anyways, Microsoft's own documentation for these functions is very clear in stating that you should not expect the list of characters to be guaranteed as accurate because file systems might support something different anyways. – Orthotropic 29/1, 2014 at 16:52

I'd probably side with Matthew here and just say that assumption is the mother of all mess ups. You are talking about optimising code which probably doesn't need optimizing over potential correctness. I'd take the correctness over the premature optimisation any day – Unroot 15/3, 2014 at 17:50

@Unroot this discussion is so unnecessary... code should always be optimized and there is no risk of this to be incorrect. A filename is a part of the path, too. So it is just illogical that GetInvalidPathChars() could contain characters that GetInvalidFileNameChars() wouldn't. You are not taking correctness over "premature" optimisation. You are simply using bad code. – Kiley 9/8, 2014 at 11:54

Personally I would prefer this way:

var invalid = Path.GetInvalidFileNameChars().Union(Path.GetInvalidPathChars());
foreach (char c in invalid)
    illegal = illegal.Replace(c.ToString(), "_");

– Duly 9/9, 2015 at 12:20

I'm not sure why you guys are so nosy about why he wants to use it. There are various legit scenarios where this would be useful. Our app, for example, outputs xlsx files to email as reports, and if we don't validate the filename on entry, you won't know until the scheduled time of creation of the report that it was invalid. We've had issues in the past where someone accidentally entered a less-than sign in the filename and saved it. Plus some of our clients run Linux and some run Windows, so the allowed file names aren't the same. – Norland 30/11, 2018 at 17:51

@JohnLord another common use case is dealing with filenames coming in from outside emails. You cannot control the file name being sent to you. You can, of course, throw away the original and replace it with something of your own devising, but there are cases where you want to retain as much of the original as possible for AI purposes. – Runagate 3/6, 2020 at 17:0

N

618

The original question asked to "remove illegal characters":

public string RemoveInvalidChars(string filename)
{
    return string.Concat(filename.Split(Path.GetInvalidFileNameChars()));
}

You may instead want to replace them:

public string ReplaceInvalidChars(string filename)
{
    return string.Join("_", filename.Split(Path.GetInvalidFileNameChars()));    
}

This answer was on another thread by Ceres; I really like it because it's neat and simple.
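To illustrate how the Split-based approach behaves, here is a small sketch that uses a fixed, made-up character set (SampleInvalid) instead of Path.GetInvalidFileNameChars(), so the results don't depend on the operating system. The class and method names are hypothetical. Split cuts the string at every listed character, and Concat or Join then glues the pieces back together. The third variant is an assumption about what some callers may want: StringSplitOptions.RemoveEmptyEntries collapses runs of invalid characters into a single replacement and drops leading or trailing ones.

```csharp
using System;

public static class SplitJoinDemo
{
    // Illustrative character set only; real code would likely pass
    // Path.GetInvalidFileNameChars() here instead.
    private static readonly char[] SampleInvalid = { '<', '>', ':', '?' };

    // Removes every listed character.
    public static string Remove(string name)
    {
        return string.Concat(name.Split(SampleInvalid));
    }

    // Replaces every listed character with an underscore, one for one.
    public static string Replace(string name)
    {
        return string.Join("_", name.Split(SampleInvalid));
    }

    // Replaces each run of listed characters with a single underscore and
    // drops runs at the start or end of the name.
    public static string ReplaceCollapsed(string name)
    {
        return string.Join("_", name.Split(SampleInvalid, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

With the input "re<>port?.txt", Remove gives "report.txt", Replace gives "re__port_.txt", and ReplaceCollapsed gives "re_port_.txt".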

Nihi answered 20/4, 2014 at 13:6 Comment(4)

To precisely answer the OP's question, you would need to use "" instead of "_", but your answer probably applies to more of us in practice. I think replacing illegal characters with some legal one is more commonly done. – Rhyner 8/1, 2016 at 20:27

I tested five methods from this question (timed loop of 100,000) and this method is the fastest one. The regular expression took 2nd place, and was 25% slower than this method. – Inkwell 15/7, 2016 at 15:19

To address @BH 's comment, one can simply use string.Concat(name.Split(Path.GetInvalidFileNameChars())) – Leoraleos 7/6, 2017 at 14:6

Surprisingly, the Split/Join code is about as fast as a foreach loop; it has the same performance. – Conjugal 19/10, 2021 at 20:12
