How can you strip non-ASCII characters from a string? (in C#)
Asked Answered

17

282

How can you strip non-ASCII characters from a string? (in C#)

Earthward answered 23/9, 2008 at 19:45 Comment(1)
Per sinelaw's answer below, if you instead want to replace non-ASCII characters, see this answer instead.Veinule
495
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

The ^ is the negation operator: it tells the regex to find everything that doesn't match the range, instead of everything that does. The \u####-\u#### syntax specifies which characters to match. \u0000-\u007F is equivalent to the first 128 characters of UTF-8 or Unicode, which are always the ASCII characters. So the expression matches every non-ASCII character (because of the negation) and replaces each match with the empty string.

(as explained in a comment by Gordon Tucker Dec 11, 2009 at 21:11)
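If you call this in a loop, it may be worth wrapping it in a small helper with a precompiled pattern (a sketch; the AsciiFilter/StripNonAscii names and RegexOptions.Compiled are just illustrative choices):

using System.Text.RegularExpressions;

static class AsciiFilter
{
    private static readonly Regex NonAscii = new Regex(@"[^\u0000-\u007F]+", RegexOptions.Compiled);

    // Removes (does not transliterate) every character outside the ASCII range.
    public static string StripNonAscii(string input) => NonAscii.Replace(input, string.Empty);
}

// AsciiFilter.StripNonAscii("søme string") returns "sme string".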

Earthward answered 23/9, 2008 at 19:46 Comment(7)
Range for printable characters is 0020-007E, for people looking for regular expression to replace non-printable charactersGodewyn
If you wish to see a table of the ASCII character set: asciitable.comKurtzig
Range for extended ASCII is \u0000-\u00FF, for people looking for regular expression to replace non extended ASCII characters (i.e. for apps with Spanish language, diacritics etc...)Associative
@GordonTucker \u0000-\u007F is the equivalent of the first 127 characters in utf-8 or unicode and NOT the first 255. See tableAssociative
@Associative Which is why I replied to myself about a minute later correcting myself to say it was 127 and not 255. :)Placement
But 0000-0010 also contains non-printable characters like NUL, SOH, STX etc.Creighton
what about this Regex.Replace(str, @"\p{C}+", string.Empty);Creighton
160

Here is a pure .NET solution that doesn't use regular expressions:

string inputString = "Räksmörgås";
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
            ),
        Encoding.UTF8.GetBytes(inputString)
    )
);

It may look cumbersome, but it is fairly intuitive. It uses the .NET ASCII encoding to convert the string. UTF-8 is used as the intermediate encoding because it can represent any of the original characters, and an EncoderReplacementFallback with an empty replacement string drops every non-ASCII character during the conversion.
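Wrapped up as a helper with the fallback encoding cached, it is easier to reuse (a sketch; AsciiConverter/ToAscii are just illustrative names):

using System.Text;

static class AsciiConverter
{
    // ASCII encoding whose encoder fallback is an empty string, so every
    // non-ASCII character is simply dropped during conversion.
    private static readonly Encoding AsciiWithDrop = Encoding.GetEncoding(
        Encoding.ASCII.EncodingName,
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    public static string ToAscii(string input) =>
        AsciiWithDrop.GetString(
            Encoding.Convert(Encoding.UTF8, AsciiWithDrop, Encoding.UTF8.GetBytes(input)));
}

// AsciiConverter.ToAscii("Räksmörgås") returns "Rksmrgs".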

Ollayos answered 25/9, 2008 at 19:32 Comment(12)
Perfect! I'm using this to clean a string before saving it to a RTF document. Very much appreciated. Much easier to understand than the Regex version.Vedis
You really find it easier to understand? To me, all the stuff that's not really relevant (fallbacks, conversions to bytes etc) is drawing the attention away from what actually happens.Ollayos
@Brandon, actually, this technique doesn't do the job better than other techniques. So the analogy would be using a plain olde screwdriver instead of a fancy iScrewDriver Deluxe 2000. :)Ollayos
@bzim It's like using a hammer on a screw :) OK not. So it's like using the crankshaft of your car engine to drive a screw. There we go.Paraguay
How slow is this compared to regex? Regex is pretty fast.Pirate
@InsidiousForce, probably depends on which regular expression you use. Why don't you take one of the expressions from one of the answers to this question and benchmark it? :)Ollayos
One advantage is that I can easily replace ASCII with ISO 8859-1 or another encoding :)Bushwhack
We have a Foxpro DB that our system uses, that gets corrupted as a pastime. Since this function is run on almost every field of every row I was curious to know the performance difference and if there was anything better than plain regexp. For 1,000 randomly generated unicode strings the run times are Regexp: Avg: 3~4ms, Max: 4ms and Encoding Conversion: Avg: 4~5ms, Max: 7ms (not including string generation, that is outside the timer)Lockman
@Lockman Interesting. This technique could probably be optimized, depending on what's taking time. The 2 Fallback instances could be re-used, for example.Ollayos
I'm finding this to be faster than the regex on smaller strings (they are nearly even on 1000 character string) and slower on larger stringsShu
Wondering if I could use this somehow to replace non-ascii characters with replacement character. for example: á would be replaced with a. Is this possible?Reinhart
@RageCompex The EncoderReplacementFallback wasn't designed for conversion. But what you want can be achieved using the .NET APIs for Unicode Normalization and Canonicalization.Ollayos
63

I believe MonsCamus meant:

parsememo = Regex.Replace(parsememo, @"[^\u0020-\u007E]", string.Empty);
Maclaine answered 2/8, 2013 at 13:31 Comment(1)
IMHO This answer is better than the accepted answer because it strips out control characters.Cothurnus
18

If you don't want to strip the characters but instead convert accented Latin characters to their non-accented equivalents, take a look at this question: How do I translate 8bit characters into 7bit characters? (i.e. Ü to U)
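One common approach there (also used by the Latinize answer further down) is Unicode normalization: decompose to FormD and drop the combining marks. A rough sketch, not a full transliteration (it won't handle characters like ø or ß):

using System.Globalization;
using System.Text;

static string RemoveDiacritics(string input)
{
    // Decompose "é" into "e" + combining acute accent, then drop the accent marks.
    string decomposed = input.Normalize(NormalizationForm.FormD);
    var sb = new StringBuilder(decomposed.Length);
    foreach (char c in decomposed)
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    return sb.ToString().Normalize(NormalizationForm.FormC);
}

// RemoveDiacritics("Ünïcödé") returns "Unicode".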

Electrodialysis answered 5/4, 2012 at 22:30 Comment(1)
I didn't even realize this was possible, but it's a much better solution for me. I'm going to add this link to a comment on the question to make it easier for other people to find. Thanks!Veinule
13

Inspired by philcruz's Regular Expression solution, I've made a pure LINQ solution

public static string PureAscii(this string source, char nil = ' ')
{
    var min = '\u0000';
    var max = '\u007F';
    return source.Select(c => c < min ? nil : c > max ? nil : c).ToText();
}

public static string ToText(this IEnumerable<char> source)
{
    var buffer = new StringBuilder();
    foreach (var c in source)
        buffer.Append(c);
    return buffer.ToString();
}

This is untested code.
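A quick usage sketch (note that this variant replaces out-of-range characters with the nil character rather than removing them; the default nil is a space):

var spaced = "søme string".PureAscii();     // "s me string" – 'ø' becomes a space
var marked = "søme string".PureAscii('?');  // "s?me string"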

Commix answered 27/1, 2010 at 19:0 Comment(5)
Instead of the separate ToText() method, how about replacing line 3 of PureAscii() with: return new string(source.Select(c => c < min ? nil : c > max ? nil : c).ToArray());Lysenko
Or perhaps ToText as: return (new string(source)).ToArray() - depending on what performs best. It's still nice to have ToText as an extension method - fluent/pipeline style. :-)Commix
That code replaces non-ASCII characters with a space. To strip them out, change Select to Where: return new string( source.Where( c => c >= min && c <= max ).ToArray() );Tattle
@Tattle That code allows you to specify which character to replace the non-ASCII characters with. By default it uses a space, but if it's called like .PureASCII(Char.MinValue), it will replace all non-ASCII with '\0' - which still isn't exactly stripping them, but similar results.Vespiary
The ToText method can be removed, and line 5 can be replaced by: return source.Where(c => c >= min && c <= max).Aggregate(new StringBuilder(), (sb, s) => sb.Append(s), sb => sb.ToString());Westerman
5

I found the following slightly altered range useful for parsing comment blocks out of a database. It means you won't have to contend with tab and escape characters, which would otherwise break a CSV field.

parsememo = Regex.Replace(parsememo, @"[^\u001F-\u007F]", string.Empty);

If you want to exclude other special characters or particular punctuation, check the ASCII table.
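For example, a variant (illustrative only) that keeps the same range but also drops the comma and double quote, which fall inside the range and can still break a CSV field:

parsememo = Regex.Replace(parsememo, @"[^\u001F-\u007F]|[,""]", string.Empty);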

Ahola answered 1/10, 2012 at 10:2 Comment(1)
In case anyone hasn't noticed the other comments, the printable characters are actually @"[^\u0020-\u007E]". Here's a link to see the table if you're curious: asciitable.comKurtzig
5

No need for regex. Just use encoding:

sOutput = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(sInput));
Fourway answered 18/6, 2013 at 17:38 Comment(3)
This does not work. This does not strip unicode characters, it replaces them with the ? character.Director
@Director is right. At least I got ????nacho?? when I tried: たまねこnachoなち in mono 3.4Antimatter
You can instantiate your own Encoding class that instead of replacing characters it removes them. See the GetEncoding method: msdn.microsoft.com/en-us/library/89856k4b(v=vs.110).aspxQuilmes
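
Following that comment, a sketch of the same one-liner with a custom ASCII encoding whose encoder fallback drops unencodable characters instead of turning them into '?':

var asciiDropFallback = System.Text.Encoding.GetEncoding(
    "us-ascii",
    new System.Text.EncoderReplacementFallback(string.Empty),
    new System.Text.DecoderExceptionFallback());

sOutput = asciiDropFallback.GetString(asciiDropFallback.GetBytes(sInput));
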
5

I came here looking for a solution for extended ASCII characters, but couldn't find one. The closest I found is bzlm's solution, but that only works for ASCII codes up to 127 (you can obviously replace the encoding type in his code, but I found it a bit complex to understand, hence sharing this version). Here's a solution that works for extended ASCII codes, i.e. up to 255, which is ISO 8859-1.

It finds and strips out characters above the extended ASCII range (code points greater than 255).

Dim str1 as String= "â, ??î or ôu🕧� n☁i✑💴++$-💯♓!🇪🚑🌚‼⁉4⃣od;/⏬'®;😁☕😁:☝)😁😁///😍1!@#"

Dim extendedAscii As Encoding = Encoding.GetEncoding("ISO-8859-1", 
                                                New EncoderReplacementFallback(String.Empty),
                                                New DecoderReplacementFallback())

Dim extendedAsciiBytes() As Byte = extendedAscii.GetBytes(str1)

Dim str2 As String = extendedAscii.GetString(extendedAsciiBytes)

Console.WriteLine(str2)
'Output : â, ??î or ôu ni++$-!‼⁉4od;/';:)///1!@#$%^yz:

Here's a working fiddle for the code

Replace the encoding as per your requirement; the rest should remain the same.
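For C# projects, an equivalent sketch of the same ISO-8859-1 round trip (str1 being the input string from above):

// Characters above code point 255 are dropped by the empty-string encoder fallback.
Encoding extendedAscii = Encoding.GetEncoding("ISO-8859-1",
    new EncoderReplacementFallback(string.Empty),
    new DecoderReplacementFallback());

byte[] extendedAsciiBytes = extendedAscii.GetBytes(str1);
string str2 = extendedAscii.GetString(extendedAsciiBytes);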

Lim answered 11/10, 2016 at 21:38 Comment(1)
The only one that worked to remove ONLY the Ω from this string "Ω c ç ã". Thank you very much!Ecumenism
3

This is not optimal performance-wise, but it is a pretty straightforward LINQ approach:

string strippedString = new string(
    yourString.Where(c => c <= sbyte.MaxValue).ToArray()
    );

The downside is that all the "surviving" characters are first put into an array of type char[] which is then thrown away after the string constructor no longer uses it.
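If that intermediate allocation matters, on newer runtimes (.NET Core 2.1+ / .NET 5+) the surviving characters can be written straight into the result with string.Create – a sketch assuming a simple two-pass approach (StripNonAscii is just an illustrative name):

static string StripNonAscii(string input)
{
    // First pass: count the kept characters so the result can be allocated at its final length.
    int kept = 0;
    foreach (char c in input)
        if (c <= sbyte.MaxValue) kept++;

    // Second pass: copy the kept characters directly into the new string's buffer.
    return string.Create(kept, input, (buffer, source) =>
    {
        int i = 0;
        foreach (char c in source)
            if (c <= sbyte.MaxValue) buffer[i++] = c;
    });
}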

Calamitous answered 3/9, 2013 at 17:8 Comment(0)
1

I used this regular expression:

string s = "søme string";
Regex regex = new Regex(@"[^a-zA-Z0-9\s]", RegexOptions.None);
return regex.Replace(s, "");
Cochise answered 12/6, 2012 at 12:27 Comment(1)
This removes punctuation as well, just in case that's not what someone wants.Ungava
1

I use this regular expression to filter out bad characters in a filename.

Regex.Replace(directory, @"[^a-zA-Z0-9\\:_\- ]", "")

That should be all the characters allowed for filenames.

Scalpel answered 9/6, 2017 at 18:17 Comment(2)
Nope. See Path.GetInvalidPathChars and Path.GetInvalidFileNameChars. So, there are tens of thousands of valid characters.Clerkly
You are correct, Tom. I was actually thinking of the common ones, but I left out parenthesis and curly braces as well as all these - ^%$#@!&+=.Scalpel
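
Following those comments, a sketch that removes only the characters the OS actually forbids in file names, instead of whitelisting a small set (fileName is an illustrative variable; Where and Contains need using System.Linq):

char[] invalid = System.IO.Path.GetInvalidFileNameChars();
string safeName = new string(fileName.Where(c => !invalid.Contains(c)).ToArray());
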
1
public string ReturnCleanASCII(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (char c in s)
    {
        if ((int)c > 127) // you probably don't want 127 either
            continue;
        if ((int)c < 32)  // I bet you don't want control characters
            continue;
        if (c == '%')
            continue;
        if (c == '?')
            continue;
        sb.Append(c);
    }
    return sb.ToString();
}
Dappled answered 27/7, 2022 at 8:18 Comment(0)
1

I did a bit of testing, and @bzlm's answer is the fastest valid answer. But it turns out we can go much faster. The conversion using encoding is equivalent to the following code once Encoding.Convert is inlined:

public static string StripUnicode(string unicode) {
    Encoding dstEncoding = GreedyAscii;
    Encoding srcEncoding = Encoding.UTF8;
    return dstEncoding.GetString(dstEncoding.GetBytes(srcEncoding.GetChars(srcEncoding.GetBytes(unicode))));
}

As you can clearly see, we perform two redundant actions by re-encoding through UTF-8. Why is that, you may ask? C# always stores strings as UTF-16; the same characters can of course also be represented as UTF-8, since the Unicode encodings are intercompatible. (Side note: @bzlm's solution can break UTF-16 characters, which may throw an exception during transcoding.) In short, the operation is independent of the source encoding, since the source is always UTF-16.

Let's get rid of the redundant re-encoding and prevent the edge-case failures.

public static string StripUnicode(string unicode) {
    Encoding dstEncoding = GreedyAscii;
    return dstEncoding.GetString(dstEncoding.GetBytes(unicode));
}

We already have a simplified and perfectly workable solution, which requires less than half as much time to compute.

There is not much more performance to be gained, but for further memory optimization we can do two things:

  1. Accept a ReadOnlySpan<char> for a more usable API.
  2. Attempt to fit the temporary byte[] onto the stack; otherwise use an array pool.
public static string StripUnicode(ReadOnlySpan<char> unicode) {
    return EnsureEncoding(unicode, GreedyAscii);
}

/// <summary>Produces a string which is compatible with the limiting encoding</summary>
/// <remarks>Ensure that the encoding does not throw on illegal characters</remarks>
public static string EnsureEncoding(ReadOnlySpan<char> unicode, Encoding limitEncoding) {
    int asciiBytesLength = limitEncoding.GetMaxByteCount(unicode.Length);
    byte[]? asciiBytes = asciiBytesLength <= 2048 ? null : ArrayPool<byte>.Shared.Rent(asciiBytesLength);
    Span<byte> asciiSpan = asciiBytes ?? stackalloc byte[asciiBytesLength];

    asciiBytesLength = limitEncoding.GetBytes(unicode, asciiSpan);
    asciiSpan = asciiSpan.Slice(0, asciiBytesLength);

    string asciiChars = limitEncoding.GetString(asciiSpan);
    if (asciiBytes is { }) {
        ArrayPool<byte>.Shared.Return(asciiBytes);
    }

    return asciiChars;
}

private static Encoding GreedyAscii { get; } = Encoding.GetEncoding(Encoding.ASCII.EncodingName, new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());

You can see this snippet in action on sharplab.io

Radiculitis answered 26/2, 2023 at 15:51 Comment(0)
0

If you want a string containing only ISO-8859-1 characters, excluding the characters that are not defined in that code page, you should use this expression:

var result = Regex.Replace(value, @"[^\u0020-\u007E\u00A0-\u00FF]+", string.Empty);

Note: the Encoding.GetEncoding("ISO-8859-1") method will not do the job, because the undefined characters are not excluded.

.Net Fiddle sample

Wikipedia ISO-8859-1 code page for more details.

Accident answered 16/7, 2022 at 11:12 Comment(0)
0

Just decode the Unicode by using Regex.Unescape(s).

Bayou answered 10/3, 2023 at 8:20 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Tierell
0

You can use Char.IsAscii to identify the characters you want to keep. A simple implementation might look like:

public static string StripNonAscii(this string input)
{
    StringBuilder resultBuilder = new();
    foreach (char character in input)
        if (char.IsAscii(character))
            resultBuilder.Append(character);
    return resultBuilder.ToString();
}
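
A more compact equivalent using LINQ (a sketch; char.IsAscii needs .NET 6 or later, and Where needs using System.Linq):

string stripped = string.Concat(input.Where(char.IsAscii));
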
Deactivate answered 23/3, 2023 at 19:44 Comment(0)
-1

Necromancing.
Also, the method by bzlm can be used to remove characters that are not in an arbitrary charset, not just ASCII:

// https://en.wikipedia.org/wiki/Code_page#EBCDIC-based_code_pages
// https://en.wikipedia.org/wiki/Windows_code_page#East_Asian_multi-byte_code_pages
// https://en.wikipedia.org/wiki/Chinese_character_encoding
System.Text.Encoding encRemoveAllBut = System.Text.Encoding.ASCII;
encRemoveAllBut = System.Text.Encoding.GetEncoding(System.Globalization.CultureInfo.InstalledUICulture.TextInfo.ANSICodePage); // System-encoding
encRemoveAllBut = System.Text.Encoding.GetEncoding(1252); // Western European (iso-8859-1)
encRemoveAllBut = System.Text.Encoding.GetEncoding(1251); // Windows-1251/KOI8-R
encRemoveAllBut = System.Text.Encoding.GetEncoding("ISO-8859-5"); // used by less than 0.1% of websites
encRemoveAllBut = System.Text.Encoding.GetEncoding(37); // IBM EBCDIC US-Canada
encRemoveAllBut = System.Text.Encoding.GetEncoding(500); // IBM EBCDIC Latin 1
encRemoveAllBut = System.Text.Encoding.GetEncoding(936); // Chinese Simplified
encRemoveAllBut = System.Text.Encoding.GetEncoding(950); // Chinese Traditional
encRemoveAllBut = System.Text.Encoding.ASCII; // putting ASCII again, as to answer the question 

// https://mcmap.net/q/11561/-how-can-you-strip-non-ascii-characters-from-a-string-in-c
string inputString = "RäksmörПривет, мирgås";
string asAscii = encRemoveAllBut.GetString(
    System.Text.Encoding.Convert(
        System.Text.Encoding.UTF8,
        System.Text.Encoding.GetEncoding(
            encRemoveAllBut.CodePage,
            new System.Text.EncoderReplacementFallback(string.Empty),
            new System.Text.DecoderExceptionFallback()
            ),
        System.Text.Encoding.UTF8.GetBytes(inputString)
    )
);

System.Console.WriteLine(asAscii);

AND for those who just want to remove the accents:
(caution, because Normalize != Latinize != Romanize)

// string str = Latinize("(æøå âôû?aè");
public static string Latinize(string stIn)
{
    // Special treatment for German Umlauts
    stIn = stIn.Replace("ä", "ae");
    stIn = stIn.Replace("ö", "oe");
    stIn = stIn.Replace("ü", "ue");

    stIn = stIn.Replace("Ä", "Ae");
    stIn = stIn.Replace("Ö", "Oe");
    stIn = stIn.Replace("Ü", "Ue");
    // End special treatment for German Umlauts

    string stFormD = stIn.Normalize(System.Text.NormalizationForm.FormD);
    System.Text.StringBuilder sb = new System.Text.StringBuilder();

    for (int ich = 0; ich < stFormD.Length; ich++)
    {
        System.Globalization.UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);

        if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)
        {
            sb.Append(stFormD[ich]);
        } // End if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)

    } // Next ich


    //return (sb.ToString().Normalize(System.Text.NormalizationForm.FormC));
    return (sb.ToString().Normalize(System.Text.NormalizationForm.FormKC));
} // End Function Latinize
Annabal answered 7/1, 2021 at 0:19 Comment(0)
