How can you strip non-ASCII characters from a string? (in C#)
Asked Answered

17

282

How can you strip non-ASCII characters from a string? (in C#)

Earthward answered 23/9, 2008 at 19:45 Comment(1)
Per sinelaw's answer below, if you instead want to replace non-ASCII characters, see this answer instead.Veinule
495
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

The ^ is the negation operator: it tells the regex to find everything that doesn't match the range, instead of everything that does. The \u####-\u#### syntax specifies which characters to match. \u0000-\u007F is equivalent to the first 128 characters of UTF-8 or Unicode, which are always the ASCII characters. So the expression matches every non-ASCII character (because of the negation) and replaces each match with the empty string.

(as explained in a comment by Gordon Tucker Dec 11, 2009 at 21:11)
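If you call this in a loop, it may be worth wrapping it in a small helper with a precompiled pattern (a sketch; the AsciiFilter/StripNonAscii names and RegexOptions.Compiled are just illustrative choices):

using System.Text.RegularExpressions;

static class AsciiFilter
{
    private static readonly Regex NonAscii = new Regex(@"[^\u0000-\u007F]+", RegexOptions.Compiled);

    // Removes (does not transliterate) every character outside the ASCII range.
    public static string StripNonAscii(string input) => NonAscii.Replace(input, string.Empty);
}

// AsciiFilter.StripNonAscii("søme string") returns "sme string".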

Earthward answered 23/9, 2008 at 19:46 Comment(7)
Range for printable characters is 0020-007E, for people looking for regular expression to replace non-printable charactersGodewyn
If you wish to see a table of the ASCII character set: asciitable.comKurtzig
Range for extended ASCII is \u0000-\u00FF, for people looking for regular expression to replace non extended ASCII characters (i.e. for apps with Spanish language, diacritics etc...)Associative
@GordonTucker \u0000-\u007F is the equivalent of the first 127 characters in utf-8 or unicode and NOT the first 255. See tableAssociative
@Associative Which is why I replied to myself about a minute later correcting myself to say it was 127 and not 255. :)Placement
But 0000-0010 also contains non-printable characters like NUL, SOH, STX etc.Creighton
what about this Regex.Replace(str, @"\p{C}+", string.Empty);Creighton
160

Here is a pure .NET solution that doesn't use regular expressions:

string inputString = "Räksmörgås";
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
            ),
        Encoding.UTF8.GetBytes(inputString)
    )
);

It may look cumbersome, but it is fairly intuitive. It uses the .NET ASCII encoding to convert the string. UTF-8 is used as the intermediate encoding because it can represent any of the original characters, and an EncoderReplacementFallback with an empty replacement string drops every non-ASCII character during the conversion.
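Wrapped up as a helper with the fallback encoding cached, it is easier to reuse (a sketch; AsciiConverter/ToAscii are just illustrative names):

using System.Text;

static class AsciiConverter
{
    // ASCII encoding whose encoder fallback is an empty string, so every
    // non-ASCII character is simply dropped during conversion.
    private static readonly Encoding AsciiWithDrop = Encoding.GetEncoding(
        Encoding.ASCII.EncodingName,
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    public static string ToAscii(string input) =>
        AsciiWithDrop.GetString(
            Encoding.Convert(Encoding.UTF8, AsciiWithDrop, Encoding.UTF8.GetBytes(input)));
}

// AsciiConverter.ToAscii("Räksmörgås") returns "Rksmrgs".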

Ollayos answered 25/9, 2008 at 19:32 Comment(12)
Perfect! I'm using this to clean a string before saving it to a RTF document. Very much appreciated. Much easier to understand than the Regex version.Vedis
You really find it easier to understand? To me, all the stuff that's not really relevant (fallbacks, conversions to bytes etc) is drawing the attention away from what actually happens.Ollayos
@Brandon, actually, this technique doesn't do the job better than other techniques. So the analogy would be using a plain olde screwdriver instead of a fancy iScrewDriver Deluxe 2000. :)Ollayos
@bzim It's like using a hammer on a screw :) OK not. So it's like using the crankshaft of your car engine to drive a screw. There we go.Paraguay
How slow is this compared to regex? Regex is pretty fast.Pirate
@InsidiousForce, probably depends on which regular expression you use. Why don't you take one of the expressions from one of the answers to this question and benchmark it? :)Ollayos
One advantage is that I can easily replace ASCII with ISO 8859-1 or another encoding :)Bushwhack
We have a Foxpro DB that our system uses, that gets corrupted as a pastime. Since this function is run on almost every field of every row I was curious to know the performance difference and if there was anything better than plain regexp. For 1,000 randomly generated unicode strings the run times are Regexp: Avg: 3~4ms, Max: 4ms and Encoding Conversion: Avg: 4~5ms, Max: 7ms (not including string generation, that is outside the timer)Lockman
@Lockman Interesting. This technique could probably be optimized, depending on what's taking time. The 2 Fallback instances could be re-used, for example.Ollayos
I'm finding this to be faster than the regex on smaller strings (they are nearly even on 1000 character string) and slower on larger stringsShu
Wondering if I could use this somehow to replace non-ascii characters with replacement character. for example: á would be replaced with a. Is this possible?Reinhart
@RageCompex The EncoderReplacementFallback wasn't designed for conversion. But what you want can be achieved using the .NET APIs for Unicode Normalization and Canonicalization.Ollayos
63

I believe MonsCamus meant:

parsememo = Regex.Replace(parsememo, @"[^\u0020-\u007E]", string.Empty);
Maclaine answered 2/8, 2013 at 13:31 Comment(1)
IMHO This answer is better than the accepted answer because it strips out control characters.Cothurnus
18

If you don't want to strip the characters but instead convert accented Latin characters to their non-accented equivalents, take a look at this question: How do I translate 8bit characters into 7bit characters? (i.e. Ü to U)
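One common approach there (also used by the Latinize answer further down) is Unicode normalization: decompose to FormD and drop the combining marks. A rough sketch, not a full transliteration (it won't handle characters like ø or ß):

using System.Globalization;
using System.Text;

static string RemoveDiacritics(string input)
{
    // Decompose "é" into "e" + combining acute accent, then drop the accent marks.
    string decomposed = input.Normalize(NormalizationForm.FormD);
    var sb = new StringBuilder(decomposed.Length);
    foreach (char c in decomposed)
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    return sb.ToString().Normalize(NormalizationForm.FormC);
}

// RemoveDiacritics("Ünïcödé") returns "Unicode".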

Electrodialysis answered 5/4, 2012 at 22:30 Comment(1)
I didn't even realize this was possible, but it's a much better solution for me. I'm going to add this link to a comment on the question to make it easier for other people to find. Thanks!Veinule
13

Inspired by philcruz's Regular Expression solution, I've made a pure LINQ solution

public static string PureAscii(this string source, char nil = ' ')
{
    var min = '\u0000';
    var max = '\u007F';
    return source.Select(c => c < min ? nil : c > max ? nil : c).ToText();
}

public static string ToText(this IEnumerable<char> source)
{
    var buffer = new StringBuilder();
    foreach (var c in source)
        buffer.Append(c);
    return buffer.ToString();
}

This is untested code.
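A quick usage sketch (note that this variant replaces out-of-range characters with the nil character rather than removing them; the default nil is a space):

var spaced = "søme string".PureAscii();     // "s me string" – 'ø' becomes a space
var marked = "søme string".PureAscii('?');  // "s?me string"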

Commix answered 27/1, 2010 at 19:0 Comment(5)
Instead of the separate ToText() method, how about replacing line 3 of PureAscii() with: return new string(source.Select(c => c < min ? nil : c > max ? nil : c).ToArray());Lysenko
Or perhaps ToText as: return (new string(source)).ToArray() - depending on what performs best. It's still nice to have ToText as an extension method - fluent/pipeline style. :-)Commix
That code replaces non-ASCII characters with a space. To strip them out, change Select to Where: return new string( source.Where( c => c >= min && c <= max ).ToArray() );Tattle
@Tattle That code allows you to specify which character to replace the non-ASCII characters with. By default it uses a space, but if it's called like .PureASCII(Char.MinValue), it will replace all non-ASCII with '\0' - which still isn't exactly stripping them, but similar results.Vespiary
The ToText method can be removed, and line 5 can be replaced by: return source.Where(c => c >= min && c <= max).Aggregate(new StringBuilder(), (sb, s) => sb.Append(s), sb => sb.ToString());Westerman
5

I found the following slightly altered range useful for parsing comment blocks out of a database. It means you won't have to contend with tab and escape characters, which would otherwise break a CSV field.

parsememo = Regex.Replace(parsememo, @"[^\u001F-\u007F]", string.Empty);

If you want to exclude other special characters or particular punctuation, check the ASCII table.
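For example, a variant (illustrative only) that keeps the same range but also drops the comma and double quote, which fall inside the range and can still break a CSV field:

parsememo = Regex.Replace(parsememo, @"[^\u001F-\u007F]|[,""]", string.Empty);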

Ahola answered 1/10, 2012 at 10:2 Comment(1)
In case anyone hasn't noticed the other comments, the printable characters are actually @"[^\u0020-\u007E]". Here's a link to see the table if you're curious: asciitable.comKurtzig
5

No need for regex. Just use encoding:

sOutput = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(sInput));
Fourway answered 18/6, 2013 at 17:38 Comment(3)
This does not work. This does not strip unicode characters, it replaces them with the ? character.Director
@Director is right. At least I got ????nacho?? when I tried: たまねこnachoなち in mono 3.4Antimatter
You can instantiate your own Encoding class that instead of replacing characters it removes them. See the GetEncoding method: msdn.microsoft.com/en-us/library/89856k4b(v=vs.110).aspxQuilmes
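
Following that comment, a sketch of the same one-liner with a custom ASCII encoding whose encoder fallback drops unencodable characters instead of turning them into '?':

var asciiDropFallback = System.Text.Encoding.GetEncoding(
    "us-ascii",
    new System.Text.EncoderReplacementFallback(string.Empty),
    new System.Text.DecoderExceptionFallback());

sOutput = asciiDropFallback.GetString(asciiDropFallback.GetBytes(sInput));
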
5

I came here looking for a solution for extended ASCII characters, but couldn't find one. The closest I found is bzlm's solution, but that only works for ASCII codes up to 127 (you can obviously replace the encoding type in his code, but I found it a bit complex to understand, hence sharing this version). Here's a solution that works for extended ASCII codes, i.e. up to 255, which is ISO 8859-1.

It finds and strips out characters above the extended ASCII range (code points greater than 255).

Dim str1 as String= "â, ??î or ôu🕧� n☁i✑💴++$-💯♓!🇪🚑🌚‼⁉4⃣od;/⏬'®;😁☕😁:☝)😁😁///😍1!@#"

Dim extendedAscii As Encoding = Encoding.GetEncoding("ISO-8859-1", 
                                                New EncoderReplacementFallback(String.Empty),
                                                New DecoderReplacementFallback())

Dim extendedAsciiBytes() As Byte = extendedAscii.GetBytes(str1)

Dim str2 As String = extendedAscii.GetString(extendedAsciiBytes)

Console.WriteLine(str2)
'Output : â, ??î or ôu ni++$-!‼⁉4od;/';:)///1!@#$%^yz:

Here's a working fiddle for the code

Replace the encoding as per your requirement; the rest should remain the same.
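For C# projects, an equivalent sketch of the same ISO-8859-1 round trip (str1 being the input string from above):

// Characters above code point 255 are dropped by the empty-string encoder fallback.
Encoding extendedAscii = Encoding.GetEncoding("ISO-8859-1",
    new EncoderReplacementFallback(string.Empty),
    new DecoderReplacementFallback());

byte[] extendedAsciiBytes = extendedAscii.GetBytes(str1);
string str2 = extendedAscii.GetString(extendedAsciiBytes);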

Lim answered 11/10, 2016 at 21:38 Comment(1)
The only one that worked to remove ONLY the Ω from this string "Ω c ç ã". Thank you very much!Ecumenism
3

This is not optimal performance-wise, but it is a pretty straightforward LINQ approach:

string strippedString = new string(
    yourString.Where(c => c <= sbyte.MaxValue).ToArray()
    );

The downside is that all the "surviving" characters are first put into an array of type char[] which is then thrown away after the string constructor no longer uses it.
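If that intermediate allocation matters, on newer runtimes (.NET Core 2.1+ / .NET 5+) the surviving characters can be written straight into the result with string.Create – a sketch assuming a simple two-pass approach (StripNonAscii is just an illustrative name):

static string StripNonAscii(string input)
{
    // First pass: count the kept characters so the result can be allocated at its final length.
    int kept = 0;
    foreach (char c in input)
        if (c <= sbyte.MaxValue) kept++;

    // Second pass: copy the kept characters directly into the new string's buffer.
    return string.Create(kept, input, (buffer, source) =>
    {
        int i = 0;
        foreach (char c in source)
            if (c <= sbyte.MaxValue) buffer[i++] = c;
    });
}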

Calamitous answered 3/9, 2013 at 17:8 Comment(0)
1

I used this regular expression:

string s = "søme string";
Regex regex = new Regex(@"[^a-zA-Z0-9\s]", RegexOptions.None);
return regex.Replace(s, "");
Cochise answered 12/6, 2012 at 12:27 Comment(1)
This removes punctuation as well, just in case that's not what someone wants.Ungava
1

I use this regular expression to filter out bad characters in a filename.

Regex.Replace(directory, @"[^a-zA-Z0-9\\:_\- ]", "")

That should be all the characters allowed for filenames.

Scalpel answered 9/6, 2017 at 18:17 Comment(2)
Nope. See Path.GetInvalidPathChars and Path.GetInvalidFileNameChars. So, there are tens of thousands of valid characters.Clerkly
You are correct, Tom. I was actually thinking of the common ones, but I left out parenthesis and curly braces as well as all these - ^%$#@!&+=.Scalpel
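
Following those comments, a sketch that removes only the characters the OS actually forbids in file names, instead of whitelisting a small set (fileName is an illustrative variable; Where and Contains need using System.Linq):

char[] invalid = System.IO.Path.GetInvalidFileNameChars();
string safeName = new string(fileName.Where(c => !invalid.Contains(c)).ToArray());
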
1
public string ReturnCleanASCII(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (char c in s)
    {
        if ((int)c > 127) // you probably don't want 127 either
            continue;
        if ((int)c < 32)  // I bet you don't want control characters
            continue;
        if (c == '%')
            continue;
        if (c == '?')
            continue;
        sb.Append(c);
    }
    return sb.ToString();
}
Dappled answered 27/7, 2022 at 8:18 Comment(0)
1

I did a bit of testing, and @bzlm's answer is the fastest valid answer. But it turns out we can go much faster. The conversion using encoding is equivalent to the following code once Encoding.Convert is inlined:

public static string StripUnicode(string unicode) {
    Encoding dstEncoding = GreedyAscii;
    Encoding srcEncoding = Encoding.UTF8;
    return dstEncoding.GetString(dstEncoding.GetBytes(srcEncoding.GetChars(srcEncoding.GetBytes(unicode))));
}

As you can clearly see, we perform two redundant actions by re-encoding through UTF-8. Why is that, you may ask? C# always stores strings as UTF-16; the same characters can of course also be represented as UTF-8, since the Unicode encodings are intercompatible. (Side note: @bzlm's solution can break UTF-16 characters, which may throw an exception during transcoding.) In short, the operation is independent of the source encoding, since the source is always UTF-16.

Let's get rid of the redundant re-encoding and prevent the edge-case failures.

public static string StripUnicode(string unicode) {
    Encoding dstEncoding = GreedyAscii;
    return dstEncoding.GetString(dstEncoding.GetBytes(unicode));
}

We already have a simplified and perfectly workable solution, which requires less than half as much time to compute.

There is not much more performance to be gained, but for further memory optimization we can do two things:

  1. Accept a ReadOnlySpan<char> for a more usable API.
  2. Attempt to fit the temporary byte[] onto the stack; otherwise use an array pool.
public static string StripUnicode(ReadOnlySpan<char> unicode) {
    return EnsureEncoding(unicode, GreedyAscii);
}

/// <summary>Produces a string which is compatible with the limiting encoding</summary>
/// <remarks>Ensure that the encoding does not throw on illegal characters</remarks>
public static string EnsureEncoding(ReadOnlySpan<char> unicode, Encoding limitEncoding) {
    int asciiBytesLength = limitEncoding.GetMaxByteCount(unicode.Length);
    byte[]? asciiBytes = asciiBytesLength <= 2048 ? null : ArrayPool<byte>.Shared.Rent(asciiBytesLength);
    Span<byte> asciiSpan = asciiBytes ?? stackalloc byte[asciiBytesLength];

    asciiBytesLength = limitEncoding.GetBytes(unicode, asciiSpan);
    asciiSpan = asciiSpan.Slice(0, asciiBytesLength);

    string asciiChars = limitEncoding.GetString(asciiSpan);
    if (asciiBytes is { }) {
        ArrayPool<byte>.Shared.Return(asciiBytes);
    }

    return asciiChars;
}

private static Encoding GreedyAscii { get; } = Encoding.GetEncoding(Encoding.ASCII.EncodingName, new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());

You can see this snippet in action on sharplab.io

Radiculitis answered 26/2, 2023 at 15:51 Comment(0)
0

If you want a string containing only ISO-8859-1 characters, excluding the characters that are not defined in that code page, you should use this expression:

var result = Regex.Replace(value, @"[^\u0020-\u007E\u00A0-\u00FF]+", string.Empty);

Note: the Encoding.GetEncoding("ISO-8859-1") method will not do the job, because the undefined characters are not excluded.

.Net Fiddle sample

Wikipedia ISO-8859-1 code page for more details.

Accident answered 16/7, 2022 at 11:12 Comment(0)
0

Just decode the Unicode by using Regex.Unescape(s).

Bayou answered 10/3, 2023 at 8:20 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Tierell
0

You can use Char.IsAscii to identify the characters you want to keep. A simple implementation might look like:

public static string StripNonAscii(this string input)
{
    StringBuilder resultBuilder = new();
    foreach (char character in input)
        if (char.IsAscii(character))
            resultBuilder.Append(character);
    return resultBuilder.ToString();
}
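
A more compact equivalent using LINQ (a sketch; char.IsAscii needs .NET 6 or later, and Where needs using System.Linq):

string stripped = string.Concat(input.Where(char.IsAscii));
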
Deactivate answered 23/3, 2023 at 19:44 Comment(0)
-1

Necromancing.
Also, the method by bzlm can be used to remove characters that are not in an arbitrary charset, not just ASCII:

// https://en.wikipedia.org/wiki/Code_page#EBCDIC-based_code_pages
// https://en.wikipedia.org/wiki/Windows_code_page#East_Asian_multi-byte_code_pages
// https://en.wikipedia.org/wiki/Chinese_character_encoding
System.Text.Encoding encRemoveAllBut = System.Text.Encoding.ASCII;
encRemoveAllBut = System.Text.Encoding.GetEncoding(System.Globalization.CultureInfo.InstalledUICulture.TextInfo.ANSICodePage); // System-encoding
encRemoveAllBut = System.Text.Encoding.GetEncoding(1252); // Western European (iso-8859-1)
encRemoveAllBut = System.Text.Encoding.GetEncoding(1251); // Windows-1251/KOI8-R
encRemoveAllBut = System.Text.Encoding.GetEncoding("ISO-8859-5"); // used by less than 0.1% of websites
encRemoveAllBut = System.Text.Encoding.GetEncoding(37); // IBM EBCDIC US-Canada
encRemoveAllBut = System.Text.Encoding.GetEncoding(500); // IBM EBCDIC Latin 1
encRemoveAllBut = System.Text.Encoding.GetEncoding(936); // Chinese Simplified
encRemoveAllBut = System.Text.Encoding.GetEncoding(950); // Chinese Traditional
encRemoveAllBut = System.Text.Encoding.ASCII; // putting ASCII again, as to answer the question 

// https://mcmap.net/q/11561/-how-can-you-strip-non-ascii-characters-from-a-string-in-c
string inputString = "RäksmörПривет, мирgås";
string asAscii = encRemoveAllBut.GetString(
    System.Text.Encoding.Convert(
        System.Text.Encoding.UTF8,
        System.Text.Encoding.GetEncoding(
            encRemoveAllBut.CodePage,
            new System.Text.EncoderReplacementFallback(string.Empty),
            new System.Text.DecoderExceptionFallback()
            ),
        System.Text.Encoding.UTF8.GetBytes(inputString)
    )
);

System.Console.WriteLine(asAscii);

AND for those who just want to remove the accents:
(caution, because Normalize != Latinize != Romanize)

// string str = Latinize("(æøå âôû?aè");
public static string Latinize(string stIn)
{
    // Special treatment for German Umlauts
    stIn = stIn.Replace("ä", "ae");
    stIn = stIn.Replace("ö", "oe");
    stIn = stIn.Replace("ü", "ue");

    stIn = stIn.Replace("Ä", "Ae");
    stIn = stIn.Replace("Ö", "Oe");
    stIn = stIn.Replace("Ü", "Ue");
    // End special treatment for German Umlauts

    string stFormD = stIn.Normalize(System.Text.NormalizationForm.FormD);
    System.Text.StringBuilder sb = new System.Text.StringBuilder();

    for (int ich = 0; ich < stFormD.Length; ich++)
    {
        System.Globalization.UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);

        if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)
        {
            sb.Append(stFormD[ich]);
        } // End if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)

    } // Next ich


    //return (sb.ToString().Normalize(System.Text.NormalizationForm.FormC));
    return (sb.ToString().Normalize(System.Text.NormalizationForm.FormKC));
} // End Function Latinize
Annabal answered 7/1, 2021 at 0:19 Comment(0)
