How can you strip non-ASCII characters from a string? (in C#)
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
The ^ is the not operator: it tells the regex to match everything that doesn't fall in the range instead of everything that does. The \u####-\u#### part specifies which characters match, and \u0000-\u007F is equivalent to the first 128 characters of UTF-8 or Unicode, which are always the ASCII characters. So the pattern matches every non-ASCII character (because of the not), and the replace removes everything that matched.
(as explained in a comment by Gordon Tucker Dec 11, 2009 at 21:11)
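Putting the pattern to work, a minimal check (the ø, U+00F8, falls outside \u0000-\u007F and is removed):

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s = "søme string";
        // Everything outside \u0000-\u007F is matched and replaced with nothing
        s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
        Console.WriteLine(s); // sme string
    }
}
```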
Here is a pure .NET solution that doesn't use regular expressions:
string inputString = "Räksmörgås";
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            Encoding.ASCII.EncodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
        ),
        Encoding.UTF8.GetBytes(inputString)
    )
);
It may look cumbersome, but it is intuitive: it uses the .NET ASCII encoding to convert the string. UTF-8 is used during the conversion because it can represent any of the original characters, and an EncoderReplacementFallback converts every non-ASCII character to an empty string.
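As a runnable sketch of the same conversion, with the accented characters dropped rather than replaced:

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        string inputString = "Räksmörgås";
        // The empty replacement fallback silently drops every character
        // that the ASCII encoder cannot represent
        string asAscii = Encoding.ASCII.GetString(
            Encoding.Convert(
                Encoding.UTF8,
                Encoding.GetEncoding(
                    Encoding.ASCII.EncodingName,
                    new EncoderReplacementFallback(string.Empty),
                    new DecoderExceptionFallback()),
                Encoding.UTF8.GetBytes(inputString)));
        Console.WriteLine(asAscii); // Rksmrgs
    }
}
```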
Regex: Avg: 3~4 ms, Max: 4 ms; Encoding conversion: Avg: 4~5 ms, Max: 7 ms (not including string generation, which is outside the timer) – Lockman
What if, instead of being stripped, á would be replaced with a? Is this possible? – Reinhart
I believe MonsCamus meant:
parsememo = Regex.Replace(parsememo, @"[^\u0020-\u007E]", string.Empty);
If you want not to strip, but to actually convert Latin accented characters to non-accented ones, take a look at this question: How do I translate 8bit characters into 7bit characters? (i.e. Ü to U)
Inspired by philcruz's Regular Expression solution, I've made a pure LINQ solution
public static string PureAscii(this string source, char nil = ' ')
{
    var min = '\u0000';
    var max = '\u007F';
    return source.Select(c => c < min ? nil : c > max ? nil : c).ToText();
}

public static string ToText(this IEnumerable<char> source)
{
    var buffer = new StringBuilder();
    foreach (var c in source)
        buffer.Append(c);
    return buffer.ToString();
}
This is untested code.
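A quick way to exercise it: the sketch below inlines the two extension methods above and checks the default behaviour, which replaces each non-ASCII character with the nil character (a space) rather than removing it:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class PureAsciiExtensions
{
    public static string PureAscii(this string source, char nil = ' ')
    {
        var min = '\u0000';
        var max = '\u007F';
        return source.Select(c => c < min ? nil : c > max ? nil : c).ToText();
    }

    public static string ToText(this IEnumerable<char> source)
    {
        var buffer = new StringBuilder();
        foreach (var c in source)
            buffer.Append(c);
        return buffer.ToString();
    }
}

class Program
{
    static void Main()
    {
        // ø is outside \u0000-\u007F, so it becomes the nil character (a space)
        Console.WriteLine("søme string".PureAscii()); // s me string
    }
}
```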
return new string(source.Where(c => c >= min && c <= max).ToArray()); – Tattle
return source.Where(c => c >= min && c <= max).Aggregate(new StringBuilder(), (sb, s) => sb.Append(s), sb => sb.ToString()); – Westerman
I found the following slightly altered range useful for parsing comment blocks out of a database; it means you won't have to contend with the tab and escape characters that would cause a CSV field to become upset.
parsememo = Regex.Replace(parsememo, @"[^\u001F-\u007F]", string.Empty);
If you want to avoid other special characters or particular punctuation, check the ASCII table.
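To illustrate the altered range: tab (\u0009) and the CR/LF characters all fall below \u001F, so they are stripped along with the non-ASCII characters:

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Tab, carriage return and line feed are below \u001F, so they are removed
        string memo = "first\tsecond\r\nthird";
        string parsememo = Regex.Replace(memo, @"[^\u001F-\u007F]", string.Empty);
        Console.WriteLine(parsememo); // firstsecondthird
    }
}
```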
No need for regex. Just use encoding:
sOutput = System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(sInput));
That gave me ????nacho?? when I tried たまねこnachoなち in Mono 3.4 – Antimatter
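That output is expected on any runtime, not just Mono: Encoding.ASCII uses '?' as its default replacement fallback, so the round trip substitutes rather than strips. A minimal check:

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        string sInput = "たまねこnachoなち";
        // The default ASCII encoder fallback replaces each non-ASCII char with '?'
        string sOutput = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(sInput));
        Console.WriteLine(sOutput); // ????nacho??
    }
}
```

To strip instead of substitute, use the GetEncoding overload with an EncoderReplacementFallback(string.Empty), as in bzlm's answer above.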
I came here looking for a solution for extended ASCII characters, but couldn't find one. The closest I found is bzlm's solution, but that works only for ASCII codes up to 127 (obviously you can replace the encoding type in his code, but I think it was a bit complex to understand; hence, sharing this version). Here's a solution that works for extended ASCII codes, i.e. up to 255, which is ISO 8859-1.
It finds and strips out characters beyond the extended ASCII range (code points greater than 255).
Dim str1 As String = "â, ??î or ôu🕧� n☁i✑💴++$-💯♓!🇪🚑🌚‼⁉4⃣od;/⏬'®;😁☕😁:☝)😁😁///😍1!@#"
Dim extendedAscii As Encoding = Encoding.GetEncoding("ISO-8859-1",
    New EncoderReplacementFallback(String.Empty),
    New DecoderReplacementFallback())
Dim extendedAsciiBytes() As Byte = extendedAscii.GetBytes(str1)
Dim str2 As String = extendedAscii.GetString(extendedAsciiBytes)
Console.WriteLine(str2)
'Output : â, ??î or ôu ni++$-!‼⁉4od;/';:)///1!@#$%^yz:
Here's a working fiddle for the code. Replace the encoding as per the requirement; the rest should remain the same.
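For C# readers, a sketch of the same idea (same encoding and fallbacks, a shorter input chosen here for clarity): characters up to U+00FF survive, everything else is dropped.

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // ISO-8859-1 covers code points up to 255; anything beyond is dropped
        // because of the empty encoder replacement fallback
        Encoding extendedAscii = Encoding.GetEncoding("ISO-8859-1",
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback());
        string str1 = "âôû™😁";
        string str2 = extendedAscii.GetString(extendedAscii.GetBytes(str1));
        Console.WriteLine(str2); // âôû
    }
}
```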
This is not optimal performance-wise, but a pretty straightforward LINQ approach:
string strippedString = new string(
yourString.Where(c => c <= sbyte.MaxValue).ToArray()
);
The downside is that all the "surviving" characters are first put into an array of type char[], which is then thrown away once the string constructor no longer uses it.
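If that intermediate array bothers you, a StringBuilder-based variant (a sketch, not benchmarked; the name StripNonAscii is mine) avoids it:

```csharp
using System;
using System.Text;

class Program
{
    static string StripNonAscii(string yourString)
    {
        var sb = new StringBuilder(yourString.Length);
        foreach (char c in yourString)
            if (c <= sbyte.MaxValue) // 127, the last ASCII code point
                sb.Append(c);
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(StripNonAscii("søme string")); // sme string
    }
}
```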
I used this regex:
string s = "søme string";
Regex regex = new Regex(@"[^a-zA-Z0-9\s]", RegexOptions.None);
return regex.Replace(s, "");
Note that this strips more than non-ASCII: it also removes ASCII punctuation.
I use this regular expression to filter out bad characters in a filename.
Regex.Replace(directory, @"[^a-zA-Z0-9\\:_\- ]", "")
That should be all the characters allowed for filenames.
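That pattern is a whitelist, so it also drops legal filename characters such as periods. An alternative is to blacklist only what the OS actually forbids via Path.GetInvalidFileNameChars (a sketch; the set of invalid characters is platform-dependent, so the result differs between Windows and Unix):

```csharp
using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        string name = "repo<rt>:2024?.txt";
        // Keep only characters the current platform allows in file names
        var invalid = Path.GetInvalidFileNameChars();
        string safe = new string(name.Where(c => !invalid.Contains(c)).ToArray());
        Console.WriteLine(safe);
    }
}
```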
public string ReturnCleanASCII(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (char c in s)
    {
        if ((int)c > 127) // you probably don't want 127 either
            continue;
        if ((int)c < 32)  // I bet you don't want control characters
            continue;
        if (c == '%')
            continue;
        if (c == '?')
            continue;
        sb.Append(c);
    }
    return sb.ToString();
}
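Called on a mixed string, the loop above keeps only printable ASCII minus '%' and '?':

```csharp
using System;
using System.Text;

class Program
{
    static string ReturnCleanASCII(string s)
    {
        StringBuilder sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if ((int)c > 127) continue; // non-ASCII
            if ((int)c < 32) continue;  // control characters
            if (c == '%') continue;
            if (c == '?') continue;
            sb.Append(c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // é, %, ? and the tab are all filtered out
        Console.WriteLine(ReturnCleanASCII("Hé%llo?\tworld")); // Hlloworld
    }
}
```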
I did a bit of testing, and bzlm's answer is the fastest valid one. But it turns out we can do much better. The conversion using encoding is equivalent to the following code once Encoding.Convert is inlined:
public static string StripUnicode(string unicode) {
    Encoding dstEncoding = GreedyAscii;
    Encoding srcEncoding = Encoding.UTF8;
    return dstEncoding.GetString(dstEncoding.GetBytes(srcEncoding.GetChars(srcEncoding.GetBytes(unicode))));
}
As you can clearly see, we perform two redundant actions by re-encoding through UTF-8. Why is that, you may ask? C# stores strings exclusively as UTF-16 code units, and because the Unicode encodings are intercompatible, these can be round-tripped through UTF-8. (Side note: bzlm's solution can break on malformed UTF-16, such as lone surrogates, which may throw an exception during transcoding.) The operation is therefore independent of the source encoding, since it is always UTF-16. Let's get rid of the redundant re-encoding and prevent those edge-case failures.
public static string StripUnicode(string unicode) {
    Encoding dstEncoding = GreedyAscii;
    return dstEncoding.GetString(dstEncoding.GetBytes(unicode));
}
We already have a simplified and perfectly workable solution, which requires less than half as much time to compute.
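A minimal, runnable sketch of this simplified version, inlining the GreedyAscii property that the full listing further down defines (same name, same fallbacks):

```csharp
using System;
using System.Text;

class Program
{
    // ASCII encoding that silently drops anything it cannot represent
    private static Encoding GreedyAscii { get; } = Encoding.GetEncoding(
        Encoding.ASCII.EncodingName,
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    static string StripUnicode(string unicode) =>
        GreedyAscii.GetString(GreedyAscii.GetBytes(unicode));

    static void Main()
    {
        Console.WriteLine(StripUnicode("Räksmörgås")); // Rksmrgs
    }
}
```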
There is not much more performance to be gained, but for further memory optimization we can do two things:
- Accept a ReadOnlySpan<char> for a more usable API.
- Attempt to fit the temporary byte[] onto the stack; otherwise use an array pool.
public static string StripUnicode(ReadOnlySpan<char> unicode) {
    return EnsureEncoding(unicode, GreedyAscii);
}

/// <summary>Produces a string which is compatible with the limiting encoding</summary>
/// <remarks>Ensure that the encoding does not throw on illegal characters</remarks>
public static string EnsureEncoding(ReadOnlySpan<char> unicode, Encoding limitEncoding) {
    int asciiBytesLength = limitEncoding.GetMaxByteCount(unicode.Length);
    byte[]? asciiBytes = asciiBytesLength <= 2048 ? null : ArrayPool<byte>.Shared.Rent(asciiBytesLength);
    Span<byte> asciiSpan = asciiBytes ?? stackalloc byte[asciiBytesLength];
    asciiBytesLength = limitEncoding.GetBytes(unicode, asciiSpan);
    asciiSpan = asciiSpan.Slice(0, asciiBytesLength);
    string asciiChars = limitEncoding.GetString(asciiSpan);
    if (asciiBytes is { }) {
        ArrayPool<byte>.Shared.Return(asciiBytes);
    }
    return asciiChars;
}

private static Encoding GreedyAscii { get; } = Encoding.GetEncoding(Encoding.ASCII.EncodingName, new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());
You can see this snippet in action on sharplab.io.
If you want a string with only ISO-8859-1 characters, excluding the characters that are not defined in that code page, you should use this expression:
var result = Regex.Replace(value, @"[^\u0020-\u007E\u00A0-\u00FF]+", string.Empty);
Note: using the Encoding.GetEncoding("ISO-8859-1") method will not do the job, because undefined characters are not excluded. See the Wikipedia ISO-8859-1 code page for more details.
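The pattern keeps printable ASCII (\u0020-\u007E) and the defined Latin-1 range (\u00A0-\u00FF) while dropping everything else, including the C1 control block:

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // ™ (U+2122) is outside both kept ranges; the accented letters are within \u00A0-\u00FF
        string value = "Räksmörgås™";
        var result = Regex.Replace(value, @"[^\u0020-\u007E\u00A0-\u00FF]+", string.Empty);
        Console.WriteLine(result); // Räksmörgås
    }
}
```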
Just decode the Unicode escape sequences using Regex.Unescape(s).
You can use Char.IsAscii to identify the characters you want to keep. A simple implementation might look like:
public static string StripNonAscii(this string input)
{
    StringBuilder resultBuilder = new();
    foreach (char character in input)
        if (char.IsAscii(character))
            resultBuilder.Append(character);
    return resultBuilder.ToString();
}
Necromancing.
Also, the method by bzlm can be used to remove characters that are not in an arbitrary charset, not just ASCII:
// https://en.wikipedia.org/wiki/Code_page#EBCDIC-based_code_pages
// https://en.wikipedia.org/wiki/Windows_code_page#East_Asian_multi-byte_code_pages
// https://en.wikipedia.org/wiki/Chinese_character_encoding
System.Text.Encoding encRemoveAllBut = System.Text.Encoding.ASCII;
encRemoveAllBut = System.Text.Encoding.GetEncoding(System.Globalization.CultureInfo.InstalledUICulture.TextInfo.ANSICodePage); // System-encoding
encRemoveAllBut = System.Text.Encoding.GetEncoding(1252); // Western European (Windows-1252)
encRemoveAllBut = System.Text.Encoding.GetEncoding(1251); // Cyrillic (Windows-1251)
encRemoveAllBut = System.Text.Encoding.GetEncoding("ISO-8859-5"); // used by less than 0.1% of websites
encRemoveAllBut = System.Text.Encoding.GetEncoding(37); // IBM EBCDIC US-Canada
encRemoveAllBut = System.Text.Encoding.GetEncoding(500); // IBM EBCDIC Latin 1
encRemoveAllBut = System.Text.Encoding.GetEncoding(936); // Chinese Simplified
encRemoveAllBut = System.Text.Encoding.GetEncoding(950); // Chinese Traditional
encRemoveAllBut = System.Text.Encoding.ASCII; // putting ASCII again, as to answer the question
// https://mcmap.net/q/11561/-how-can-you-strip-non-ascii-characters-from-a-string-in-c
string inputString = "RäksmörПривет, мирgås";
string asAscii = encRemoveAllBut.GetString(
    System.Text.Encoding.Convert(
        System.Text.Encoding.UTF8,
        System.Text.Encoding.GetEncoding(
            encRemoveAllBut.CodePage,
            new System.Text.EncoderReplacementFallback(string.Empty),
            new System.Text.DecoderExceptionFallback()
        ),
        System.Text.Encoding.UTF8.GetBytes(inputString)
    )
);
System.Console.WriteLine(asAscii);
AND for those that just want to remove the accents:
(caution, because Normalize != Latinize != Romanize)
// string str = Latinize("(æøå âôû?aè");
public static string Latinize(string stIn)
{
    // Special treatment for German Umlauts
    stIn = stIn.Replace("ä", "ae");
    stIn = stIn.Replace("ö", "oe");
    stIn = stIn.Replace("ü", "ue");
    stIn = stIn.Replace("Ä", "Ae");
    stIn = stIn.Replace("Ö", "Oe");
    stIn = stIn.Replace("Ü", "Ue");
    // End special treatment for German Umlauts

    string stFormD = stIn.Normalize(System.Text.NormalizationForm.FormD);
    System.Text.StringBuilder sb = new System.Text.StringBuilder();

    for (int ich = 0; ich < stFormD.Length; ich++)
    {
        System.Globalization.UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)
        {
            sb.Append(stFormD[ich]);
        }
    } // Next ich

    //return (sb.ToString().Normalize(System.Text.NormalizationForm.FormC));
    return (sb.ToString().Normalize(System.Text.NormalizationForm.FormKC));
} // End Function Latinize
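An example of the umlaut table plus diacritic stripping in action (self-contained copy of Latinize; the input "Müller café" is my own illustration):

```csharp
using System;
using System.Globalization;
using System.Text;

class Program
{
    static string Latinize(string stIn)
    {
        // German umlauts get their conventional two-letter transliteration
        stIn = stIn.Replace("ä", "ae").Replace("ö", "oe").Replace("ü", "ue")
                   .Replace("Ä", "Ae").Replace("Ö", "Oe").Replace("Ü", "Ue");
        // FormD splits base letters from combining marks, which are then dropped
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder();
        foreach (char c in stFormD)
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        return sb.ToString().Normalize(NormalizationForm.FormKC);
    }

    static void Main()
    {
        // ü → ue via the replace table; é loses its combining acute after FormD
        Console.WriteLine(Latinize("Müller café")); // Mueller cafe
    }
}
```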