How to convert (transliterate) a string from UTF-8 to ASCII (single byte) in C#?
P

6

11

I have a string object

"with multiple characters and even special characters"

I am trying to use

UTF8Encoding utf8 = new UTF8Encoding();
ASCIIEncoding ascii = new ASCIIEncoding();

objects in order to convert that string to ASCII. Could someone shed some light on this simple task, which has been haunting my afternoon?

EDIT 1: What we are trying to accomplish is getting rid of special characters, such as some of the special Windows apostrophes. The code that I posted below as an answer will not take care of that. Basically

O’Brian will become O?Brian, where ’ is one of the special apostrophes.

Precast answered 31/1, 2009 at 0:14 Comment(1)
Note that if you want to replace accented characters with their unaccented equivalents, you can use str.Normalize(NormalizationForm.FormKD) - Applicant
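A note on the comment above: Normalize(NormalizationForm.FormKD) on its own only decomposes an accented letter into a base letter plus a combining mark, so the marks still have to be filtered out afterwards. A minimal sketch of both steps follows; the class and method names are hypothetical and not from the thread.

```csharp
using System.Globalization;
using System.Text;

public static class DiacriticsSketch
{
    // Hypothetical helper name, used here only for illustration.
    public static string RemoveDiacritics(string input)
    {
        // FormKD splits e.g. "ö" into "o" followed by a combining diaeresis (U+0308).
        string decomposed = input.Normalize(NormalizationForm.FormKD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Drop the combining marks and keep every other character.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Characters that have no decomposition (for example € or Cyrillic letters) pass through unchanged, so a further ASCII step is still needed if the result must be pure ASCII.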
M
20

This was in response to your other question, which looks like it's been deleted... the point still stands.

Looks like a classic Unicode to ASCII issue. The trick would be to find where it's happening.

.NET works fine with Unicode, assuming it's told it's Unicode to begin with (or left at the default).

My guess is that your receiving app can't handle it. So, I'd probably use ASCIIEncoding with an EncoderReplacementFallback of String.Empty:

using System.IO;
using System.Text;

string inputString = GetInput();
// Encoding.ASCII is read-only, so clone it before changing the fallback
Encoding ascii = (Encoding)Encoding.ASCII.Clone();
// drop characters that have no ASCII equivalent instead of writing '?'
ascii.EncoderFallback = new EncoderReplacementFallback(string.Empty);

byte[] bAsciiString = ascii.GetBytes(inputString);

// Do something with bytes...
// can write to a file as is
File.WriteAllBytes(FILE_NAME, bAsciiString);
// or turn back into a "clean" string
string cleanString = ascii.GetString(bAsciiString);
// since the offending characters have been removed, can use default encoding as well
Assert.AreEqual(cleanString, Encoding.Default.GetString(bAsciiString));

Of course, in the old days, we'd just loop through and remove any chars greater than 127... well, those of us in the US at least. ;)
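For comparison, the "old days" loop mentioned above might look roughly like the sketch below. The class and method names are made up for illustration.

```csharp
using System.Text;

public static class LoopSketch
{
    // Hypothetical helper: keeps only chars in the ASCII range 0..127.
    public static string StripNonAscii(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c <= 127)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Like the empty-string fallback, this silently drops characters, so "O’Brian" comes out as "OBrian".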

Morman answered 31/1, 2009 at 0:34 Comment(3)
Thanks, it worked perfectly. I just had to make a small change: Encoding encoder = ASCIIEncoding.GetEncoding("us-ascii", new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback()); - Precast
+1 for EncoderReplacementFallback - I had never heard of that before. Love it. - Corrosion
EncoderReplacementFallback with a question mark is the default. In this case, it seems a "better lossy" is desirable. An exception fallback is useful when lossy is intolerable (which IMHO should be the default). - Bankhead
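To illustrate the exception fallback mentioned in the last comment, here is a small sketch that refuses lossy conversion instead of dropping or replacing characters. TryEncodeAscii is a hypothetical helper name; the fallback classes themselves are part of System.Text.

```csharp
using System;
using System.Text;

public static class StrictAsciiSketch
{
    // Hypothetical helper: reports failure rather than silently losing characters.
    public static bool TryEncodeAscii(string input, out byte[] bytes)
    {
        Encoding strictAscii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            bytes = strictAscii.GetBytes(input);
            return true;
        }
        catch (EncoderFallbackException)
        {
            // At least one character has no ASCII representation.
            bytes = Array.Empty<byte>();
            return false;
        }
    }
}
```

With this, a plain "O'Brian" encodes to 7 bytes, while a string containing a typographic apostrophe is rejected.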
P
12

I was able to figure it out. In case someone wants to know, below is the code that worked for me:

ASCIIEncoding ascii = new ASCIIEncoding();
byte[] byteArray = Encoding.UTF8.GetBytes(sOriginal);
byte[] asciiArray = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, byteArray);
string finalString = ascii.GetString(asciiArray);

Let me know if there is a simpler way of doing it.

Precast answered 31/1, 2009 at 0:25 Comment(2)
It's worth noting that if the string contains characters which can't be represented in ASCII, it won't be the same string after conversion. It might be missing those characters or it might become garbled, depending on how Encoding.Convert works (which I don't know). - Guillotine
Actually, I just tested some scenarios and what you are saying is true. Do you know how to overcome this limitation? For example, if I have one of the special apostrophes, I would like to replace it with the common one. - Precast
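One possible way to handle the apostrophe case is to map the typographic quotes to their ASCII counterparts before encoding. The sketch below only covers four common quote characters, and the class and method names are hypothetical.

```csharp
using System.Text;

public static class PunctuationSketch
{
    // Hypothetical helper: swaps curly quotes for straight ones, then drops
    // whatever else cannot be represented in ASCII.
    public static string ToAsciiKeepingQuotes(string input)
    {
        string replaced = input
            .Replace('\u2018', '\'')  // left single quotation mark
            .Replace('\u2019', '\'')  // right single quotation mark (typographic apostrophe)
            .Replace('\u201C', '"')   // left double quotation mark
            .Replace('\u201D', '"');  // right double quotation mark

        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback());
        return ascii.GetString(ascii.GetBytes(replaced));
    }
}
```

The mapping table is the part to extend if other punctuation (dashes, ellipses) needs an ASCII substitute rather than removal.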
A
7

For anyone who likes Extension methods, this one does the trick for us.

using System.Text;

namespace System
{
    public static class StringExtension
    {
        private static readonly ASCIIEncoding asciiEncoding = new ASCIIEncoding();

        public static string ToAscii(this string dirty)
        {
            byte[] bytes = asciiEncoding.GetBytes(dirty);
            string clean = asciiEncoding.GetString(bytes);
            return clean;
        }
    }
}

(System namespace so it's available pretty much automatically for all of our strings.)

Antefix answered 2/2, 2012 at 21:48 Comment(0)
E
6

Based on Mark's answer above (and Geo's comment), I created a two-liner version to remove all ASCII exception cases from a string. Provided for people searching for this answer (as I did).

using System.Text;

// Create encoder with a replacing encoder fallback
var encoder = ASCIIEncoding.GetEncoding("us-ascii", 
    new EncoderReplacementFallback(string.Empty), 
    new DecoderExceptionFallback());

string cleanString = encoder.GetString(encoder.GetBytes(dirtyString)); 
Elevon answered 28/3, 2014 at 9:33 Comment(0)
P
2

If you want an 8-bit representation of characters, as used in many encodings, this may help you.

You must change variable targetEncoding to whatever encoding you want.

Encoding targetEncoding = Encoding.GetEncoding(874); // Your target encoding
Encoding utf8 = Encoding.UTF8;

var stringBytes = utf8.GetBytes(Name);
var stringTargetBytes = Encoding.Convert(utf8, targetEncoding, stringBytes);
var ascii8BitRepresentAsCsString = Encoding.GetEncoding("Latin1").GetString(stringTargetBytes);
Partida answered 17/7, 2016 at 14:23 Comment(0)
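One caveat for the snippet above, assuming it runs on .NET Core or .NET 5+ rather than .NET Framework: legacy code pages such as 874 are not available there until the code pages provider is registered, and Encoding.GetEncoding(874) throws NotSupportedException otherwise. A minimal sketch, with a hypothetical class and method name:

```csharp
using System.Text;

public static class CodePageSketch
{
    // Hypothetical helper: encodes text to Windows code page 874 (Thai).
    public static byte[] ToCodePage874(string text)
    {
        // Assumption: .NET Core / .NET 5+. Registering more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding target = Encoding.GetEncoding(874);
        return target.GetBytes(text);
    }
}
```

On older target frameworks, CodePagesEncodingProvider comes from the System.Text.Encoding.CodePages package.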
S
0

Here is code to transliterate Unicode chars to their closest ASCII version where possible. It removes or fixes accents, macrons, typesetter's colons, dashes, curly quotes, apostrophes, invisible spaces, and other bad chars.

This is useful if you need to feed data into another system that does not support Unicode. The code is fast because it uses a StringBuilder and a simple loop (tested: an 8,000-char string processed 10,000 times took 1.1 sec).

Address:123 East Tāmaki – Tāmaki“ ” GötheФ€ O’Briens ‘hello’ he said!

outputs ->

Address:123 East Tamaki - Tamaki" " Gothe O'Briens 'hello' he said!

    /// <summary>
    /// Transliterate all unicode chars to their closest ascii version
    /// Remove/fix accents, maori macrons, typesetters colons, dashes, curly quotes, apostrophes, dashes, invisible spaces, and other bad chars
    /// 1. remove accents but keep the letters
    /// 2. fix punctuation to the closest ascii punctuation
    /// 3. remove any remaining non ascii chars
    /// 4. also remove any invisible control chars
    /// Option: remove line breaks or keep them
    /// </summary>
    /// <example>"CHASSIS NO.:LC0CE4CB3N0345426 East Tāmaki – East Tāmaki“ ” GötheФ€ O’Briens ‘hello’ he said!" outputs "CHASSIS NO.:LC0CE4CB3N0345426 East Tamaki - East Tamaki" " Gothe O'Briens 'hello' he said!"</example>
    public static string CleanUnicodeTransliterateToAscii(string text, bool removeLineBreaks) {
        if (text == null) return null;

        // decomposes accented letters into the letter and the diacritic, fixes wacky punctuation to closest common punctuation
        text = text.Normalize(NormalizationForm.FormKD);

        // loop all chars after converting all punctuation to the closest (fix curly quotes etc)
        var stringBuilder = new StringBuilder();
        foreach (var c in text) {
            var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
            if (c == '\r' || c == '\n') {
                if (removeLineBreaks) {
                    // skip
                } else {
                    stringBuilder.Append(c);
                }
            } else if (unicodeCategory == UnicodeCategory.Control) {
                // control char - skip
            } else if (unicodeCategory == UnicodeCategory.NonSpacingMark) {
                // diacritic mark/accent - skip             
            } else if (c == '‘' || c == '’') {
                // single curly quote or apostrophe add apostrophe
                stringBuilder.Append("'");
            } else if (unicodeCategory == UnicodeCategory.InitialQuotePunctuation || unicodeCategory == UnicodeCategory.FinalQuotePunctuation) {
                // any other quote add a normal straight quote
                stringBuilder.Append("\"");
            } else if (unicodeCategory == UnicodeCategory.DashPunctuation) {
                stringBuilder.Append("-");
            } else if (unicodeCategory == UnicodeCategory.SpaceSeparator) {
                // add a normal space
                stringBuilder.Append(" ");
            } else if (c > 127) {
                // skip any remaining non ascii chars (ASCII ends at 127)
            } else {
                stringBuilder.Append(c);
            }
        }
        text = stringBuilder.ToString();
        return text;
    }
Suggestive answered 9/1 at 6:40 Comment(0)