Remove all non-ASCII characters from string

I have a C# routine that imports data from a CSV file, matches it against a database and then rewrites it to a file. The source file seems to have a few non-ASCII characters that are fouling up the processing routine.

I already have a static method that I run each input field through but it performs basic checks like removing commas and quotes. Does anybody know how I could add functionality that removes non-ASCII characters too?

Yen answered 5/10, 2009 at 23:18 Comment(0)
J
45
string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
Jaffa answered 5/10, 2009 at 23:22 Comment(2)
Important to note that using ASCIIEncoding will replace all non-ASCII characters with '?' (63), which may or may not be what you want or expect. - Rochelle
Furthermore, you can check whether the string contains only ASCII by testing s == sOut. - Rexford
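For future readers, here is a minimal sketch of the behavior the comments above describe. The class and method names (AsciiRoundTripDemo, RoundTrip, IsAsciiOnly) are made up for illustration and are not part of the original answer. It only shows that the round trip substitutes '?' rather than removing anything.

```csharp
using System;
using System.Text;

// Hypothetical helper names, chosen for this illustration only.
public static class AsciiRoundTripDemo
{
    // Encodes to ASCII bytes and decodes again. Characters that ASCII
    // cannot represent are substituted with '?', not dropped.
    public static string RoundTrip(string s)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
    }

    // The check suggested in the comment: the string is pure ASCII
    // exactly when the round trip leaves it unchanged.
    public static bool IsAsciiOnly(string s)
    {
        return s == RoundTrip(s);
    }
}
```

With this sketch, RoundTrip("caf\u00E9") gives "caf?", so a later step would still be needed if the goal is to remove the character instead of replacing it.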
R
59

Here is a simple solution:

public static bool IsASCII(this string value)
{
    // ASCII encoding replaces non-ascii with question marks, so we use UTF8 to see if multi-byte sequences are there
    return Encoding.UTF8.GetByteCount(value) == value.Length;
}

source: http://snipplr.com/view/35806/

Rexford answered 3/1, 2013 at 18:58 Comment(4)
This solution has the benefit of working in portable class libraries, where Encoding.ASCII is not available. - Schoolmaster
It also has the benefit of being a lot faster than the accepted solution because it does not need to actually create an encoded string. - Whitsuntide
-1; the question asked for "functionality that removes non-ASCII characters", which this doesn't do. The title was ambiguous, but the solution to that is to clarify the title (which I've done), not to answer a question that the OP didn't ask. This might be a good answer to a different question than the one you've posted it on, but is a non-answer to the one you did. - Demarcate
You are a genius! - Unhitch
L
15

Do it all at once

public string ReturnCleanASCII(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach(char c in s)
    {
       if((int)c > 127) // you probably don't want 127 either
          continue;
       if((int)c < 32)  // I bet you don't want control characters 
          continue;
       if(c == ',')
          continue;
       if(c == '"')
          continue;
       sb.Append(c);
    }
    return sb.ToString();
}
Lakendra answered 1/8, 2016 at 13:22 Comment(1)
I wanted to keep tab, line feed and carriage return (9, 10, 13), so I added if ((int)c == 9 || (int)c == 10 || (int)c == 13) as the first check and append the character there. - Extradite
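A rough sketch of the variant described in the comment above, assuming you want to keep tab, line feed and carriage return while still dropping other control characters, non-ASCII characters, commas and quotes. The names AsciiFilter and CleanKeepingWhitespace are hypothetical. Unlike the answer's code, this version also drops DEL (127), following the hint in its own comment.

```csharp
using System.Text;

// Hypothetical names; a sketch, not the original answer's method.
public static class AsciiFilter
{
    public static string CleanKeepingWhitespace(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            // Tab (9), line feed (10) and carriage return (13) are kept.
            bool keptControl = c == '\t' || c == '\n' || c == '\r';
            // Printable ASCII is 32..126; 127 (DEL) is excluded here by assumption.
            bool printable = c >= 32 && c < 127;
            if (!keptControl && !printable)
                continue;
            // Same CSV-related removals as in the answer.
            if (c == ',' || c == '"')
                continue;
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```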
U
8

If you wanted to test a specific character, you could use

if ((int)myChar <= 127)

Just getting the ASCII encoding of the string will not tell you that a specific character was non-ASCII to begin with (if you care about that). See MSDN.

Urrutia answered 5/10, 2009 at 23:30 Comment(0)
P
7

Here's an improvement upon the accepted answer:

string fallbackStr = "";

Encoding enc = Encoding.GetEncoding(Encoding.ASCII.CodePage,
  new EncoderReplacementFallback(fallbackStr),
  new DecoderReplacementFallback(fallbackStr));

string cleanStr = enc.GetString(enc.GetBytes(inputStr));

This method will replace unknown characters with the value of fallbackStr, or if fallbackStr is empty, leave them out entirely. (Note that enc can be defined outside the scope of a function.)

Pecos answered 26/8, 2016 at 18:11 Comment(0)
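To make the difference between an empty and a non-empty fallback concrete, here is a small sketch that wraps the same calls in a method. The wrapper names (AsciiFallbackDemo, Strip) and the optional parameter are assumptions added for illustration; only the Encoding.GetEncoding call with the two replacement fallbacks comes from the answer.

```csharp
using System.Text;

// Hypothetical wrapper around the approach from the answer above.
public static class AsciiFallbackDemo
{
    // fallback = "" removes unrepresentable characters entirely;
    // any other value is substituted in their place.
    public static string Strip(string input, string fallback = "")
    {
        Encoding enc = Encoding.GetEncoding(
            Encoding.ASCII.CodePage,
            new EncoderReplacementFallback(fallback),
            new DecoderReplacementFallback(fallback));
        return enc.GetString(enc.GetBytes(input));
    }
}
```

Called as Strip("caf\u00E9 au lait") this should drop the accented character, while Strip("caf\u00E9", "_") should substitute an underscore. If this runs in a hot loop, building the encoding once and reusing it (as the answer notes) is probably the better design.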
B
2

It seems kind of strange that dropping the non-ASCII characters is acceptable.

Also, I always recommend the excellent FileHelpers library for parsing CSV files.

Breechloader answered 5/10, 2009 at 23:29 Comment(0)
C
1
strText = Regex.Replace(strText, @"[^\u0020-\u007E]", string.Empty);
Coincident answered 13/4, 2022 at 14:43 Comment(1)
Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, can you edit your answer to include an explanation of what you're doing and why you believe it is the best approach? - Shirleyshirlie
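Since the one-liner above came without an explanation, here is a commented sketch of what it appears to do. The character class [^\u0020-\u007E] matches anything outside the printable ASCII range (space through tilde), so control characters such as tab and newline are removed along with non-ASCII characters. The names AsciiRegexDemo and Strip are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper; the pattern itself is the one from the answer above.
public static class AsciiRegexDemo
{
    // Negated class: every character that is NOT between U+0020 (space)
    // and U+007E (tilde), i.e. not printable ASCII.
    private static readonly Regex NonPrintableAscii =
        new Regex(@"[^\u0020-\u007E]");

    // Replaces each match with the empty string, which deletes it.
    public static string Strip(string text)
    {
        return NonPrintableAscii.Replace(text, string.Empty);
    }
}
```

Note that this also strips tabs and line breaks, which may or may not be wanted for CSV fields.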
B
0
    public string RunCharacterCheckASCII(string s)
    {
        // Null input is returned unchanged.
        if (s == null)
            return s;

        // Replace every character outside 7-bit ASCII (code > 127) with '?'.
        // Note that this substitutes characters rather than removing them.
        char[] chars = s.ToCharArray();
        bool found = false;
        for (int i = 0; i < chars.Length; i++)
        {
            if (chars[i] > 127)
            {
                found = true;
                chars[i] = '?';
            }
        }

        // Only allocate a new string if something was actually replaced.
        return found ? new string(chars) : s;
    }
Bound answered 8/6, 2016 at 1:43 Comment(0)
