How to recognize if a string contains unicode chars?

I have a string and I want to know whether it contains any Unicode characters, that is, whether it consists entirely of ASCII characters or not.

How can I achieve that?

Thanks!

Dalpe answered 16/12, 2010 at 10:13 Comment(3)
I think you need to tell us more, since all strings in .NET are unicode. Are you afraid you're going to lose some characters in an encoding process? If so, please tell us what you intend to use the knowledge for.Achromatic
I want to know if something complies with ASCII or not... (fully comply)Dalpe
Use a regex. This is a related question: a regex can be used to replace or to match. The following answer is about replacing, but you can use a regex for matching too: #7411938 Anoint

If my assumptions are correct you wish to know if your string contains any "non-ANSI" characters. You can derive this as follows.

    using System;
    using System.Linq; // required for the Any() extension method on string

    public void test()
    {
        const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
        const string WithoutUnicodeCharacter = "an ANSI character:Æ";

        bool hasUnicode;

        // true: U+FB2F is above 255
        hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
        Console.WriteLine(hasUnicode);

        // false: Æ is U+00C6 (198), which is within 0-255
        hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
        Console.WriteLine(hasUnicode);
    }

    // Returns true if any UTF-16 code unit in the input is above 255.
    public bool ContainsUnicodeCharacter(string input)
    {
        const int MaxAnsiCode = 255;

        return input.Any(c => c > MaxAnsiCode);
    }

Update

This check treats extended ASCII (codes up to 255) as non-Unicode. If you test only against the true ASCII range (up to 127), you could get false positives for extended ASCII characters, which do not denote Unicode. I have alluded to this in my sample.

Open answered 16/12, 2010 at 10:25 Comment(1)
This is incorrect. A C# char is a unicode UTF-16 character. Only up to 127 are the characters the same as in ASCII. The ASCII extended range will be different depending on the locale used, i.e. ANSI not Extended ASCII. So for English ISO-8859-1 the characters will match UTF-16 but they won't be the same characters in other locales. See the comparison table here: en.wikipedia.org/wiki/ISO/IEC_8859.Lanate

If a string contains only ASCII characters, encoding it to bytes with the ASCII encoding and decoding it again should return the same string, so a one-line check in C# could look like this:

// Encoding.ASCII is used here rather than GetEncoding(0): GetEncoding(0) returns
// the system default encoding, not ASCII, so its result depends on the machine.
String s1="testभारत";
bool isUnicode= System.Text.Encoding.ASCII.GetString(System.Text.Encoding.ASCII.GetBytes(s1)) != s1;
Afflictive answered 22/8, 2017 at 20:17 Comment(3)
It does not work for say russian test: System.Text.ASCIIEncoding.GetEncoding(0).GetString(System.Text.ASCIIEncoding.GetEncoding(0).GetBytes("фы")) != "фы" returns False.Strike
i tested your exact statement in a console application and it returns True for me.Afflictive
I have tested this in linqPad - it returns false.Strike

ASCII defines only character codes in the range 0-127. Unicode is explicitly defined to coincide with ASCII in that range. Thus, if you look at the character codes in your string and any of them is higher than 127, the string contains Unicode characters that are not ASCII characters.
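
As a rough illustration of the check described above, the sketch below tests every UTF-16 code unit against 127. The names `AsciiCheck` and `IsAsciiOnly` are made up for this example, and treating the empty string as ASCII-only is an assumption.

```csharp
using System;
using System.Linq;

public static class AsciiCheck
{
    // Highest character code defined by ASCII.
    private const int MaxAsciiCode = 127;

    // Returns true when every UTF-16 code unit in the input is within 0-127.
    // An empty string counts as ASCII-only (an assumption); null is rejected.
    public static bool IsAsciiOnly(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        return input.All(c => c <= MaxAsciiCode);
    }
}
```

With this sketch, `AsciiCheck.IsAsciiOnly("hello")` is true, while a string containing `Æ` or `\uFB2F` gives false.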

Note that ASCII includes only the English alphabet. Thus, if you (for whatever reason) need to apply the same approach to strings that might contain accented characters (Spanish text, for example), ASCII is not sufficient and you need to look for another differentiator.

The ANSI character set [*] extends ASCII with the aforementioned accented Latin characters in the range 128-255. However, Unicode does not overlap with ANSI in that whole range, so technically a Unicode string might contain characters that are not part of ANSI but have the same character code (specifically in the range 128-159, as you can see from the table I linked to).
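
To make the mismatch in the 128-159 range concrete, here is a small sketch that uses only `char` values and no code page APIs. It relies on two facts: the euro sign is U+20AC in Unicode while Windows-1252 stores it as byte 0x80, and U+0080 in Unicode is a control character. `AnsiRangeDemo` and its members are hypothetical names chosen for this example.

```csharp
using System;

public static class AnsiRangeDemo
{
    // The euro sign: byte 0x80 in Windows-1252, but code point U+20AC in Unicode.
    public const char Euro = '\u20AC';

    // U+0080 in Unicode is a C1 control character, not the euro sign.
    public const char UnicodeAt0x80 = '\u0080';

    // A "> 255" test, as used elsewhere on this page, applied to a single char.
    public static bool IsAbove255(char c)
    {
        return c > 255;
    }
}
```

Here `IsAbove255(AnsiRangeDemo.Euro)` is true even though Windows-1252 contains the euro sign, and `IsAbove255(AnsiRangeDemo.UnicodeAt0x80)` is false even though U+0080 is not the character Windows-1252 places at 0x80.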

As for the actual code to do this, @chibacity's answer should work, although you should modify it to test against strict ASCII (127), because the 255 threshold won't work for ANSI.

[*] Also known as Windows Latin 1 (Windows-1252)

Macedonia answered 16/12, 2010 at 10:58 Comment(0)

As long as it contains characters, it contains Unicode characters.

From System.String:

Represents text as a series of Unicode characters.

public static bool ContainsUnicodeChars(string text)
{
   return !string.IsNullOrEmpty(text);
}

You normally have to worry about different Unicode encodings when you have to:

  1. Encode a string into a stream of bytes with a particular encoding.
  2. Decode a string from a stream of bytes with a particular encoding.

Once you're into string land though, the encoding that the string was originally represented with, if any, is irrelevant.
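
The two steps above can be sketched as follows. This is only an illustration: `EncodingRoundTrip` and its methods are hypothetical names, while `Encoding.UTF8`, `Encoding.Unicode` (UTF-16) and `Encoding.ASCII` are the standard .NET encodings.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Encodes the text to bytes with the given encoding, then decodes it again.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    // Number of bytes the text occupies in the given encoding.
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }
}
```

For the string `"a\u00E9"` ("aé"), UTF-8 needs 3 bytes and UTF-16 needs 4, yet both round trips return the original string. A round trip through `Encoding.ASCII` does not, because é cannot be represented and is replaced with `?`.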

Each character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded by using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.
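
A short sketch of what the quoted passage means in practice: a code point outside the Basic Multilingual Plane occupies two `Char` values (a surrogate pair) in a string. `CodePointDemo` and `CountCodePoints` are hypothetical names, and U+1F600 is just an arbitrary example code point.

```csharp
using System;

public static class CodePointDemo
{
    // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs two Char values.
    public const string Grinning = "\U0001F600";

    // Counts Unicode code points rather than UTF-16 code units.
    public static int CountCodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            // A well-formed surrogate pair counts once, so skip its low half.
            if (char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                i++;
            }
            count++;
        }
        return count;
    }
}
```

Both halves of a surrogate pair have codes far above 127, so a per-`Char` ASCII check still classifies such strings as non-ASCII.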

Perhaps you might also find these questions relevant:

How can you strip non-ASCII characters from a string? (in C#)

C# Ensure string contains only ASCII

And this article by Jon Skeet: Unicode and .NET

Cringle answered 16/12, 2010 at 10:16 Comment(2)
Unicode is a superset of ASCII. The question is clearly about how to determine if the string only uses ASCII characters. So this answer seems unnecessarily pedantic to me...Coniine
@Zero3: The edit to the question was made after my answer.Cringle

This is another solution that does not use lambda expressions. It is in VB.NET, but you can easily convert it to C#:

   Public Function ContainsUnicode(ByVal inputstr As String) As Boolean
        Dim inputCharArray() As Char = inputstr.ToCharArray

        For i As Integer = 0 To inputCharArray.Length - 1
            If CInt(AscW(inputCharArray(i))) > 255 Then Return True
        Next
        Return False
   End Function
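
For reference, a possible C# conversion of the function above, kept as a plain loop without lambdas or LINQ. The wrapping class name `UnicodeCheck` is an assumption, and the 255 threshold is carried over unchanged from the VB.NET version.

```csharp
using System;

public static class UnicodeCheck
{
    // Plain-loop C# counterpart of the VB.NET function above.
    public static bool ContainsUnicode(string inputstr)
    {
        for (int i = 0; i < inputstr.Length; i++)
        {
            // A C# char converts implicitly to its UTF-16 code, like AscW in VB.NET.
            if (inputstr[i] > 255) return true;
        }
        return false;
    }
}
```

To test for strict ASCII instead, the threshold would be 127 rather than 255.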
Aldaaldan answered 26/10, 2016 at 3:1 Comment(2)
There are only 128 characters in ASCII, so the > 255 does not appear to be correct.Coniine
There are 256 characters including the extended ascii character codes based on this table ascii-code.comAldaaldan
