What version of Unicode is supported by which .NET platform and on which version of Windows in regards to character classes?

Asked 6/2, 2012 at 15:4 Answered 12/5, 2015 at 15:15

Solved c#.net utf-16 ucs2 astral-plane

Updated question ¹

With regard to character classes, comparison, sorting, normalization and collations, which Unicode version or versions are supported by which .NET platforms?

Original question

I remember somewhat vaguely having read that .NET supported Unicode version 3.0 and that the internal UTF-16 encoding is not really UTF-16 but actually UCS-2, which is not the same. It seems, for instance, that characters above U+FFFF are not possible. For example, consider:

string s = "\u1D7D9"; // ("Mathematical double-struck digit one")

and it stores the string "ᵽ9".

I'm basically looking for definitive references that answer the following:

If it isn't true UTF-16 in .NET, what is it?
What version of Unicode is supported by .NET?
If recent versions are not supported or planned in the near future, does anybody know of a (non)commercial library, or how I can work around this issue?

¹) I updated the question because, with passing time, the new wording seemed more appropriate with respect to the answers and to the larger community. I left the original question in place; parts of it have been answered in the comments. Also, while the old UCS-2 (no surrogates) was used in now-ancient 32-bit Windows versions, .NET has always used UTF-16 (with surrogates) internally.

Discontinuity answered 6/2, 2012 at 15:4 Comment(6)

What exactly are you trying to do with those characters? Put them in a webpage with ASP.NET? Display them in a WPF or WinForms interface? – Felecia 6/2, 2012 at 15:15

What does "it doesn't seem to work" mean in this context? – Latticed 6/2, 2012 at 15:47

@JoeStrommen: we're implementing a new XML-based data transformation toolset, and I'm trying to find out whether I can say "we support Unicode up to 6.0" or whether we should say something else. In addition, I'm trying to find out how we could bypass possible limitations in .NET. – Discontinuity 6/2, 2012 at 15:52

@Gabe: I updated my question, hopefully it's clearer now. – Discontinuity 6/2, 2012 at 15:56

Oh, it looks like you were just using the wrong escape mechanism in C# -- it has nothing to do with .NET. Your string was interpreted as "\u1D7D" + "9". You just need "\U0001D7D9". – Latticed 6/2, 2012 at 16:1

@Gabe: indeed, I wasn't aware of \U (never needed it before I guess) and then wrongly concluded that there was no support for higher planes. – Discontinuity 6/2, 2012 at 16:25

Internally, .NET is UTF-16. In some cases, e.g. when ASP.NET writes to a response, by default it uses UTF-8. Both of them can handle higher planes.

The reason people sometimes refer to .NET as UCS2 is (I think, because I see few other reasons) that Char is strictly 16 bit and a single Char can't be used to represent the upper planes. Char does, however, have static method overloads (e.g. Char.IsLetter) that can operate on high plane UTF-16 characters inside a string. Strings are stored as true UTF-16.

You can address high Unicode codepoints directly using uppercase \U - e.g. "\U0001D7D9" - but again, only inside strings, not chars.
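To make the difference between the two escapes concrete, here is a minimal sketch (the class and constant names are just for illustration). It assumes nothing beyond the standard `char` helpers: `\u` always consumes exactly four hex digits, `\U` exactly eight.

```csharp
using System;

public static class EscapeDemo
{
    // "\u" takes exactly four hex digits, so this is U+1D7D followed by the digit '9'.
    public const string FourDigitEscape = "\u1D7D9";

    // "\U" takes exactly eight hex digits and can name a code point above U+FFFF.
    public const string EightDigitEscape = "\U0001D7D9";

    public static void Main()
    {
        // Two unrelated chars: U+1D7D and '9'.
        Console.WriteLine(FourDigitEscape.Length);                                   // 2
        Console.WriteLine(FourDigitEscape[1]);                                       // 9

        // Also two chars, but they form a single surrogate pair.
        Console.WriteLine(EightDigitEscape.Length);                                  // 2
        Console.WriteLine(char.IsSurrogatePair(EightDigitEscape, 0));                // True
        Console.WriteLine(char.ConvertToUtf32(EightDigitEscape, 0).ToString("X"));   // 1D7D9
    }
}
```

The string overloads such as `char.IsSurrogatePair(string, int)` are the ones to reach for here, since a single `char` can only ever hold one half of the pair.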

As for Unicode version, from the MSDN documentation:

"In the .NET Framework 4, sorting, casing, normalization, and Unicode character information is synchronized with Windows 7 and conforms to the Unicode 5.1 standard."

Update 1: It's worth noting, however, that this does not imply that the entirety of Unicode 5.1 is supported, neither in Windows 7 nor in .NET 4.0.

Windows 8 targets Unicode 6.0 - I'm guessing that .NET Framework 4.5 might synchronize with that, but have found no sources confirming it. And once again, that doesn't mean the entire standard is implemented.

Update 2: This note on Roslyn confirms that the underlying platform defines the Unicode support for the compiler, and in the link to the code it explains that C# 6.0 supports Unicode 6.0 and up (with a breaking change for C# identifiers as a result).

Update 3: Since .NET version 4.5, a new class SortVersion is available to get the supported Unicode version through its FullVersion property (an instance property; a SortVersion can be obtained from CompareInfo.Version). On the same page, Microsoft explains that .NET 4.0 supports Unicode 5.0 on all platforms and .NET 4.5 supports Unicode 5.0 on Windows 7 and Unicode 6.0 on Windows 8. This slightly contradicts the official "what is new" statement here, which talks of version 5.x and 6.0 respectively. From my own (editor: Abel) experience, in most cases it seems that in .NET 4.0, Unicode 5.1 is supported at least for character classes, but I didn't test sorting, normalization and collations. This seems in line with what is said in MSDN as quoted above.
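For anyone who wants to check this on their own machine, a small sketch of reading the sort version follows. It assumes .NET Framework 4.5 or later (where `CompareInfo.Version` is available); the helper name `GetSortVersion` is made up for this example, and the values printed depend on the framework and operating system, so none are shown.

```csharp
using System;
using System.Globalization;

public static class SortVersionDemo
{
    // Hypothetical helper: returns the sort version used by a culture's comparer.
    public static SortVersion GetSortVersion(CultureInfo culture)
    {
        return culture.CompareInfo.Version;
    }

    public static void Main()
    {
        SortVersion version = GetSortVersion(CultureInfo.InvariantCulture);

        // FullVersion and SortId are instance properties of SortVersion.
        Console.WriteLine("FullVersion: " + version.FullVersion);
        Console.WriteLine("SortId:      " + version.SortId);
    }
}
```

Two strings sorted under different `SortVersion` values may order differently, so comparing the stored version against the current one is a reasonable way to decide whether persisted sort keys need rebuilding.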

Cooperative answered 6/2, 2012 at 15:49 Comment(6)

Good observation about char. I notice indeed that char uni = "\U0002B740".ToCharArray()[0]; shows "55405", which is only one half of the UTF-16 surrogate pair. It follows from your reference that trying Char.IsLetter on \u0526 (incorrectly) shows false, because it was only introduced with Unicode 6. – Discontinuity 6/2, 2012 at 16:20

(accepting this because you showed the reference I was looking for and was too stupid to find at its obvious location; however, the other answers are valuable in their own right) – Discontinuity 6/2, 2012 at 16:24

This might be a helpful point of origin for getting information for single characters: MSDN link. Since char cannot contain more than one half, the StringInfo methods return a string instead, with the complete UTF-16 pair (if the character is a pair - otherwise it just returns the single char - as a string, or character + combining characters for combining diacritics). – Cooperative 6/2, 2012 at 16:41

This makes much more sense now. The C# Language Spec considers char an unsigned 16-bit integral type. So it would seem that it was designed to have a fixed width, which would explain why a single char cannot hold a UTF-16 surrogate pair. – Prent 19/12, 2018 at 15:0

"Since .NET version 4.5 a new class SortVersion is introduced to get the supported Unicode version by calling the static property SortVersion.FullVersion" -- SortVersion.FullVersion isn't static – Filia 11/12, 2020 at 9:59

For the mapping from .NET platforms to unicode standards see learn.microsoft.com/en-us/dotnet/api/… – Louisville 15/11, 2022 at 17:7

That character is supported. One thing to note is that for Unicode characters above U+FFFF (which need more than two bytes in UTF-16), you must declare them with an uppercase '\U' followed by eight hex digits, like this:

string text = "\U0001D7D9";

If you create a WPF app with that character in a text block, it should render the double-struck one character perfectly.
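If you need to process such strings yourself rather than just display them, something along these lines can walk a string one code point at a time. This is only a sketch: `CodePoints.Enumerate` is a hypothetical helper, not a framework API, and it assumes the input contains no unpaired surrogates (`char.ConvertToUtf32` throws an `ArgumentException` otherwise).

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Hypothetical helper: yields each Unicode code point of a UTF-16 string.
    public static IEnumerable<int> Enumerate(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            // Combines a surrogate pair into one code point, or returns the BMP char as-is.
            int codePoint = char.ConvertToUtf32(text, i);

            // Skip the low surrogate that was just consumed as part of the pair.
            if (char.IsSurrogatePair(text, i))
            {
                i++;
            }

            yield return codePoint;
        }
    }

    public static void Main()
    {
        foreach (int codePoint in Enumerate("A\U0001D7D9"))
        {
            Console.WriteLine("U+" + codePoint.ToString("X4"));
        }
        // Prints:
        // U+0041
        // U+1D7D9
    }
}
```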

Felecia answered 6/2, 2012 at 15:42 Comment(1)

One more thing: read msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx for a description of how >2-byte chars are represented in a string. – Felecia 6/2, 2012 at 15:44

MSDN covers it briefly here: http://msdn.microsoft.com/en-us/library/9b1s4yhz(v=vs.90).aspx

I tried this:

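    using System;
    using System.IO;
    using System.Text;

    static class Program {
        static void Main(string[] args) {
            // Mathematical double-struck digit one, outside the BMP
            string someText = char.ConvertFromUtf32(0x1D7D9);
            using (var stream = new MemoryStream()) {
                // Swap in Encoding.UTF8 or Encoding.Unicode to compare encodings
                using (var writer = new StreamWriter(stream, Encoding.UTF32)) {
                    writer.Write(someText);
                    writer.Flush();
                }
                // ToArray still works after the writer has closed the stream
                var bytes = stream.ToArray();
                foreach (var oneByte in bytes) {
                    Console.WriteLine(oneByte.ToString("x2"));
                }
            }
        }
    }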

And got a dump of a byte array containing a correct BOM and the correct representation of the U+1D7D9 code point, for these encodings:

UTF8
UTF32
Unicode (UTF-16)

So my guess is that higher planes are supported, and that UTF-16 is really UTF-16 (and not UCS-2).
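A shorter variation on the same experiment, for comparison: the sketch below skips the stream (and therefore the BOM) and encodes the string directly with `Encoding.GetBytes`. The class and method names are made up for this example.

```csharp
using System;
using System.Text;

public static class EncodingDump
{
    // Hypothetical helper: hex dump of a string in the given encoding, without a BOM.
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string someText = char.ConvertFromUtf32(0x1D7D9);

        Console.WriteLine(Hex(Encoding.UTF8, someText));     // F0-9D-9F-99
        Console.WriteLine(Hex(Encoding.Unicode, someText));  // 35-D8-D9-DF (UTF-16LE: D835 DFD9)
        Console.WriteLine(Hex(Encoding.UTF32, someText));    // D9-D7-01-00 (UTF-32LE: 0001D7D9)
    }
}
```

The UTF-16 output is a surrogate pair (D835 DFD9), which is what you would expect from real UTF-16; UCS-2 has no way to express that code point.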

Biostatics answered 6/2, 2012 at 15:36 Comment(2)

Thanks for showing an easy approach. It seems indeed to be UTF-16 and not UCS-2 (anymore?). The character and all its encodings is here: fileformat.info/info/unicode/char/1d7d9/index.htm – Discontinuity 6/2, 2012 at 16:8

Btw, I read that reference but didn't find definitive information about what version was supported of Unicode. – Discontinuity 6/2, 2012 at 16:26

.NET Framework 4.6, 4.5, 4, 3.5 and 3.0 - The Unicode Standard, Version 5.0
.NET Framework 2.0 and 1.1 - The Unicode Standard, Version 3.1

The complete answers can be found here under the section Remarks.
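Since that page is about character categories, one way to probe what your own runtime knows is to ask for the category of a character that was added in a specific Unicode version. The sketch below uses U+0526, which (as mentioned in the comments on another answer) was introduced with Unicode 6; `CategoryOf` is a hypothetical wrapper, and the result depends on the framework and OS, so no output is shown.

```csharp
using System;
using System.Globalization;

public static class CategoryProbe
{
    // Hypothetical wrapper; the string overload also handles surrogate pairs at the given index.
    public static UnicodeCategory CategoryOf(string text, int index)
    {
        return CharUnicodeInfo.GetUnicodeCategory(text, index);
    }

    public static void Main()
    {
        // Added in Unicode 6.0: runtimes with older character data
        // are expected to report it as OtherNotAssigned.
        Console.WriteLine(CategoryOf("\u0526", 0));

        // A supplementary-plane character, stored as a surrogate pair.
        Console.WriteLine(CategoryOf("\U0001D7D9", 0));
    }
}
```

This only tells you about character categories, not about sorting or normalization, which may follow a different Unicode version on the same machine.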

Paynim answered 12/5, 2015 at 15:15 Comment(1)

See the edits I made to the original answer; the situation is not quite what that MSDN page seems to suggest. In fact, that page only talks about the Unicode character categories, which is not the same as character encoding or supported character ranges, and even the categories differ between versions of the framework and the underlying operating system. For more info, see the MSDN article on SortVersion (but be warned, even that page is not complete). – Discontinuity 12/5, 2015 at 23:34
