What version of Unicode is supported by which .NET platform and on which version of Windows in regards to character classes?

Updated question ¹

With regard to character classes, comparison, sorting, normalization and collations, which Unicode version or versions are supported by which .NET platforms?

Original question

I remember somewhat vaguely having read that .NET supported Unicode version 3.0 and that the internal UTF-16 encoding is not really UTF-16 but actually uses UCS-2, which is not the same. It seems, for instance, that characters above U+FFFF are not possible, i.e. consider:

string s = "\u1D7D9"; // ("Mathematical double-struck digit one") 

and it stores the string "ᵽ9".

I'm basically looking for definitive references of answers to the following:

  • If it isn't true UTF-16 in .NET, what is it?
  • What version of Unicode is supported by .NET?
  • If recent versions are not supported, and support is not planned for the near future, does anybody know of a (non-)commercial library, or how I can work around this issue?

¹) I updated the question because, with the passing of time, the new wording seems more appropriate to the answers and to the larger community. I left the original question in place; parts of it have been answered in the comments. Also, the old UCS-2 (no surrogates) was used in now-ancient 32-bit Windows versions, whereas .NET has always used UTF-16 (with surrogates) internally.

Discontinuity answered 6/2, 2012 at 15:4 Comment(6)
What exactly are you trying to do with those characters? Put them in a webpage with ASP.NET? Display them in a WPF or WinForms interface? - Felecia
What does "it doesn't seem to work" mean in this context? - Latticed
@JoeStrommen: we're implementing a new XML-based data transformation toolset, and I'm trying to find out whether I can say "we support Unicode up to 6.0" or whether we should say something else. In addition, I'm trying to find out how we could bypass possible limitations in .NET. - Discontinuity
@Gabe: I updated my question, hopefully it's clearer now. - Discontinuity
Oh, it looks like you were just using the wrong escape mechanism in C#; it has nothing to do with .NET. Your string was interpreted as "\u1D7D" + "9". You just need "\U0001D7D9". - Latticed
@Gabe: indeed, I wasn't aware of \U (never needed it before, I guess) and then wrongly concluded that there was no support for higher planes. - Discontinuity

Internally, .NET is UTF-16. In some cases, e.g. when ASP.NET writes to a response, by default it uses UTF-8. Both of them can handle higher planes.

The reason people sometimes refer to .NET as UCS-2 is, I think (I see few other reasons), that Char is strictly 16-bit, so a single Char can't represent a character from the upper planes. Char does, however, have static method overloads (e.g. Char.IsLetter(String, Int32)) that can operate on high-plane UTF-16 characters inside a string. Strings are stored as true UTF-16.

You can address high Unicode codepoints directly using uppercase \U - e.g. "\U0001D7D9" - but again, only inside strings, not chars.
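As a rough illustration of the points above, the following sketch shows how a code point above U+FFFF occupies two Char values in a string, and how the (String, Int32) overloads see the whole surrogate pair. The class and method names (SurrogateDemo, CodePointAt) are made up for this example; only the char and string members are real .NET APIs.

```csharp
using System;

public static class SurrogateDemo
{
    // U+1D7D9 MATHEMATICAL DOUBLE-STRUCK DIGIT ONE, written with the 8-digit \U escape.
    public const string DoubleStruckOne = "\U0001D7D9";

    // Hypothetical helper: returns the code point starting at index,
    // combining a surrogate pair into one value when there is one.
    public static int CodePointAt(string s, int index)
    {
        return char.ConvertToUtf32(s, index);
    }

    public static void Main()
    {
        string s = DoubleStruckOne;

        Console.WriteLine(s.Length);                          // 2 (UTF-16 code units, not characters)
        Console.WriteLine(char.IsHighSurrogate(s[0]));        // True
        Console.WriteLine(char.IsLowSurrogate(s[1]));         // True
        Console.WriteLine(CodePointAt(s, 0).ToString("X"));   // 1D7D9

        // The (string, int) overloads inspect the whole pair; the category
        // reported depends on the Unicode data shipped with the runtime.
        Console.WriteLine(char.GetUnicodeCategory(s, 0));

        // The 4-digit \u escape stops after four hex digits,
        // so this is U+1D7D followed by the character '9'.
        string mistaken = "\u1D7D9";
        Console.WriteLine(mistaken.Length);                   // 2
        Console.WriteLine(mistaken[1]);                       // 9
    }
}
```

Note that both strings have Length 2, which is why the mistake in the question is easy to miss: one is a surrogate pair, the other is two unrelated BMP characters.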

As for Unicode version, from the MSDN documentation:

"In the .NET Framework 4, sorting, casing, normalization, and Unicode character information is synchronized with Windows 7 and conforms to the Unicode 5.1 standard."

Update 1: It's worth noting, however, that this does not imply that the entirety of Unicode 5.1 is supported, either in Windows 7 or in .NET 4.0.

Windows 8 targets Unicode 6.0 - I'm guessing that .NET Framework 4.5 might synchronize with that, but have found no sources confirming it. And once again, that doesn't mean the entire standard is implemented.

Update 2: This note on Roslyn confirms that the underlying platform defines the Unicode support for the compiler, and in the link to the code it explains that C# 6.0 supports Unicode 6.0 and up (with a breaking change for C# identifiers as a result).

Update 3: Since .NET 4.5 there is a new class, SortVersion, which reports the Unicode version used for comparison and sorting through its FullVersion property (an instance property; a SortVersion object can be obtained from CompareInfo.Version). On the same page, Microsoft explains that .NET 4.0 supports Unicode 5.0 on all platforms and that .NET 4.5 supports Unicode 5.0 on Windows 7 and Unicode 6.0 on Windows 8. This contrasts slightly with the official "what is new" statement here, which talks of version 5.x and 6.0 respectively. From my own (editor: Abel) experience, in most cases it seems that .NET 4.0 supports Unicode 5.1 at least for character classes, but I didn't test sorting, normalization and collations. This seems in line with what is said in MSDN as quoted above.
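A minimal sketch of reading that value, assuming .NET 4.5 or later. FullVersion is an instance property, so the sketch obtains a SortVersion from CompareInfo.Version. The class and method names (SortVersionDemo, CurrentSortVersion) are hypothetical, and the printed numbers depend on the runtime and the operating system, so no output is shown.

```csharp
using System;
using System.Globalization;

public static class SortVersionDemo
{
    // Hypothetical helper: the sort version used by the invariant culture.
    public static SortVersion CurrentSortVersion()
    {
        return CultureInfo.InvariantCulture.CompareInfo.Version;
    }

    public static void Main()
    {
        SortVersion version = CurrentSortVersion();

        // FullVersion is an int whose meaning is platform-specific;
        // SortId is a Guid identifying the sort order.
        Console.WriteLine("FullVersion: 0x{0:X8}", version.FullVersion);
        Console.WriteLine("SortId:      {0}", version.SortId);
    }
}
```

Comparing two SortVersion values with Equals is the documented way to detect that stored, previously sorted data may need re-sorting after a platform upgrade.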

Cooperative answered 6/2, 2012 at 15:49 Comment(6)
Good observation about char. I notice indeed that char uni = "\U0002B740".ToCharArray()[0]; shows "55405", which is only one half of the UTF-16 surrogate pair. It follows from your reference that trying Char.IsLetter on \u0526 (incorrectly) shows false, because it was only introduced with Unicode 6. - Discontinuity
(Accepting this because you showed the reference I was looking for and was too stupid to find at its obvious location; however, the other answers are valuable in their own right.) - Discontinuity
This might be a helpful point of origin for getting information for single characters: MSDN link. Since char cannot contain more than one half of a pair, the StringInfo methods return a string instead, containing the complete UTF-16 pair (if the character is a pair; otherwise they return the single char as a string, or the character plus its combining characters for combining diacritics). - Cooperative
This makes much more sense now. The C# Language Spec considers char an unsigned 16-bit integral type. So it would seem that it was designed to have a fixed width, which would explain why a single char cannot hold a UTF-16 surrogate pair. - Prent
"Since .NET version 4.5 a new class SortVersion is introduced to get the supported Unicode version by calling the static property SortVersion.FullVersion" -- SortVersion.FullVersion isn't static. - Filia
For the mapping from .NET platforms to Unicode standards, see learn.microsoft.com/en-us/dotnet/api/… - Louisville

That character is supported. One thing to note is that for Unicode characters above U+FFFF (which need more than two bytes in UTF-16), you must write them with an uppercase '\U' followed by eight hex digits, like this:

string text = "\U0001D7D9";

If you create a WPF app with that character in a text block, it should render the double-struck digit one perfectly.

Felecia answered 6/2, 2012 at 15:42 Comment(1)
One more thing: read msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx for a description of how >2-byte chars are represented in a string. - Felecia

MSDN covers it briefly here: http://msdn.microsoft.com/en-us/library/9b1s4yhz(v=vs.90).aspx

I tried this:

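    using System;
    using System.IO;
    using System.Text;

    class Program {
        static void Main(string[] args) {
            // Build the string for U+1D7D9 (a surrogate pair in UTF-16).
            string someText = char.ConvertFromUtf32(0x1D7D9);
            using (var stream = new MemoryStream()) {
                // Swap Encoding.UTF32 for Encoding.UTF8 or Encoding.Unicode to compare.
                using (var writer = new StreamWriter(stream, Encoding.UTF32)) {
                    writer.Write(someText);
                    writer.Flush();
                }
                // ToArray still works after the writer has closed the stream.
                var bytes = stream.ToArray();
                foreach (var oneByte in bytes) {
                    Console.WriteLine(oneByte.ToString("x"));
                }
            }
        }
    }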

And got a dump of a byte array containing a correct BOM and the correct representation of the U+1D7D9 code point for each of these encodings:

  • UTF8
  • UTF32
  • Unicode (UTF-16)

So my guess is that higher planes are supported, and that UTF-16 is really UTF-16 (and not UCS-2).
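For reference, a shorter sketch of the same check that skips the stream and the BOM by calling Encoding.GetBytes directly. The names EncodingDemo and Hex are invented for this example. The expected bytes in the comments follow from the standard encoding rules for U+1D7D9 (surrogate pair D835 DFD9), with Encoding.Unicode and Encoding.UTF32 being little-endian.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: formats bytes as upper-case hex separated by hyphens.
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    public static void Main()
    {
        string s = char.ConvertFromUtf32(0x1D7D9);

        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(s)));     // F0-9D-9F-99
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(s)));  // 35-D8-D9-DF
        Console.WriteLine(Hex(Encoding.UTF32.GetBytes(s)));    // D9-D7-01-00

        // Round trip: decoding the UTF-16 bytes gives back the same string.
        string roundTrip = Encoding.Unicode.GetString(Encoding.Unicode.GetBytes(s));
        Console.WriteLine(roundTrip == s);                     // True
    }
}
```

A pure UCS-2 implementation would have no way to produce the four UTF-16 bytes above, since they encode a single code point as two 16-bit units.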

Biostatics answered 6/2, 2012 at 15:36 Comment(2)
Thanks for showing an easy approach. It seems indeed to be UTF-16 and not UCS-2 (anymore?). The character and all its encodings are here: fileformat.info/info/unicode/char/1d7d9/index.htm - Discontinuity
Btw, I read that reference but didn't find definitive information about which version of Unicode was supported. - Discontinuity

  • .NET Framework 4.6, 4.5, 4, 3.5 and 3.0: The Unicode Standard, Version 5.0
  • .NET Framework 2.0 and 1.1: The Unicode Standard, Version 3.1

The complete answers can be found here under the section Remarks.

Paynim answered 12/5, 2015 at 15:15 Comment(1)
See the edits I made to the original answer; the situation is not quite what that MSDN page seems to suggest. In fact, that page only talks about the Unicode character categories, which is not the same thing as character encoding or supported character ranges, and even those differ between versions of the framework and the underlying operating system. For more info, see the MSDN article on SortVersion (but be warned, even that page is not complete). - Discontinuity
