Why does string.Normalize behave differently depending on the context?
I have the following code:

string input = "ç";
string normalized = input.Normalize(NormalizationForm.FormD);
char[] chars = normalized.ToCharArray();

I build this code with Visual Studio 2010 and .NET 4, on 64-bit Windows 7.

I run it in a unit test project (platform: Any CPU) in several contexts and check the content of chars:

  • Visual Studio unit tests : chars contains { 231 }.
  • ReSharper : chars contains { 231 }.
  • NCrunch : chars contains { 99, 807 }.

In the MSDN documentation, I could not find any information describing different behaviors.

So, why do I get different behaviors? For me, the NCrunch behavior is the expected one, and I would expect the same from the others.

Edit: I switched back to .NET 3.5 and still have the same issue.
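One way to narrow a problem like this down, independent of the debugger tooltip, is to print the raw UTF-16 code units before and after normalization. This is a diagnostic sketch I am adding for illustration, not part of the original question:

```csharp
using System;
using System.Linq;
using System.Text;

class NormalizeDiagnostic
{
    static void Main()
    {
        string input = "ç";

        // Show the raw UTF-16 code units of the literal. If the compiler
        // misread the source file's encoding, this may not be the single
        // code point 231 (U+00E7) you expect.
        Console.WriteLine(string.Join(", ", input.Select(c => (int)c)));

        // Show the code units after canonical decomposition (FormD).
        string normalized = input.Normalize(NormalizationForm.FormD);
        Console.WriteLine(string.Join(", ", normalized.Select(c => (int)c)));
    }
}
```

If the first line already prints something other than a single 231, the literal was mangled before Normalize was ever called.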

Kulseth answered 10/5, 2012 at 7:52 Comment(10)
Hmm, I get { 99, 807 } with Visual Studio... This would imply there is something about the configuration of your project... Maybe.Dram
@zmilojko Thanks for your testing. I get the same results as yours in a blank new project. So I am checking the differences between the two projects (WinMerge on the csproj files), but have not found anything relevant yet, which was the reason for posting this question: to understand which context could induce a different behavior.Kulseth
What is Thread.CurrentThread.CurrentCulture in each case?Forlini
How do you 'check the content of chars'?Jeopardous
@AakashM, In all cases, Thread.CurrentThread.CurrentCulture is fr-FR. I also checked Thread.CurrentThread.CurrentUICulture which is en-US in all cases.Kulseth
@MattHickford, I gently move my mouse over the chars variable in the debugger, then unfold the + sign.Kulseth
@AakashM, I used the ç character in my example, but I get the same behavior with all of the French accented characters I have tested.Kulseth
If I had to guess, I'd say something strange is going on with the build configurations, causing an old version of the code to be run by ReSharper and Visual Studio, but one that NCrunch ignores. For example, a library set to build the Any CPU configuration, but the GUI set to x86.Octans
@PhilMartin, I am also suspicious about something like that. So, I cleaned it all (hopefully), rebuilt it, also tried it on another computer. Several times. Same result.Kulseth
@PhilMartin, However, I would be really interested in understanding which parameter makes string.Normalize behave differently.Kulseth

The String.Normalize(NormalizationForm) documentation says that the

binary representation is in the normalization form specified by the normalizationForm parameter.

which means you'd be using FormD normalization in all cases, so CurrentCulture and the like should not really matter.

The only thing I can think of that could change, then, is the "ç" character. That character is interpreted according to the character encoding that is assumed or configured for Visual Studio source code files. In short, I think NCrunch is assuming a different source file encoding than the others.

Based on a quick search of the NCrunch forum, there was a mention of a UTF-8 -> UTF-16 conversion, so I would check that.
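If source file encoding is indeed the culprit, one way to take it out of the equation entirely is to build the string from an explicit code point instead of a literal. A minimal sketch of that approach (my illustration, not from the answer itself):

```csharp
using System;
using System.Linq;
using System.Text;

class NormalizeWithoutLiteral
{
    static void Main()
    {
        // U+00E7 (LATIN SMALL LETTER C WITH CEDILLA) built from its code
        // point, so the source file's encoding cannot affect the result.
        string input = new string(new[] { (char)0x00E7 });

        string normalized = input.Normalize(NormalizationForm.FormD);

        // FormD canonically decomposes U+00E7 into 'c' (99) followed by
        // U+0327 COMBINING CEDILLA (807).
        Console.WriteLine(string.Join(", ", normalized.Select(c => (int)c))); // 99, 807
    }
}
```

With the literal removed, every runner should report { 99, 807 } after FormD normalization, whatever encoding it assumes for the source file.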

Eboat answered 14/5, 2012 at 11:51 Comment(1)
Indeed, I was strongly suspecting the encoding of the ç character in the source / runtime code. I started playing with the encoding of the source file, with no luck. Then, I tried to read the string from an external file, which failed until I forced its encoding to UTF-8. Finally, I updated my declaration of input to string input = new string(new[] { (char)231 });, and... it works!Kulseth
