Why does string.Normalize behave differently depending on the context?
I have the following code:

string input = "ç";
string normalized = input.Normalize(NormalizationForm.FormD);
char[] chars = normalized.ToCharArray();

I build this code with Visual Studio 2010 and .NET 4, on 64-bit Windows 7.

I run it in a unit test project (platform: Any CPU) in several contexts and check the content of chars:

  • Visual Studio unit tests : chars contains { 231 }.
  • ReSharper : chars contains { 231 }.
  • NCrunch : chars contains { 99, 807 }.

In the MSDN documentation, I could not find any information describing different behaviors.

So, why do I get different behaviors? For me, the NCrunch behavior is the expected one, and I would expect the same from the others.

Edit: I switched back to .NET 3.5 and still have the same issue.
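One way to narrow a problem like this down, independent of the debugger tooltip, is to print the raw UTF-16 code units before and after normalization. This is a diagnostic sketch I am adding for illustration, not part of the original question:

```csharp
using System;
using System.Linq;
using System.Text;

class NormalizeDiagnostic
{
    static void Main()
    {
        string input = "ç";

        // Show the raw UTF-16 code units of the literal. If the compiler
        // misread the source file's encoding, this may not be the single
        // code point 231 (U+00E7) you expect.
        Console.WriteLine(string.Join(", ", input.Select(c => (int)c)));

        // Show the code units after canonical decomposition (FormD).
        string normalized = input.Normalize(NormalizationForm.FormD);
        Console.WriteLine(string.Join(", ", normalized.Select(c => (int)c)));
    }
}
```

If the first line already prints something other than a single 231, the literal was mangled before Normalize was ever called.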

Kulseth answered 10/5, 2012 at 7:52 Comment(10)
Hmm, I get { 99, 807 } with Visual Studio... This would imply there is something about the configuration of your project... Maybe.Dram
@zmilojko Thanks for your testing. I get the same results as yours in a blank new project. So I am checking the differences between the two projects (WinMerge on the csproj files), but have not found anything relevant yet, which was the reason for posting this question: to understand which context could induce a different behavior.Kulseth
What is Thread.CurrentThread.CurrentCulture in each case?Forlini
How do you 'check the content of chars'?Jeopardous
@AakashM, In all cases, Thread.CurrentThread.CurrentCulture is fr-FR. I also checked Thread.CurrentThread.CurrentUICulture which is en-US in all cases.Kulseth
@MattHickford, I gently move my mouse over the chars variable in the debugger, then unfold the + sign.Kulseth
@AakashM, I used the ç character in my example, but I get the same behavior with all of the French accented characters I have tested.Kulseth
If I had to guess, I'd say something strange is going on with the build configurations, causing an old version of the code to be run by ReSharper and Visual Studio, but one that NCrunch ignores. For example, a library set to build the Any CPU configuration, but the GUI set to x86.Octans
@PhilMartin, I am also suspicious about something like that. So, I cleaned it all (hopefully), rebuilt it, also tried it on another computer. Several times. Same result.Kulseth
@PhilMartin, However, I would be really interested in understanding which parameter makes string.Normalize behave differently.Kulseth

The String.Normalize(NormalizationForm) documentation says that the

binary representation is in the normalization form specified by the normalizationForm parameter.

which means you'd be using FormD normalization in all cases, so CurrentCulture and the like should not really matter.

The only thing I can think of that could change, then, is the "ç" character. That character is interpreted according to the character encoding that is assumed or configured for Visual Studio source code files. In short, I think NCrunch is assuming a different source file encoding than the others.

Based on a quick search of the NCrunch forum, there was a mention of a UTF-8 -> UTF-16 conversion, so I would check that.
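If source file encoding is indeed the culprit, one way to take it out of the equation entirely is to build the string from an explicit code point instead of a literal. A minimal sketch of that approach (my illustration, not from the answer itself):

```csharp
using System;
using System.Linq;
using System.Text;

class NormalizeWithoutLiteral
{
    static void Main()
    {
        // U+00E7 (LATIN SMALL LETTER C WITH CEDILLA) built from its code
        // point, so the source file's encoding cannot affect the result.
        string input = new string(new[] { (char)0x00E7 });

        string normalized = input.Normalize(NormalizationForm.FormD);

        // FormD canonically decomposes U+00E7 into 'c' (99) followed by
        // U+0327 COMBINING CEDILLA (807).
        Console.WriteLine(string.Join(", ", normalized.Select(c => (int)c))); // 99, 807
    }
}
```

With the literal removed, every runner should report { 99, 807 } after FormD normalization, whatever encoding it assumes for the source file.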

Eboat answered 14/5, 2012 at 11:51 Comment(1)
Indeed, I was strongly suspecting the encoding of the ç character in the source / runtime code. I started playing with the encoding of the source file, with no luck. Then, I tried to read the string from an external file, which failed until I forced its encoding to UTF-8. Finally, I updated my declaration of input to string input = new string(new[] { (char)231 });, and... it works!Kulseth
