Strange results from IndexOf on German string

I have the string "Ärger,-Ökonom-i-Übermut-ẞ-ß", and when I call IndexOf("--") on it I get a result of 23. If I call Replace on the same string, nothing gets replaced.

I don't understand what is happening, so can someone please shed some light on this issue? The application culture is set to Croatian, not German, and the framework version is 3.5.

Changing culture to German (de-DE) doesn't change this strange behavior.

[Screenshot from the debugger]

Dorcus answered 13/2, 2012 at 12:40 Comment(13)
Is it correct that there is no "--" in the string? - Quetzalcoatl
I would say so. Somehow IndexOf is treating ẞ as a "-"; that is exactly the problem. - Dorcus
Sounds like a bug to me. I can reproduce the issue with .NET 3.5, but it returns -1 as expected with .NET 4.0. - Hummingbird
What happens if you explicitly set the culture info to de-DE? - Bield
@DennisTraub Doesn't fix the problem on my machine (.NET 3.5). - Hummingbird
I updated my question with the information that changing the thread culture to German doesn't fix the issue. - Dorcus
I'm afraid that U+1E9E is undefined according to .NET 3.5, because this character didn't exist in Unicode 4.0 (or whatever version of Unicode .NET 3.5 uses). It's a fairly new addition (uppercase version of German ß). So the IndexOf function ignores it. If you have any control over the text, you could change the character to ß or SS, whatever is more appropriate. Of course the better solution is to upgrade .NET to v4.0! - Fiat
@Mr Lister, OK, so maybe this is not a bug. I guess it depends on your point of view :) Please write an answer so I can accept it. - Dorcus
But LukeH already gave at least half the answer. You can also accept his. - Fiat
Well, I really think that your comment clarified this issue; the important thing here is that U+1E9E is undefined in .NET 3.5. - Dorcus
@MrLister I think the OP is right, you should write your comment as an answer so the OP can accept it. - Hummingbird
I pasted Mr Lister's comment as an answer and accepted it; I also marked it as Community Wiki. - Dorcus
Tag german removed as part of the 2012 cleanup. - Setsukosett

Since Mr Lister doesn't want his well-deserved upvotes, I will paste his comment here and accept this answer.

I'm afraid that U+1E9E is undefined according to .NET 3.5, because this character didn't exist in Unicode 4.0 (or whatever version of Unicode .NET 3.5 uses). It's a fairly new addition (uppercase version of German ß). So the IndexOf function ignores it. If you have any control over the text, you could change the character to ß or SS, whatever is more appropriate. Of course the better solution is to upgrade .NET to v4.0!
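
A minimal sketch of the first workaround mentioned above, assuming you control the text and can swap U+1E9E for ß or SS before searching. The class and method names are hypothetical and only for illustration. It relies on String.Replace being ordinal, so it matches the character by code point regardless of the framework's Unicode tables.

```csharp
using System;

public static class CapitalSharpSNormalizer
{
    // Hypothetical helper: swaps U+1E9E (capital sharp s) for a replacement
    // that older Unicode tables already define, such as "ß" or "SS".
    public static string ReplaceCapitalSharpS(string input, string replacement)
    {
        // String.Replace(string, string) performs an ordinal search,
        // so U+1E9E is matched exactly and never skipped.
        return input.Replace("\u1E9E", replacement);
    }

    public static void Main()
    {
        // String from the question, with U+1E9E written as an escape.
        string text = "Ärger,-Ökonom-i-Übermut-\u1E9E-ß";

        Console.WriteLine(ReplaceCapitalSharpS(text, "ß"));  // lowercase form
        Console.WriteLine(ReplaceCapitalSharpS(text, "SS")); // traditional uppercase spelling
    }
}
```

Which replacement is appropriate depends on the text; the sketch only shows the mechanics.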

Dorcus answered 13/2, 2012 at 12:40 Comment(0)

IndexOf uses the current culture if you don't tell it otherwise:

This method performs a word (case-sensitive and culture-sensitive) search using the current culture.

Replace uses an ordinal comparison:

This method performs an ordinal (case-sensitive and culture-insensitive) search to find oldValue.
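
A minimal sketch of the difference, using the string from the question. Passing StringComparison.Ordinal makes IndexOf compare UTF-16 code units, the same way Replace does, so the result no longer depends on the culture or on how the framework classifies a character. The class and variable names are illustrative only.

```csharp
using System;

public static class OrdinalSearchDemo
{
    public static void Main()
    {
        // String from the question; U+1E9E sits between two hyphens.
        string text = "Ärger,-Ökonom-i-Übermut-\u1E9E-ß";

        // Culture-sensitive search: the result depends on the framework
        // version and its Unicode tables (the question reports 23 on .NET 3.5).
        int cultureIndex = text.IndexOf("--");

        // Ordinal search: compares code units, so "--" is not found.
        int ordinalIndex = text.IndexOf("--", StringComparison.Ordinal);

        Console.WriteLine("culture-sensitive: " + cultureIndex);
        Console.WriteLine("ordinal: " + ordinalIndex);
    }
}
```

For something like a URL sanitizer, where the search is about exact characters rather than linguistic equivalence, the ordinal overload is usually the safer choice.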

Calder answered 13/2, 2012 at 13:11 Comment(8)
Did something change in this respect between .NET 3.5 and .NET 4.0? Because the code works as expected in .NET 4.0. - Rebak
@Darin: Not sure - that behaviour has been documented for as long as I can remember. I'm doing some tests now, but I can't replicate the OP's results in .NET 4 either. - Calder
Yes, but in .NET 3.5 the behavior can be reproduced. - Rebak
The string functions haven't changed, but the character classification tables were updated, so U+1E9E is defined now. - Fiat
@MrLister, very interesting. Could definitely lead to some very subtle bugs. - Rebak
@Mr Lister: I think you've hit the nail on the head there. Why not make it into an answer so that we can give you our upvotes? - Calder
@DarinDimitrov Sure, but only if you use those new characters in your text. And they are very, very rare! - Fiat
Indeed, it's a rare situation. For the record, we didn't get this by accident from our users; it came up in a test for our UrlSanatize method, where we test all the letters of a number of European languages. This uppercase ß was copied from the Wikipedia page on the German language; as I understand it, it's not widely used. - Dorcus
