IndexOf matching when Unicode 0xFFFD is in the string - bug or feature?

In VS2012's C# the following code:

string test = "[ " + (char)0xFFFD + " ]";
System.Console.WriteLine("{0}", test.IndexOf("  ") == 1);

prints

True

to the console output window. The two spaces in the string are separated by 0xFFFD, yet IndexOf reports a match for two consecutive spaces. Is that an expected result/feature or a (known) bug?

Histaminase answered 20/5, 2014 at 22:15 Comment(3)
Is it only unexpected because there are two spaces in your comparison string? - Postobit
You may want to use visible characters other than spaces for the demo. - Snick
Simplifying what you have, your question is why "[ \uFFFD ]" contains "  " (two spaces). - Ballot

It's an expected result. U+FFFD is the Unicode "replacement character" and is not meaningful in any culture. IndexOf(string) performs a culture-sensitive search by default, and such a search ignores any non-meaningful characters:

Character sets include ignorable characters, which are characters that are not considered when performing a linguistic or culture-sensitive comparison.
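
If you want the character to count, one option is to ask for an ordinal search explicitly. The following is only a minimal sketch: the class and the IndexOfDoubleSpace helper are made-up names for illustration, and the culture-sensitive result is deliberately left uncommented because it depends on the runtime's collation data.

```csharp
using System;

static class OrdinalIndexOfDemo
{
    // Hypothetical helper: index of the first two adjacent spaces, or -1 if none.
    public static int IndexOfDoubleSpace(string text)
    {
        // Ordinal compares UTF-16 code units one by one, so U+FFFD is never skipped.
        return text.IndexOf("  ", StringComparison.Ordinal);
    }

    static void Main()
    {
        string test = "[ " + (char)0xFFFD + " ]";

        // Culture-sensitive search, spelled out explicitly. This is what
        // IndexOf(string) does by default; the result depends on collation data.
        Console.WriteLine(test.IndexOf("  ", StringComparison.CurrentCulture));

        // Ordinal search: the two spaces are not adjacent, so nothing matches.
        Console.WriteLine(IndexOfDoubleSpace(test)); // -1
    }
}
```

Passing StringComparison.Ordinal is usually the safer default when searching for exact characters rather than comparing text linguistically.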

Bat answered 20/5, 2014 at 22:17 Comment(3)
I read that before posting my comment, but wasn't sure if FFFD was ignorable. Thanks for the great info! - Postobit
If it were in the ignorable set, wouldn't that be indicated here? - Ballot
@D Stanley: Thank you, I should have read the C# manual more carefully. This "strange behavior" of IndexOf(...) cost me quite some time isolating and fixing (by switching to Regex.Replace) a random endless loop like the following: while (test.IndexOf("  ")>=0) test = test.Replace("  "," "); - Histaminase
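
The loop in the last comment never ends because the two calls disagree: IndexOf(string) is culture-sensitive and reports a match across U+FFFD, while Replace(string, string) is ordinal and finds nothing to replace. A rough sketch of the same loop with both sides ordinal, assuming that collapsing runs of spaces is the goal (SpaceCollapser and CollapseSpaces are hypothetical names):

```csharp
using System;

static class SpaceCollapser
{
    // Hypothetical helper: collapses runs of spaces into a single space.
    public static string CollapseSpaces(string text)
    {
        // The ordinal search agrees with string.Replace(string, string), which is
        // also ordinal, so the loop stops once Replace has nothing left to change.
        while (text.IndexOf("  ", StringComparison.Ordinal) >= 0)
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }

    static void Main()
    {
        // Contains both a U+FFFD between single spaces and a real run of spaces.
        string test = "[ " + (char)0xFFFD + " ]   end";
        Console.WriteLine(CollapseSpaces(test));
    }
}
```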
