IndexOf and ordinal string comparisons

My problem is that String.IndexOf returns -1. I would expect it to return 0.

The parameters:

text = C:\\Users\\User\\Desktop\\Sync\\̼ (note the Combining Seagull Below character)

stringToTrim = C:\\Users\\User\\Desktop\\Sync\\

When I check for the index, using int index = text.IndexOf(stringToTrim);, the value of index is -1. I found that using an ordinal string comparison solved this problem of mine:

int index = text.IndexOf(stringToTrim, StringComparison.Ordinal);

Reading online, a lot of Unicode characters (like U+00B5 and U+03BC) map to the same symbol, so it would be a good idea to expand on this and normalize both strings:

int index = text.Normalize(NormalizationForm.FormKD).IndexOf(stringToTrim.Normalize(NormalizationForm.FormKD), StringComparison.Ordinal);

Is this the correct approach for finding the index at which one string contains all the sequential characters of another string? As I understand it, you normalize when you want equivalent symbols to match, and you skip normalization when you want to compare characters by their encoded values (so look-alike symbols remain distinct). Is that right?

Also, could someone please explain why int index = text.IndexOf(stringToTrim); did not find a match at the start of the string? In other words, what is it actually doing under the covers? I would have expected it to search the characters from the beginning of the string to the end.

Schoolfellow answered 15/12, 2014 at 20:32 Comment(7)
I copied / pasted this into LinqPad and got "0" back - maybe I don't understand combining characters.Cacoepy
@Cacoepy Try this: "C:\\Users\\User\\Desktop\\Sync\\̼".IndexOf("C:\\Users\\User\\Desktop\\Sync\\"); Make sure to copy this text entirely/exactly from here!Schoolfellow
(Thanks, that worked.) Then I agree with the top-rated answer below: either combining characters change the previous character (by combining with it), or you've found a weird bug that at least Microsoft warned you about.Cacoepy
@Cacoepy You may also find this character interesting: unicode-table.com/en/search/?q=U%2B202E (right-to-left override). Make no mistake: if you highlight what is shown as blank, paste this character somewhere, and start typing, the characters appear to the left instead of to the right, so "like this" becomes "siht ekil" as you type it out.Schoolfellow
@Cacoepy There are also various ways of exploiting this character. It's a bit off-topic for my question, but still something I love showing people: krebsonsecurity.com/2011/09/…Schoolfellow
Some characters don't play nice. ̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼̼Schoolfellow
Which is why it's a good idea to whitelist Unicode ranges (PS: Unicode is a moving target, so don't rely on a blacklist): jrgraphix.net/research/unicode_blocks.phpSchoolfellow

The behavior makes perfect sense to me. You are using a combining character, which combines with the preceding character. A culture-sensitive comparison treats the pair as a single, different character, one that won't match the '\\' character you've specified at the end of your search string. That prevents the entire string you're looking for from being found. If you had searched for "C:\\Users\\User\\Desktop\\Sync" instead, it would have been found.
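To see the "combined" character concretely, here is a minimal sketch (not from the original post) that counts text elements with System.Globalization.StringInfo. It assumes C# 9 or later for top-level statements, and the variable names are mine.

```csharp
using System;
using System.Globalization;

// U+033C is COMBINING SEAGULL BELOW, written as an escape so it survives copy/paste.
string tail = "\\\u033C"; // a backslash followed by the combining mark

// Length counts UTF-16 code units; LengthInTextElements counts
// user-perceived characters (a base character plus its combining marks).
int codeUnits = tail.Length;
int textElements = new StringInfo(tail).LengthInTextElements;

Console.WriteLine($"code units: {codeUnits}, text elements: {textElements}");
// prints "code units: 2, text elements: 1"
```

The two code units form one text element, which is why a linguistic search does not see a standalone backslash at the end of the path.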

Using StringComparison.Ordinal tells .NET to ignore the linguistic rules for characters and look only at their exact ordinal (code unit) values. This seems to do what you wanted, so yes…that's what you should do.
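A small sketch of the two overloads side by side, using the strings from the question (C# 9 top-level statements assumed; the variable names are mine). Only the ordinal result is annotated, because the culture-sensitive result depends on the runtime and its globalization settings.

```csharp
using System;

string prefix = @"C:\Users\User\Desktop\Sync\";
string text = prefix + "\u033C"; // append COMBINING SEAGULL BELOW

// Ordinal: compares UTF-16 code units one by one, so the prefix is found at index 0.
int ordinalIndex = text.IndexOf(prefix, StringComparison.Ordinal);

// Culture-sensitive: equivalent to text.IndexOf(prefix). The question reports -1 here;
// the value can differ between runtimes and globalization modes.
int cultureIndex = text.IndexOf(prefix, StringComparison.CurrentCulture);

Console.WriteLine($"Ordinal: {ordinalIndex}");        // prints "Ordinal: 0"
Console.WriteLine($"CurrentCulture: {cultureIndex}");
```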

The "correct approach" depends entirely on what behavior you want. A lot of string manipulation involves text being presented to or provided by the user and should be done in a culture-aware and Unicode-aware way. Other times, that isn't desirable. It's important to select the right approach for your needs.
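As one illustration of choosing the behavior you want, here is a sketch of the normalize-then-ordinal idea from the question, using the U+00B5 and U+03BC pair it mentions. The sample strings are made up, and C# 9 top-level statements are assumed.

```csharp
using System;
using System.Text;

string haystack = "5 \u00B5m"; // contains U+00B5 MICRO SIGN
string needle = "\u03BCm";     // starts with U+03BC GREEK SMALL LETTER MU

// Without normalization the code points differ, so an ordinal search finds nothing.
int rawIndex = haystack.IndexOf(needle, StringComparison.Ordinal);

// FormKD applies compatibility decomposition, which maps U+00B5 to U+03BC.
int normalizedIndex = haystack.Normalize(NormalizationForm.FormKD)
    .IndexOf(needle.Normalize(NormalizationForm.FormKD), StringComparison.Ordinal);

Console.WriteLine($"raw: {rawIndex}, normalized: {normalizedIndex}");
// prints "raw: -1, normalized: 2"
```

One caveat with this approach: normalization can change the length of a string, so an index found in the normalized text does not necessarily point at the same position in the original text.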

Totalizer answered 15/12, 2014 at 20:48 Comment(0)

Yes, you should use StringComparison.Ordinal to guarantee that the culture is ignored when comparing values. This is especially necessary for strings that are considered culture-invariant "by default", which includes file paths.

When you don't use StringComparison.Ordinal, it is possible to introduce subtle bugs: http://msdn.microsoft.com/en-us/library/dd465121(v=vs.110).aspx

When culturally independent string data, such as XML tags, HTML tags, user names, file paths, and the names of system objects, are interpreted as if they were culture-sensitive, application code can be subject to subtle bugs, poor performance, and, in some cases, security issues.

A side benefit of StringComparison.Ordinal is better performance: http://msdn.microsoft.com/en-us/library/ms973919.aspx
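For the file path case in the question, a sketch of trimming a leading directory with an ordinal check might look like this. TrimPrefix is a hypothetical helper of mine, not a framework method, and C# 9 top-level statements are assumed.

```csharp
using System;

// Hypothetical helper: removes prefix from the start of text when it matches
// code unit for code unit; otherwise returns text unchanged.
static string TrimPrefix(string text, string prefix)
{
    return text.StartsWith(prefix, StringComparison.Ordinal)
        ? text.Substring(prefix.Length)
        : text;
}

string root = @"C:\Users\User\Desktop\Sync\";
string remainder = TrimPrefix(root + "\u033C" + "notes.txt", root);

// The combining mark stays with the remainder instead of blocking the match.
Console.WriteLine(remainder.Length); // prints "10"
```

If the paths should match regardless of letter case, as is typical on Windows, StringComparison.OrdinalIgnoreCase may be the better fit.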

Enameling answered 15/12, 2014 at 20:48 Comment(0)
