Could string comparisons really differ based on culture when the string is guaranteed not to change?

I'm reading encrypted credentials/connection strings from a config file. Resharper tells me, "String.IndexOf(string) is culture-specific here" on this line:

if (line.Contains("host=")) {
    _host = line.Substring(line.IndexOf(
        "host=") + "host=".Length, line.Length - "host=".Length);

...and so wants to change it to:

if (line.Contains("host=")) {
    _host = line.Substring(line.IndexOf(
        "host=", System.StringComparison.Ordinal) + "host=".Length, line.Length - "host=".Length);

The value I'm reading will always be "host=" regardless of where the app may be deployed. Is it really sensible to add this "System.StringComparison.Ordinal" bit?

More importantly, could it hurt anything (to use it)?

Die asked 7/6, 2012 at 23:46 Comment(0)

Absolutely. Per MSDN (http://msdn.microsoft.com/en-us/library/d93tkzah.aspx),

This method performs a word (case-sensitive and culture-sensitive) search using the current culture.

So you may get different results if you run it under a different culture (via regional and language settings in Control Panel).

In this particular case, you probably won't have a problem, but put an i in the search string, make the comparison case-insensitive, and run it under the Turkish culture, and it will probably ruin your day.

See MSDN: http://msdn.microsoft.com/en-us/library/ms973919.aspx

These new recommendations and APIs exist to alleviate misguided assumptions about the behavior of default string APIs. The canonical example of bugs emerging where non-linguistic string data is interpreted linguistically is the "Turkish-I" problem.

For nearly all Latin alphabets, including U.S. English, the character i (\u0069) is the lowercase version of the character I (\u0049). This casing rule quickly becomes the default for someone programming in such a culture. However, in Turkish ("tr-TR"), there exists a capital "i with a dot," character (\u0130), which is the capital version of i. Similarly, in Turkish, there is a lowercase "i without a dot," or (\u0131), which capitalizes to I. This behavior occurs in the Azeri culture ("az") as well.

Therefore, assumptions normally made about capitalizing i or lowercasing I are not valid among all cultures. If the default overloads for string comparison routines are used, they will be subject to variance between cultures. For non-linguistic data, as in the following example, this can produce undesired results:

Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
Console.WriteLine("Culture = {0}",
   Thread.CurrentThread.CurrentCulture.DisplayName);
Console.WriteLine("(file == FILE) = {0}", 
   (String.Compare("file", "FILE", true) == 0));

Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
Console.WriteLine("Culture = {0}",
   Thread.CurrentThread.CurrentCulture.DisplayName);
Console.WriteLine("(file == FILE) = {0}", 
   (String.Compare("file", "FILE", true) == 0));

Because of the difference of the comparison of I, results of the comparisons change when the thread culture is changed. This is the output:

Culture = English (United States)
(file == FILE) = True
Culture = Turkish (Turkey)
(file == FILE) = False
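
For non-linguistic data like this, an ordinal comparison avoids the culture dependence. The following is a minimal sketch, not part of the MSDN article; the class and method names (TurkishIDemo, SameIdentifier) are hypothetical and exist only for illustration.

```csharp
using System;

static class TurkishIDemo
{
    // Hypothetical helper: compares two non-linguistic identifiers
    // (file names, config keys, ...) ignoring case. Ordinal comparisons
    // do not consult the current thread culture.
    public static bool SameIdentifier(string a, string b)
    {
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }
}
```

Calling TurkishIDemo.SameIdentifier("file", "FILE") should return true under both "en-US" and "tr-TR", because StringComparison.OrdinalIgnoreCase never applies the Turkish casing rules.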

Here is an example without case:

var s1 = "\u00E9";  // é as one precomposed character (U+00E9, ALT+0233)
var s2 = "e\u0301"; // 'e' plus combining acute accent U+0301 (two characters)

Console.WriteLine(s1.IndexOf(s2, StringComparison.Ordinal)); //-1
Console.WriteLine(s1.IndexOf(s2, StringComparison.InvariantCulture)); //0
Console.WriteLine(s1.IndexOf(s2, StringComparison.CurrentCulture)); //0
Tavia answered 8/6, 2012 at 0:3 Comment(3)
Why does IndexOf have anything to do with case? Microsoft is mixing everything up in the usual bloated way that they love. Their mistake here is to always assume the most complex case first, and leave us to choose the low-level behavior in a hugely verbose fashion. - Secant
Fine, forget about case. There are other examples if you go outside of English. For example, e + combining accent vs. é. They are different in ordinal, but the same linguistically (see edit). Guess what, language is hard. - Tavia
Awesome explanation and great examples. - Savick

CA1309: UseOrdinalStringComparison

It doesn't hurt to not use it, but "by explicitly setting the parameter to either the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase, your code often gains speed, increases correctness, and becomes more reliable."


What exactly is Ordinal, and why does it matter to your case?

An operation that uses ordinal sort rules performs a comparison based on the numeric value (Unicode code point) of each Char in the string. An ordinal comparison is fast but culture-insensitive. When you use ordinal sort rules to sort strings that start with Unicode characters (U+), the string U+xxxx comes before the string U+yyyy if the value of xxxx is numerically less than yyyy.

And, as you stated... the string value you are reading in is not culture sensitive, so it makes sense to use an Ordinal comparison as opposed to a Word comparison. Just remember, Ordinal means "this isn't culture sensitive".
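
Applied to the code in the question, that could look roughly like the sketch below. The helper name ExtractHost and the class ConfigLine are hypothetical, and the sketch assumes the host value runs to the end of the line. It also uses the single-argument Substring, because the two-argument call in the question would throw an ArgumentOutOfRangeException whenever "host=" does not start at index 0.

```csharp
using System;

static class ConfigLine
{
    // Hypothetical helper: returns everything after "host=" in the line,
    // or null if the key is not present.
    public static string ExtractHost(string line)
    {
        const string key = "host=";

        // Ordinal search: compares UTF-16 code units, independent of culture.
        int start = line.IndexOf(key, StringComparison.Ordinal);
        if (start < 0)
        {
            return null;
        }

        // Assumption: the value extends to the end of the line.
        return line.Substring(start + key.Length);
    }
}
```

For example, ConfigLine.ExtractHost("host=db.example.com") returns "db.example.com", and a line without the key returns null. Doing the search once also removes the need for the separate Contains check.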

Lorsung answered 7/6, 2012 at 23:51 Comment(0)

To answer your specific question: No, but a static analysis tool is not going to be able to realize that your input value will never have locale-specific information in it.

Aneroid answered 7/6, 2012 at 23:55 Comment(1)
Moreover, sometimes I don't realize it until later... :-) - Pereira
