Strange string.IndexOf behaviour

I wrote the following snippet to get rid of excessive spaces in slabs of text:

int index = text.IndexOf("  ");
while (index > 0)
{
    text = text.Replace("  ", " ");
    index = text.IndexOf("  ");
}

Generally this works fine, although it is rather primitive and possibly inefficient.

Problem

When the text contains " - ", for some bizarre reason IndexOf returns a match! The Replace function doesn't remove anything, and the code is then stuck in an endless loop.

Any ideas what is going on with the string.IndexOf?

Ommatophore answered 4/2, 2011 at 0:5 Comment(3)
Regex alternative replace function: nlakkakula.wordpress.com/2008/09/16/… - Ommatophore
I tried this and it seems to be working, with 2 and 3 spaces and hyphens. This is my string: string text = "A B C - More Stuff - , hey look Working"; Can you post your string? - Lorraine
Posting the string here is unlikely to work, since SO will replace the problematic character with a more common one. Try this: open CharMap, find the Soft Hyphen character (it is located right next to the R-in-circle Registered Trademark character), copy it, paste it into your code, and then try again. - Erratum

Ah, the joys of text.

What you most likely have there, but got lost when posting on SO, is a "soft hyphen".

To reproduce the problem, I tried this code in LINQPad:

void Main()
{
    var text = "Test1 \u00ad Test2";
    int index = text.IndexOf("  ");
    while (index > 0)
    {
        text = text.Replace("  ", " ");
        index = text.IndexOf("  ");
    }
}

And sure enough, the above code just gets stuck in a loop.

Note that \u00ad is the Unicode code point for the Soft Hyphen (U+00AD), according to CharMap. You can also copy and paste the character from CharMap, but posting it here on SO will replace it with its much more common cousin, the Hyphen-Minus, U+002D (the one on your keyboard).
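
If you suspect an invisible character like this in your own text, one way to check is to print the code units of the string. The following is only a minimal sketch; the class and method names (CodePointDump, Describe) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Formats every UTF-16 code unit of the input as U+XXXX so that
    // invisible characters such as the soft hyphen show up in the output.
    // (Hypothetical helper, named here for illustration only.)
    public static string Describe(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

For example, CodePointDump.Describe(" \u00ad ") returns "U+0020 U+00AD U+0020", which makes the soft hyphen between the two spaces easy to spot.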

You can read a small section in the documentation for the String Class which has this to say on the subject:

String search methods, such as String.StartsWith and String.IndexOf, also can perform culture-sensitive or ordinal string comparisons. The following example illustrates the differences between ordinal and culture-sensitive comparisons using the IndexOf method. A culture-sensitive search in which the current culture is English (United States) considers the substring "oe" to match the ligature "œ". Because a soft hyphen (U+00AD) is a zero-width character, the search treats the soft hyphen as equivalent to Empty and finds a match at the beginning of the string. An ordinal search, on the other hand, does not find a match in either case.

The relevant part is the sentence about the soft hyphen. I also remember a blog post about this exact problem from a while back, but my Google-Fu is failing me tonight.

The problem here is that IndexOf and Replace use different methods for locating the text.

IndexOf performs a culture-sensitive search by default, treats the soft hyphen as "not really there", and thus sees the spaces on each side of it as two adjacent spaces. Replace performs an ordinal search, does not see two adjacent spaces, and thus removes neither of them. The loop condition therefore stays true, but since Replace never changes the text, the loop never ends. Undoubtedly there are other characters in Unicode that cause similar problems, but this is the most typical case I've seen.
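
The mismatch can be seen by running the three operations side by side. This is a rough sketch rather than a definitive test; the class and member names are invented for illustration, and the result of the culture-sensitive search depends on the runtime and the current culture, so no value is claimed for it here.

```csharp
using System;

public static class SearchMismatch
{
    // Two spaces separated by a soft hyphen (U+00AD); the string never
    // contains two directly adjacent spaces.
    public const string Sample = "Test1 \u00ad Test2";

    // Culture-sensitive search, which is what IndexOf(string) does by default.
    // The result depends on the runtime and the current culture.
    public static int CultureSensitiveIndex(string text)
    {
        return text.IndexOf("  ", StringComparison.CurrentCulture);
    }

    // Ordinal search: compares the raw UTF-16 code units one by one.
    public static int OrdinalIndex(string text)
    {
        return text.IndexOf("  ", StringComparison.Ordinal);
    }

    // Replace(string, string) searches ordinally, so it only changes the
    // text when two spaces are really adjacent.
    public static bool ReplaceChangesText(string text)
    {
        return text.Replace("  ", " ") != text;
    }
}
```

With the sample string, OrdinalIndex returns -1 and ReplaceChangesText returns false. Whenever CultureSensitiveIndex returns a positive index for the same input, the original loop has a condition that holds and a body that changes nothing.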

There are at least two ways of handling this:

  1. You can use Regex.Replace, which seems to not have this problem:

    text = Regex.Replace(text, "  +", " ");
    

    Personally I would probably use the whitespace special character in the Regular Expression, which is \s, but if you only want spaces, the above should do the trick.

  2. You can explicitly ask IndexOf to use an ordinal comparison, which won't get tripped up by text behaving like ... well ... text:

    index = text.IndexOf("  ", StringComparison.Ordinal);
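
Applied to the loop from the question, the second option might look like the sketch below. Both IndexOf calls need the ordinal comparison, not just one of them. The class and method names are hypothetical, and the condition uses >= 0 instead of the original > 0 on the assumption that a match at the very start of the text should be collapsed as well.

```csharp
using System;

public static class SpaceCollapser
{
    // Collapses runs of spaces into a single space. Both the search and the
    // replacement are ordinal, so they always agree on whether a match exists.
    // (Hypothetical helper, named here for illustration only.)
    public static string CollapseSpaces(string text)
    {
        int index = text.IndexOf("  ", StringComparison.Ordinal);
        while (index >= 0)
        {
            // Each pass roughly halves every run of spaces.
            text = text.Replace("  ", " ");
            index = text.IndexOf("  ", StringComparison.Ordinal);
        }
        return text;
    }
}
```

With this version, "Test1 \u00ad Test2" is returned unchanged instead of looping forever, because the ordinal search never reports the two spaces around the soft hyphen as adjacent.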
    
Erratum answered 4/2, 2011 at 0:11 Comment(4)
Woah! I learned something today. Are there other situations where IndexOf behavior is different from the ordinal comparison? - Sprage
Well, anywhere you can specify a CultureInfo object, it always pays off to check what the default is if you use an overload which doesn't take that parameter, and text is just one of those areas where the payoff is usually the difference between working and not working. - Erratum
Soft hyphen? Wow, what a gotcha. Thanks for the great answer! - Ommatophore
Note that a soft hyphen is usually visible in source code, but should usually not be present in displayed text. Its purpose is to mark a place where a word can be broken. You could add it inside a long word at the points where breaking the word is allowed, knowing that displaying that word won't carry along with it lots of small "minus signs." - Erratum