Why does a space preceding a non-combining diacritic function differently when using IndexOf(string) and IndexOf(char)?
I am taking a substring from a string that contains non-combining diacritics following a space. I first check the string with .Contains() and then call .Substring(). When I pass a space char to .IndexOf(), the program behaves as expected, but when I pass the string " " to .IndexOf(), the program throws an exception. As the samples below show, only the string search for a space that precedes the primary stress mark (U+02C8) fails: .IndexOf(" ") returns -1, so .Substring() throws an ArgumentOutOfRangeException.

Simple code (Edit suggested by John):

string a = "aɪ prɪˈzɛnt";
string b = "maɪ ˈprɛznt";

// A            
Console.WriteLine(a.IndexOf(" ")); // string index:  2
Console.WriteLine(a.IndexOf(' ')); // char index:    2

// B    
Console.WriteLine(b.IndexOf(" ")); // string index: -1
Console.WriteLine(b.IndexOf(' ')); // char index:    3

Sample code I tested with:

        const string iPresent = "aɪ prɪˈzɛnt",
                     myPresent = "maɪ ˈprɛznt";

        if(iPresent.Contains(' '))
        {
            Console.WriteLine(iPresent.Substring(0, iPresent.IndexOf(' ')));
        }

        if(iPresent.Contains(" "[0]))
        {
            Console.WriteLine(iPresent.Substring(0, iPresent.IndexOf(" "[0])));
        }

        if(iPresent.Contains(" "))
        {
            Console.WriteLine(iPresent.Substring(0, iPresent.IndexOf(" ")));
        }

        if(iPresent.Contains(string.Empty + ' '))
        {
            Console.WriteLine(iPresent.Substring(0, iPresent.IndexOf(string.Empty + ' ')));
        }

        if (myPresent.Contains(' '))
        {
            Console.WriteLine(myPresent.Substring(0, myPresent.IndexOf(' ')));
        }

        if (myPresent.Contains(" "[0]))
        {
            Console.WriteLine(myPresent.Substring(0, myPresent.IndexOf(" "[0])));
        }

        if (myPresent.Contains(string.Empty + ' '))
        {
            try
            {
                Console.WriteLine(myPresent.Substring(0, myPresent.IndexOf(string.Empty + ' ')));
            }
            catch (Exception ex)
            {
                Console.WriteLine("***" + ex.Message);
            }
        }

        if (myPresent.Contains(" "))
        {
            try
            {
                Console.WriteLine(myPresent.Substring(0, myPresent.IndexOf(" ")));
            }
            catch (Exception ex)
            {
                Console.WriteLine("***" + ex.Message);
            }
        }
Vivavivace answered 30/6, 2020 at 23:39 Comment(6)
Why not provide an example like this? It seems much simpler to digest than all of the code you've provided. - Uterus
Which line throws the exception? I can't reproduce the problem as you describe it, and judging by the exception I'd say the problem is not where you think it is. - Semanteme
I guess I was too deep in the weeds to think about showing just the index, and too wrapped up in the substring. I also wanted to show the different things I tested. - Vivavivace
@Semanteme The only two lines within the try/catch blocks. Check the Simple code edit that John suggested. - Vivavivace
Not sure why this happens, but passing StringComparison.Ordinal to the failing IndexOf calls fixes it. - Rokach
Just tried it on .NET Core and cannot reproduce, so it is not a culture thing; possibly a bug (?) - Wits

IndexOf(string) does something different from IndexOf(char), because IndexOf(char)...

...performs an ordinal (culture-insensitive) search, where a character is considered equivalent to another character only if their Unicode scalar values are the same.

whereas IndexOf(string)...

performs a word (case-sensitive and culture-sensitive) search using the current culture.

So it's a whole lot "smarter" than IndexOf(char) because it takes into account the string comparison rules of the current culture. This is why it doesn't find the space character.
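To see the two search modes side by side, here is a minimal sketch that runs all four variants against the string from the question. The class name SearchModes is only illustrative. The two ordinal results are fixed by the string itself; the two culture-sensitive results depend on the runtime and platform, so no value is claimed for them here.

```csharp
using System;

class SearchModes
{
    static void Main()
    {
        string b = "maɪ ˈprɛznt";

        // Culture-sensitive: IndexOf(string) uses the current culture by default,
        // so these two calls perform the same search.
        int implicitCulture = b.IndexOf(" ");
        int explicitCulture = b.IndexOf(" ", StringComparison.CurrentCulture);

        // Ordinal: IndexOf(char) and IndexOf(string, StringComparison.Ordinal)
        // compare UTF-16 code units only, so both find the space at index 3.
        int charSearch = b.IndexOf(' ');
        int ordinalSearch = b.IndexOf(" ", StringComparison.Ordinal);

        Console.WriteLine($"culture (implicit): {implicitCulture}");
        Console.WriteLine($"culture (explicit): {explicitCulture}");
        Console.WriteLine($"char search:        {charSearch}");
        Console.WriteLine($"ordinal string:     {ordinalSearch}");
    }
}
```

If the first two numbers differ from the last two on your machine, you are seeing the culture-sensitive behavior described in the question.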

After some testing on other platforms and in other languages, I suspect this is a bug in .NET Framework. On .NET Core 3.1, b.IndexOf(" ") doesn't return -1, and neither does b.IndexOf(' ', StringComparison.CurrentCulture). Other languages/platforms where a culture-sensitive search finds the space in "maɪ ˈprɛznt" include:

  • Mono 6
  • Swift 5

Passing in StringComparison.Ordinal works:

b.IndexOf(" ", StringComparison.Ordinal)

But do note that you lose the smartness of culture-sensitive comparison.
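Applied to the substring task from the question, the ordinal search could be wrapped as in the sketch below. PhoneticText and FirstWord are hypothetical names, not part of any library. The helper also guards against a -1 result, so Substring is never called with a negative length.

```csharp
using System;

static class PhoneticText
{
    // Hypothetical helper (not from the original post): returns the text before
    // the first space, or the whole string when no space is found.
    public static string FirstWord(string text)
    {
        // Ordinal search compares UTF-16 code units; culture rules are not consulted.
        int index = text.IndexOf(" ", StringComparison.Ordinal);
        return index < 0 ? text : text.Substring(0, index);
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(PhoneticText.FirstWord("aɪ prɪˈzɛnt"));
        Console.WriteLine(PhoneticText.FirstWord("maɪ ˈprɛznt"));
    }
}
```

FirstWord returns "aɪ" and "maɪ" for the two sample strings. Whether the console renders the IPA characters correctly depends on its output encoding.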

Wits answered 1/7, 2020 at 0:6 Comment(0)
