string.IndexOf() not recognizing modified characters

When I use IndexOf to find a char that is followed by a char with a large value (e.g. char 700, U+02BC, which is ʼ), IndexOf fails to recognize the char I am looking for.

e.g.

string find = "abcʼabcabc";   
int index = find.IndexOf("c");

In this code, index should be 2, but it returns 6.

Is there a way to get around this?

Conservative answered 21/10, 2013 at 13:49 Comment(1)

This works int index = find.IndexOf("c", StringComparison.Ordinal); – Amey 21/10, 2013 at 13:54

The cʼ sequence is being treated linguistically as a single unit that differs from a plain c. Use the Ordinal string comparison to force a comparison of the raw char values (UTF-16 code units) instead.

        string find = "abcʼabcabc";

        int index = find.IndexOf("c", StringComparison.Ordinal);
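To see what an ordinal search actually compares, here is a minimal sketch as a complete console program. The class name `OrdinalSearchDemo` is made up for illustration, and the string is written with a `\u02BC` escape so the modifier apostrophe cannot be lost in copy and paste.

```csharp
using System;

class OrdinalSearchDemo
{
    static void Main()
    {
        // "abc" + U+02BC (MODIFIER LETTER APOSTROPHE, decimal 700) + "abcabc"
        string find = "abc\u02BCabcabc";

        // Print every UTF-16 code unit with its index.
        // These numeric values are all that an ordinal search looks at.
        for (int i = 0; i < find.Length; i++)
        {
            Console.WriteLine($"{i}: U+{(int)find[i]:X4}");
        }

        // Ordinal comparison matches code units one by one,
        // so the 'c' at index 2 is found even though U+02BC follows it.
        int index = find.IndexOf("c", StringComparison.Ordinal);
        Console.WriteLine(index); // prints 2
    }
}
```

The loop shows U+0063 at index 2 and U+02BC at index 3, which is why the ordinal search stops at 2.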

Strobel answered 21/10, 2013 at 13:53 Comment(5)

+1 For clarification, the Ordinal sort rule works because it performs a comparison based on the numeric value (Unicode code point) of each Char in the string - see the docs, exactly what the OP is asking for. – Hendley 21/10, 2013 at 13:56

Or just use find.IndexOf('c') instead of supplying a string. – Mariano 21/10, 2013 at 13:57

is StringComparison.Ordinal going to make the code much slower? – Conservative 21/10, 2013 at 14:0

No. But if you are worried then do tests. Premature optimisation is the root of some evil. – Strobel 21/10, 2013 at 14:1

@Conservative no, in fact it should technically be faster. – Hendley 21/10, 2013 at 14:1

Unicode letter 700 is a modifier apostrophe: in other words, it modifies the letter c. In the same way, if you were to use an 'e' followed by character 769 (0x301), it would not really be an 'e' anymore: the e has been modified to be e with an acute accent. To wit: é. You'll see that letter is actually two characters: copy it to notepad and hit backspace (neat, huh?).

You need to do an "Ordinal" comparison (char by char, on the numeric values) without any linguistic comparison. That will find the 'c' and ignore the linguistic fact that it is modified by the next character. In my 'e' example, the chars are (101)(769), so if you go char by char looking for 101, you will find it, and that ignores the fact that (101)(769) is linguistically the same as (233): é. If you search for (233) linguistically, it will find the "equivalent" (101)(769):

string find = "abe\u0301abcabc";      // 'e' (101) followed by the combining acute accent (769)
int index = find.IndexOf("\u00E9");   // searching for é (233) gives you 2, even though the match in "find" is two chars and the IndexOf argument is one

Hopefully that's not too confusing. If you're doing this in real code, you should explain in comments exactly what you're doing. As in my 'e' example, you would generally want linguistic equivalence for user data and ordinal equivalence for things like constants (which hopefully wouldn't differ like this, lest your successor hunt you down with an axe).
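As a rough illustration of the 'e' example, the sketch below uses only ordinal searches plus `String.Normalize`, so the results do not depend on the current culture. The class name `CombiningAccentDemo` is made up, and the sketch assumes the runtime has Unicode normalization available (it may not in a globalization-invariant configuration).

```csharp
using System;
using System.Text;

class CombiningAccentDemo
{
    static void Main()
    {
        // 'e' (U+0065, decimal 101) followed by COMBINING ACUTE ACCENT (U+0301, decimal 769)
        string decomposed = "abe\u0301abcabc";
        // The single precomposed char é (U+00E9, decimal 233)
        string precomposed = "\u00E9";

        // The accented letter occupies two chars in the decomposed string.
        Console.WriteLine(decomposed.Length); // prints 10

        // An ordinal search sees the bare 'e' and ignores the accent that follows it.
        Console.WriteLine(decomposed.IndexOf("e", StringComparison.Ordinal)); // prints 2

        // An ordinal search for the precomposed char finds nothing: no char equals 233.
        Console.WriteLine(decomposed.IndexOf(precomposed, StringComparison.Ordinal)); // prints -1

        // Normalizing to form C composes 'e' + U+0301 into U+00E9,
        // after which the ordinal search succeeds.
        string composed = decomposed.Normalize(NormalizationForm.FormC);
        Console.WriteLine(composed.Length); // prints 9
        Console.WriteLine(composed.IndexOf(precomposed, StringComparison.Ordinal)); // prints 2
    }
}
```

Normalizing both sides before an ordinal comparison is one way to get canonical equivalence without a full linguistic comparison. Whether that suits user-entered text depends on the application.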

Aquifer answered 21/10, 2013 at 13:56 Comment(2)

Or you can tell indexOf to look for characters: int index = find.IndexOf('c'); – Beker 21/10, 2013 at 13:59

Ah, good point. That'd be an evil trick question on an interview: what's the difference between IndexOf('c') and IndexOf("c")? – Aquifer 21/10, 2013 at 14:3
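To make the difference between the two overloads concrete, here is a small sketch (the class name `CharVersusStringDemo` is made up). `IndexOf(char)` always compares ordinally, while `IndexOf(string)` without a `StringComparison` argument uses the current culture. The culture-sensitive result is deliberately left without an expected value, since it can vary with the runtime and its globalization library.

```csharp
using System;

class CharVersusStringDemo
{
    static void Main()
    {
        // "abc" + U+02BC (MODIFIER LETTER APOSTROPHE) + "abcabc"
        string find = "abc\u02BCabcabc";

        // char overload: ordinal, compares numeric char values only.
        int byChar = find.IndexOf('c');

        // string overload with an explicit ordinal comparison: same behaviour as the char overload.
        int byStringOrdinal = find.IndexOf("c", StringComparison.Ordinal);

        // string overload with an explicit culture-sensitive comparison;
        // this is what IndexOf("c") does when no comparison is given.
        int byStringCulture = find.IndexOf("c", StringComparison.CurrentCulture);

        Console.WriteLine(byChar);          // prints 2
        Console.WriteLine(byStringOrdinal); // prints 2
        Console.WriteLine(byStringCulture); // runtime-dependent; the question reports 6
    }
}
```

Passing the `StringComparison` explicitly in every string search keeps the intent visible to the next reader.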
