Assuming Unicode and case-insensitivity, should the pattern ".." match "FfIsS"?

It sounds like a joke, but I can sort of prove it.

Assumptions:

  • Dot matches any single character.
  • A case-insensitive pattern matches s if and only if it matches s.toUpperCase().

All of the following is pretty logical and holds in Java:

  • "ffi".matches(".") LATIN SMALL LIGATURE FFI (U+FB03) is a character, so it must match
  • "ß".matches(".") LATIN SMALL LETTER SHARP S (U+00DF) is a character, so it must match
  • "ffi".toUpperCase().equals("FFI") by the Unicode standard (there's no capital ligature FFI)
  • "ß".toUpperCase().equals("SS") by the Unicode standard (there's a capital sharp S, but it doesn't get used)
  • "FfI".toUpperCase().equals("FFI") obviously
  • "sS".toUpperCase.equals("SS") obviously

So assuming the first dot in the regex stands for ﬃ and the second for ß, the regex must match "FFISS" and, because of the case insensitivity, also "FfIsS".
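
For reference, each building block is easy to check directly; here is a minimal, self-contained sketch (the results noted in the comments are what I see on a current Oracle/OpenJDK, so treat them as illustrative rather than guaranteed for every version):

import java.util.regex.Pattern;

class SharpSPuzzle {
    public static void main(String[] args) {
        String ffi = "\uFB03"; // ﬃ LATIN SMALL LIGATURE FFI
        String sz  = "\u00DF"; // ß LATIN SMALL LETTER SHARP S

        System.out.println(ffi.matches("."));  // true: a single code point, so dot matches
        System.out.println(sz.matches("."));   // true
        System.out.println(ffi.toUpperCase()); // FFI (full uppercase mapping)
        System.out.println(sz.toUpperCase());  // SS

        Pattern dots = Pattern.compile("..",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        System.out.println(dots.matcher(ffi + sz).matches()); // true
        System.out.println(dots.matcher("FfIsS").matches());  // false, despite the proof above
    }
}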

I really do hope there's something wrong, otherwise regexes would get pretty unusable.

The questions:

  • What's wrong with my "proof"?
  • What exactly does "case insensitive" mean if my second assumption doesn't hold?
Volva answered 2/10, 2013 at 10:57 Comment(15)
but you didn't prove that "ß".toUpperCase() matches to dot too :)Mutation
@Darka: That's why OP is asking "What exactly does 'case insensitive' mean if my second assumption doesn't hold?". Clearly his second assumption does not hold, so I guess we can define case insensitivity as considering some predefined character sets to be equivalent to their "uppercase" counterparts. For example, the character set [a-z] is equivalent to the character set [A-Z] when matching is case-insensitive. This is also true for some other character sets (say, accented letters), but apparently not for ligatures.Superstratum
Also relevant in the quirks of uppercasing: Turkish dotless and dotted iKauai
Your penultimate bullet point is false. Ffi in upper-case is FFI.Kavanagh
What does the specification say about case insensitivity of matches()?Stearn
@justhalf: Yes.... AFAIK initially it worked for the ASCII letters only, but then Pattern.UNICODE_CASE has been introduced.Volva
There you go, that should do you for a while. :)Spires
"ffi".matches(".") seems to me that this shouldn't match; just because it's a single code point doesn't mean it's a single character.Whose
@Whose Um, yes it does. Except in certain stratospheric compliance levels (like level 3), a . matches a single code point — not a code unit, but a code point. And in languages which support the highly recommended \X notation (including ICU and Perl), \X matches a grapheme. You aren’t going to come up with an idea of a “character” that isn’t one of those things. If you want to run it through NFKD first, then that might be more what you are thinking of.Spires
@Spires But I'm saying it shouldn't match a code point. It should match a character. For example, the pattern f.i should match 'ﬃ'. Nor should this require explicit normalization. "You aren’t going to come up with an idea of a “character” that isn’t one of those things." I just did: 'ﬃ' is a ligature of three characters.Whose
@Spires It looks like level 2 Unicode support specifies the correct thing (by normalizing text implicitly behind the scenes). Now, are there any regex engines that provide this?Whose
@Whose No, behind-the-scenes normalization was stricken from the standard because it cannot be made to work. But you still are confused about character-vs-code-point. Those are not different terms.Spires
@Spires Character and code point absolutely are not the same. A character can be made up of many code points (e.g., 'ą́'), or a code point can be more than one character (e.g., 'ﬃ'). User Perceived CharacterWhose
@Whose Sigh. No. A code point cannot be more than one “character”, but a grapheme can certainly be more than one code point. Please stop confusing people with this character nonsense.Spires
@Spires I just gave you an example of a character that is not (and which in Unicode cannot be) a single code point, and an example of a code point that is three characters. I also even linked to Unicode's glossary for character.Whose

On Case Folding

The answer is no, dot will not match ss case insensitively, although the reasons are slightly esoteric.

However, your puzzle has often been raised by some of those most in the know about such things, because they too feel it leads to contradictions.

There are two forms of case mapping in Unicode: simple and full. Under simple case mapping, one code point only ever maps to exactly one other code point. So if length(s) == 1, then you are guaranteed that length(fc(s)) == 1 as well, where fc is the Unicode foldcase map; the same holds for the uc, tc, and lc case mappings.

The problem with that is that you do not get as good results analysing certain kinds of real-world text when you make those sorts of exact 1:1 length guarantees.
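
Java itself exposes both flavours, which makes the difference easy to see: Character.toUpperCase is (to a first approximation) the simple 1:1 mapping, while String.toUpperCase is the full mapping. A small illustration using ß, with the expected output in the comments:

import java.util.Locale;

class SimpleVsFullCasing {
    public static void main(String[] args) {
        // Simple mapping: one code point in, one code point out.
        // ß has no single-code-point uppercase here, so it stays ß.
        System.out.println(Character.toUpperCase('ß'));            // ß

        // Full mapping: the result may grow in length.
        System.out.println("ß".toUpperCase(Locale.ROOT));          // SS
        System.out.println("ß".toUpperCase(Locale.ROOT).length()); // 2
    }
}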

In fact, there are quite a few of these. The numbers indicate how many individual BMP code points map to the specified lengths under the four case maps:

length lc == 2          1
length lc == 3          0

length fc == 2         88
length fc == 3         16

length uc == 2         86
length uc == 3         16

length tc == 2         32
length tc == 3         16
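
If you want to reproduce something like the uc row without the unichars tool from the Bibliography below, a brute-force scan of the BMP in Java will do; the exact counts depend on the Unicode version your JDK ships with:

import java.util.Locale;

class UppercaseExpansionCount {
    public static void main(String[] args) {
        int two = 0, three = 0;
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (Character.isSurrogate((char) cp)) continue; // skip the surrogate range
            int len = new String(Character.toChars(cp))
                          .toUpperCase(Locale.ROOT).length();
            if (len == 2) two++;
            if (len == 3) three++;
        }
        System.out.println("length uc == 2: " + two);   // on the order of the 86 above
        System.out.println("length uc == 3: " + three); // on the order of the 16 above
    }
}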

In full casing, not the simple casing that Java’s regex uses, you can indeed get things like tschüß and TSCHÜSS to match, even though they are of unequal lengths. Perl and Ruby use full case mapping when doing case insensitive comparisons. This leads to strange paradoxes in negated character classes if you aren’t careful though.

But here’s the rub: case-insensitive matching is not a transitive operation. In other words, if . matches ß, and under case-insensitive matching ß matches SS, that does not mean that via transitivity . matches SS case-insensitively. It just does not work that way, although smarter people than me have thought deeply upon the matter.

However, both these code points:

  • U+00DF ß LATIN SMALL LETTER SHARP S
  • U+1E9E ẞ LATIN CAPITAL LETTER SHARP S

do certainly case-insensitively match not only each other but also SS, Ss, sS, and ss under full case mapping. They just don’t do so under simple case mapping.

Unicode does make some guarantees about this. One is that if length(s) == n, then length(fn(s)) <= 3*n, where fn is any of the four case maps: lc, fc, uc, and tc.

On Normalization

If you think that’s bad, it actually gets a good deal worse when you consider normalization forms. Here the guarantee is 5× not 3×. So length(NFx(s)) <= 5 * length(s), which as you see is getting expensive.

Here is the equivalent table showing how many code points expand to more than one under each of the four normalization forms:

length NFC  == 2        70
length NFC  == 3         2
length NFC  == 4         0
length NFC  == 5         0

length NFKC == 2       686
length NFKC == 3       404
length NFKC == 4        53
length NFKC == 5        15

length NFD  == 2       762
length NFD  == 3       220
length NFD  == 4        36
length NFD  == 5         0

length NFKD == 2      1345
length NFKD == 3       642
length NFKD == 4       109
length NFKD == 5        16
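
The ligature from the question is a handy illustration: the canonical forms leave it alone, but the compatibility forms decompose it into three code points. A quick check with java.text.Normalizer, expected output in the comments:

import java.text.Normalizer;

class LigatureNormalization {
    public static void main(String[] args) {
        String ffi = "\uFB03"; // ﬃ LATIN SMALL LIGATURE FFI

        System.out.println(Normalizer.normalize(ffi, Normalizer.Form.NFC).length());  // 1
        System.out.println(Normalizer.normalize(ffi, Normalizer.Form.NFD).length());  // 1
        System.out.println(Normalizer.normalize(ffi, Normalizer.Form.NFKC));          // ffi
        System.out.println(Normalizer.normalize(ffi, Normalizer.Form.NFKD));          // ffi
    }
}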

Isn’t that remarkable? For a while, Unicode wanted to try to build canonical equivalence into its pattern matching. They knew it was expensive for the reasons just stated, but it took them a while to figure out that it was fundamentally impossible due to the necessary canonical reordering of combining characters within one grapheme unit.

For this reason, and many others, the current recommendation, if you want to compare things “case-insensitively” or “normalization-insensitively”, is to run both sides through the transform yourself and then compare the results.

For example, given a suitable == code-point–by–code-point equivalence operator

fc(a) == fc(b)

and similarly for a =~ pattern matching operator (which works in the traditional way, of course, not like Java’s broken matches method that inappropriately anchors things):

fc(a) =~ fc(b)
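
The JDK has no public full case folding API, but ICU4J does, so the fold-both-sides comparison can be sketched like this (assuming ICU4J is on the classpath; expected output in the comments):

import com.ibm.icu.lang.UCharacter;

class FoldBothSides {
    // fc(a) == fc(b), using ICU4J's default full case folding.
    static boolean equalsCaseInsensitively(String a, String b) {
        return UCharacter.foldCase(a, true).equals(UCharacter.foldCase(b, true));
    }

    public static void main(String[] args) {
        System.out.println(equalsCaseInsensitively("ß", "SS"));           // true
        System.out.println(equalsCaseInsensitively("tschüß", "TSCHÜSS")); // true
        System.out.println(UCharacter.foldCase("Maße", true));            // masse
    }
}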

The problem with that is that you can no longer turn case insensitivity on or off in particular parts of a pattern, such as

/aaa(?i:xxx)bbb/

and have only the xxx part be done case insensitively.

Full casing is hard, but it can (in most circumstances) be done, as Perl and Ruby have proven. But it is also rather non-intuitive (read: surprising) in places, and you should understand where. You have to do special things with bracketed character classes, especially with their negations, or it leads to nonsense.

Locale Matching

Finally, to make matters truly complicated, in the real world you have to do more than either or both of case mapping and normalization. In certain national locales, things are more complicated still. For example, in a German phonebook, a vowel with an umlaut counts exactly the same as that same base vowel followed by the letter e. So there, something like müß would be expected to match MUESS case-insensitively.

To do all this right, you really need to tie in not just to the full case mapping and normalization tables, but also to the DUCET itself (the Default Unicode Collation Element Table) and even the CLDR data (see the Bibliography):

#!/usr/bin/perl
use utf8;
use open qw(:utf8 :std);
use Unicode::Collate::Locale;

# German phonebook tailoring: an umlauted vowel sorts like the base vowel
# followed by e. Level 1 compares primary strength only, so case and
# accents are ignored.
my $Collator = Unicode::Collate::Locale->new(
    locale        => "de__phonebook",
    level         => 1,
    normalization => undef,
);

my $full = "Ich müß Perl studieren.";
my $sub  = "MUESS";

# index() does a collation-aware substring search, returning the position
# and length of the match within the original string.
if (my ($pos,$len) = $Collator->index($full, $sub)) {
    my $match = substr($full, $pos, $len);
    print "Found match of literal ‹$sub› at position $pos in ‹$full› as ‹$match›\n";
}

If you run that, you will discover that it indeed works:

Found match of literal ‹MUESS› at position 4 in ‹Ich müß Perl studieren.› as ‹müß›


Selected Bibliography

Most of these examples were taken from the 4th edition of Programming Perl by kind permission of its author. :) I write quite a bit about such Unicode matters there, stuff that is not specific to Perl but general to Unicode overall.

The unichars(1) program, which allows me to gather statistics like these:

$ unichars 'length fc == 2' | wc -l
      88

$ unichars 'length NFKD == 4' | wc -l
     109

$ unichars '/ss/i'
U+00DF ß  LATIN SMALL LETTER SHARP S
U+1E9E ẞ  LATIN CAPITAL LETTER SHARP S

is part of the Unicode::Tussle CPAN module suite that Brian Foy has been kind enough to maintain for me.



Spires answered 2/10, 2013 at 13:41 Comment(1)
Wow, impressive answer as always.Cordite

As maaartinus pointed out in his comment, Java provides (at least in theory) Unicode support for case-insensitive reg-exp matching. The wording in the Java API documentation is that matching is done "in a manner consistent with the Unicode Standard". The problem, however, is that the Unicode standard defines different levels of support for case conversion and case-insensitive matching, and the API documentation does not specify which level is supported by the Java language.

Although it is not documented, at least in Oracle's Java VM the reg-exp implementation is limited to so-called simple case-insensitive matching. The limiting factors relevant to your example data are that the matching algorithm only works as expected if the case folding (conversion) results in the same number of characters, and that sets (e.g. ".") are limited to matching exactly one character in the input string. The first limitation even leads to "ß" not matching "SS", as you may also have expected.
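
One way to see that limitation directly (an illustrative snippet; the results in the comments are what I would expect from Oracle's JDK, and since the behaviour is undocumented they are not guaranteed):

import java.util.regex.Pattern;

class SimpleCaseFoldingDemo {
    static boolean match(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                      .matcher(input)
                      .matches();
    }

    public static void main(String[] args) {
        // Simple (1:1) folding works: ß and ẞ fold to the same code point.
        System.out.println(match("\u1E9E", "ß")); // true
        // Length-changing (full) folding does not: ß never matches SS.
        System.out.println(match("SS", "ß"));     // false
        System.out.println(match("ß", "SS"));     // false
    }
}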

To get support for full case-insensitive matching between string literals, you can use the reg-exp implementation in the ICU4J library, so that at least "ß" and "SS" match. AFAIK, there are, however, no reg-exp implementations for Java with full case-folding support for groups, sets, and wild cards.

Went answered 2/10, 2013 at 12:37 Comment(0)
