Why is upper casing not enough for case-insensitive comparison?

To compare two strings case insensitively, one correct way is to case fold them first. How is this better than upper casing or lower casing?

I've found examples online where lower casing doesn't work. For example, "σ" and "ς" (two lowercase forms of "Σ") don't become the same string when converted to lower case. But I haven't found an explanation of why case folding is better than mapping to upper case. Is there a case where two strings that should match case insensitively don't upper case to the same string?
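For reference, the σ/ς behaviour can be checked with a minimal Python sketch (Python is my choice here, the question is language-neutral). It relies only on the built-in `str.lower`, `str.upper` and `str.casefold`; the helper name `same_after` is made up for illustration.

```python
# The two lowercase sigmas and the capital sigma from the question.
SIGMA = "\u03c3"          # σ GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"    # ς GREEK SMALL LETTER FINAL SIGMA
CAPITAL_SIGMA = "\u03a3"  # Σ GREEK CAPITAL LETTER SIGMA


def same_after(transform, a, b):
    """Return True if a and b are equal once `transform` is applied to both."""
    return transform(a) == transform(b)


# Lowercasing leaves each lowercase form unchanged, so they stay different.
print(same_after(str.lower, SIGMA, FINAL_SIGMA))     # False
# Uppercasing and case folding both bring the two forms together.
print(same_after(str.upper, SIGMA, FINAL_SIGMA))     # True
print(same_after(str.casefold, SIGMA, FINAL_SIGMA))  # True
```

So for this particular pair upper casing happens to work, which is exactly why it is not a useful counterexample.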

Another scenario is when I want to store a case insensitive index. The recommended way seems to be case folding and then normalizing. What are its advantages over storing the string mapped to upper case and normalized? The specs say mapping to upper case is not guaranteed to be stable across versions of Unicode while case folding is. But are there any cases where mapping to upper case gives a different string in an earlier version of Unicode?
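To make the index scenario concrete, here is a rough sketch of a key function in Python. It assumes the "fold, then normalize" recipe is the Unicode canonical caseless matching definition, NFD(toCasefold(NFD(X))), which also normalizes before folding; the name `index_key` is hypothetical.

```python
import unicodedata


def index_key(text):
    """Build a case-insensitive index key: NFD, then case fold, then NFD again.

    The second normalization is needed because case folding can produce
    text that is no longer in normalized form.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


# Strings that should land on the same index entry.
print(index_key("Stra\u00dfe") == index_key("STRASSE"))  # True
print(index_key("\u00c5") == index_key("a\u030a"))       # True
```

An upper-case-based key would replace `casefold()` with `upper()` here, and the question is whether the two ever disagree.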

Grave answered 15/4, 2021 at 10:29 Comment(2)
Your example works only for Greek. In other languages you may have the opposite case. But in Unicode you should also treat a single precomposed accented character as equivalent to the base character followed by the combining accent (and there are many more special cases). In short: if you need to compare strings, you should normalize them and then, if needed, put them in a common case. And BTW, it is language dependent, so it may be better to use a Unicode library instead of checking every possible case yourself. – Nonmaterial
Unicode is compatible with every older version except 1.0 (you may get new fields and new recommended algorithms). This is by design. But new characters are added, so an older version may just skip unknown characters, while a newer version may see that there is an upper case. – Nonmaterial

As per Unicode stability policy, case mappings are only stable for case pairs, i.e. pairs of characters X and Y where X is the full uppercase mapping of Y, and Y is the full lowercase mapping of X. Only when both these characters exist with these properties is the casing relation between them set in stone.

However, Unicode contains many “incomplete” case pairs where only the lowercase form has been encoded and the uppercase form is missing completely. This is usually the case for letters used in transcription systems that are traditionally lowercase-only. Should capital forms be discovered and subsequently added to Unicode, these letters would then receive a new uppercase mapping.

The most recent characters this has happened to are “ʂ” (from Unicode 1.1), “ᶎ” (from Unicode 4.1), and “ꞔ” (from Unicode 7.0), which all got brand new uppercase forms (Ʂ, Ᶎ, and Ꞔ respectively) in Unicode 12.0 (2019).
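Whether a given runtime already knows these new capitals depends on the Unicode data it ships. A small Python sketch, assuming `str.upper` and `str.casefold` use the same data version that `unicodedata.unidata_version` reports (the names `LATE_CAPITALS` and `unicode_version` are mine):

```python
import unicodedata

# Lowercase letters paired with the capitals they received in Unicode 12.0.
LATE_CAPITALS = {
    "\u0282": "\ua7c5",  # ʂ -> Ʂ
    "\u1d8e": "\ua7c6",  # ᶎ -> Ᶎ
    "\ua794": "\ua7c4",  # ꞔ -> Ꞔ
}


def unicode_version():
    """Return the runtime's Unicode data version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


print("Unicode data version:", unicodedata.unidata_version)
for small in LATE_CAPITALS:
    # The uppercase result depends on the data version; the fold does not.
    print(f"U+{ord(small):04X}: upper -> U+{ord(small.upper()):04X}, "
          f"casefold -> U+{ord(small.casefold()):04X}")
```

With data older than 12.0 the uppercase column should show the letter unchanged, while the case fold column is the same either way, so a stored upper-cased key can go stale where a folded key does not.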

Because case mappings do not have to be unique, this makes uppercasing a poor substitute for proper case-folding. For example, both U+0434 (д) and U+1C81 (ᲁ) uppercase to U+0414 (Д), but only the former is locked into a case pair by virtue of being U+0414’s full lowercase mapping. If someone were to find a dedicated capital letter version of U+1C81 in some old manuscript, it would be given a new uppercase mapping, resulting in U+0434 and U+1C81 suddenly no longer comparing equal under that operation.
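The difference between "uppercases to the same letter" and "is a case pair" can be checked with a short Python sketch. It assumes Unicode data that includes U+1C81 (Unicode 9.0 or later), and `is_case_pair` is a hypothetical helper that simply restates the definition above.

```python
DE = "\u0434"              # д CYRILLIC SMALL LETTER DE
LONG_LEGGED_DE = "\u1c81"  # ᲁ CYRILLIC SMALL LETTER LONG-LEGGED DE
CAPITAL_DE = "\u0414"      # Д CYRILLIC CAPITAL LETTER DE


def is_case_pair(upper, lower):
    """True if each character is the other's full case mapping."""
    return lower.upper() == upper and upper.lower() == lower


# Both small letters currently uppercase to the same capital...
print(DE.upper() == LONG_LEGGED_DE.upper() == CAPITAL_DE)  # True
# ...but only one of them forms a case pair with it.
print(is_case_pair(CAPITAL_DE, DE))              # True
print(is_case_pair(CAPITAL_DE, LONG_LEGGED_DE))  # False
# Case folding maps both to U+0434, and that mapping is covered by stability.
print(DE.casefold() == LONG_LEGGED_DE.casefold())  # True
```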

EDIT: I have just remembered a current example of uppercasing not being sufficient for case-insensitive matching: U+1E9E (ẞ) is already a capital letter and thus uppercases to itself. Its lowercase counterpart is U+00DF (ß), but the uppercase mapping of U+00DF is the sequence <U+0053, U+0053> (SS).

uppercase("ẞ") ≠ uppercase(lowercase("ẞ"))
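The same inequality, written as a quick Python check using the built-in string methods (Python's `str.upper` applies the full mapping, so "ß" becomes "SS"; the function names are only illustrative):

```python
CAPITAL_SHARP_S = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S
SMALL_SHARP_S = "\u00df"    # ß LATIN SMALL LETTER SHARP S


def equal_by_upper(a, b):
    """Compare two strings after uppercasing both."""
    return a.upper() == b.upper()


def equal_by_fold(a, b):
    """Compare two strings after case folding both."""
    return a.casefold() == b.casefold()


print(CAPITAL_SHARP_S.upper() == CAPITAL_SHARP_S)      # True, already a capital
print(SMALL_SHARP_S.upper())                           # SS
print(equal_by_upper(CAPITAL_SHARP_S, SMALL_SHARP_S))  # False
print(equal_by_fold(CAPITAL_SHARP_S, SMALL_SHARP_S))   # True, both fold to "ss"
```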
Maddy answered 15/4, 2021 at 12:32 Comment(8)
U+0434 and U+1C81 may have different upper case mappings in the future. Are there any in the current version? – Grave
I am not aware of any characters that have diverged in their uppercase mappings since case pair stability came into effect. I will update my answer if I find an example. – Maddy
I did remember a pair of characters for which case-insensitive matching via uppercasing would fail. The answer has been amended. – Maddy
Your example is an instance where Unicode is wrong. There should be a case pair mapping between ß and ẞ. – Helicoid
@flyingsheep The official uppercase of “ß” remains “SS”; “ẞ” has just become an acceptable alternative. At the time when capital sharp S was added to Unicode, German orthography did not yet recognise it as a real letter, so “ß” and “ẞ” weren’t made into a case pair. And now stability policy prevents this from ever happening in the future. – Maddy
It doesn’t; that policy only applies to simple case pairs, i.e. ones unlike this one. – Helicoid
@flyingsheep “If two characters do not form a case pair in a version of Unicode, they will never become a case pair in any subsequent version of Unicode.” (unicode.org/policies/stability_policy.html#Case_Pair) – Maddy
Yes, I know. That’s the simple case pairs thing I was talking about. However, there also exists the full (language-aware) mapping specification (which e.g. includes the ß->SS uppercasing rule). The rules here may change. unicode-org.github.io/icu/userguide/transforms/… – Helicoid

I found a list from here.

As of Unicode 13.0.0.

Equivalence classes that have more than 1 uppercase mapping.

| case fold | original | upper case |
| --- | --- | --- |
| k 006B LATIN SMALL LETTER K | K 004B LATIN CAPITAL LETTER K | K 004B LATIN CAPITAL LETTER K |
| | k 006B LATIN SMALL LETTER K | K 004B LATIN CAPITAL LETTER K |
| | K 212A KELVIN SIGN | K 212A KELVIN SIGN |
| ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S | ß 00DF LATIN SMALL LETTER SHARP S | SS 0053 LATIN CAPITAL LETTER S; 0053 LATIN CAPITAL LETTER S |
| | ẞ 1E9E LATIN CAPITAL LETTER SHARP S | ẞ 1E9E LATIN CAPITAL LETTER SHARP S |
| å 00E5 LATIN SMALL LETTER A WITH RING ABOVE | Å 00C5 LATIN CAPITAL LETTER A WITH RING ABOVE | Å 00C5 LATIN CAPITAL LETTER A WITH RING ABOVE |
| | å 00E5 LATIN SMALL LETTER A WITH RING ABOVE | Å 00C5 LATIN CAPITAL LETTER A WITH RING ABOVE |
| | Å 212B ANGSTROM SIGN | Å 212B ANGSTROM SIGN |
| θ 03B8 GREEK SMALL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA |
| | θ 03B8 GREEK SMALL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA |
| | ϑ 03D1 GREEK THETA SYMBOL | Θ 0398 GREEK CAPITAL LETTER THETA |
| | ϴ 03F4 GREEK CAPITAL THETA SYMBOL | ϴ 03F4 GREEK CAPITAL THETA SYMBOL |
| ω 03C9 GREEK SMALL LETTER OMEGA | Ω 03A9 GREEK CAPITAL LETTER OMEGA | Ω 03A9 GREEK CAPITAL LETTER OMEGA |
| | ω 03C9 GREEK SMALL LETTER OMEGA | Ω 03A9 GREEK CAPITAL LETTER OMEGA |
| | Ω 2126 OHM SIGN | Ω 2126 OHM SIGN |
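A list like this can be regenerated for whatever Unicode version a runtime ships. The following Python sketch is my own reconstruction of the idea, not the script behind the linked list: it groups single code points by their case fold and keeps the groups whose members disagree after a given transform. The name `split_classes` is hypothetical, and multi-character strings are not scanned.

```python
import sys
from collections import defaultdict


def split_classes(transform=str.upper):
    """Map each case-fold key to the distinct `transform` results of its
    members, keeping only keys whose members give more than one result."""
    results = defaultdict(set)
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        folded = ch.casefold()
        mapped = transform(ch)
        if folded == ch and mapped == ch:
            continue  # unaffected by folding and by the transform
        results[folded].add(mapped)
    # A single-character key that folds to itself belongs to its own class,
    # even if the loop above skipped it, so count its result as well.
    for key in list(results):
        if len(key) == 1 and key.casefold() == key:
            results[key].add(transform(key))
    return {key: values for key, values in results.items() if len(values) > 1}


upper_splits = split_classes(str.upper)
for key in sorted(upper_splits):
    print(repr(key), sorted(upper_splits[key]))
```

Passing `str.lower` instead of `str.upper` gives the corresponding classes for lowercasing. The exact output depends on the Unicode data of the Python build, so it may differ from a list made for Unicode 13.0.0.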

And for lowercasing.

| case fold | original | lower case |
| --- | --- | --- |
| s 0073 LATIN SMALL LETTER S | S 0053 LATIN CAPITAL LETTER S | s 0073 LATIN SMALL LETTER S |
| | s 0073 LATIN SMALL LETTER S | s 0073 LATIN SMALL LETTER S |
| | ſ 017F LATIN SMALL LETTER LONG S | ſ 017F LATIN SMALL LETTER LONG S |
| st 0073 LATIN SMALL LETTER S; 0074 LATIN SMALL LETTER T | ſt FB05 LATIN SMALL LIGATURE LONG S T | ſt FB05 LATIN SMALL LIGATURE LONG S T |
| | st FB06 LATIN SMALL LIGATURE ST | st FB06 LATIN SMALL LIGATURE ST |
| β 03B2 GREEK SMALL LETTER BETA | Β 0392 GREEK CAPITAL LETTER BETA | β 03B2 GREEK SMALL LETTER BETA |
| | β 03B2 GREEK SMALL LETTER BETA | β 03B2 GREEK SMALL LETTER BETA |
| | ϐ 03D0 GREEK BETA SYMBOL | ϐ 03D0 GREEK BETA SYMBOL |
| ε 03B5 GREEK SMALL LETTER EPSILON | Ε 0395 GREEK CAPITAL LETTER EPSILON | ε 03B5 GREEK SMALL LETTER EPSILON |
| | ε 03B5 GREEK SMALL LETTER EPSILON | ε 03B5 GREEK SMALL LETTER EPSILON |
| | ϵ 03F5 GREEK LUNATE EPSILON SYMBOL | ϵ 03F5 GREEK LUNATE EPSILON SYMBOL |
| θ 03B8 GREEK SMALL LETTER THETA | Θ 0398 GREEK CAPITAL LETTER THETA | θ 03B8 GREEK SMALL LETTER THETA |
| | θ 03B8 GREEK SMALL LETTER THETA | θ 03B8 GREEK SMALL LETTER THETA |
| | ϑ 03D1 GREEK THETA SYMBOL | ϑ 03D1 GREEK THETA SYMBOL |
| | ϴ 03F4 GREEK CAPITAL THETA SYMBOL | θ 03B8 GREEK SMALL LETTER THETA |
| ι 03B9 GREEK SMALL LETTER IOTA | ◌ͅ 0345 COMBINING GREEK YPOGEGRAMMENI | ◌ͅ 0345 COMBINING GREEK YPOGEGRAMMENI |
| | Ι 0399 GREEK CAPITAL LETTER IOTA | ι 03B9 GREEK SMALL LETTER IOTA |
| | ι 03B9 GREEK SMALL LETTER IOTA | ι 03B9 GREEK SMALL LETTER IOTA |
| | ι 1FBE GREEK PROSGEGRAMMENI | ι 1FBE GREEK PROSGEGRAMMENI |
| ΐ 03B9 GREEK SMALL LETTER IOTA; 0308 COMBINING DIAERESIS; 0301 COMBINING ACUTE ACCENT | ΐ 0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS | ΐ 0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS |
| | ΐ 1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA | ΐ 1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA |
| κ 03BA GREEK SMALL LETTER KAPPA | Κ 039A GREEK CAPITAL LETTER KAPPA | κ 03BA GREEK SMALL LETTER KAPPA |
| | κ 03BA GREEK SMALL LETTER KAPPA | κ 03BA GREEK SMALL LETTER KAPPA |
| | ϰ 03F0 GREEK KAPPA SYMBOL | ϰ 03F0 GREEK KAPPA SYMBOL |
| μ 03BC GREEK SMALL LETTER MU | µ 00B5 MICRO SIGN | µ 00B5 MICRO SIGN |
| | Μ 039C GREEK CAPITAL LETTER MU | μ 03BC GREEK SMALL LETTER MU |
| | μ 03BC GREEK SMALL LETTER MU | μ 03BC GREEK SMALL LETTER MU |
| π 03C0 GREEK SMALL LETTER PI | Π 03A0 GREEK CAPITAL LETTER PI | π 03C0 GREEK SMALL LETTER PI |
| | π 03C0 GREEK SMALL LETTER PI | π 03C0 GREEK SMALL LETTER PI |
| | ϖ 03D6 GREEK PI SYMBOL | ϖ 03D6 GREEK PI SYMBOL |
| ρ 03C1 GREEK SMALL LETTER RHO | Ρ 03A1 GREEK CAPITAL LETTER RHO | ρ 03C1 GREEK SMALL LETTER RHO |
| | ρ 03C1 GREEK SMALL LETTER RHO | ρ 03C1 GREEK SMALL LETTER RHO |
| | ϱ 03F1 GREEK RHO SYMBOL | ϱ 03F1 GREEK RHO SYMBOL |
| σ 03C3 GREEK SMALL LETTER SIGMA | Σ 03A3 GREEK CAPITAL LETTER SIGMA | σ 03C3 GREEK SMALL LETTER SIGMA |
| | ς 03C2 GREEK SMALL LETTER FINAL SIGMA | ς 03C2 GREEK SMALL LETTER FINAL SIGMA |
| | σ 03C3 GREEK SMALL LETTER SIGMA | σ 03C3 GREEK SMALL LETTER SIGMA |
| ΰ 03C5 GREEK SMALL LETTER UPSILON; 0308 COMBINING DIAERESIS; 0301 COMBINING ACUTE ACCENT | ΰ 03B0 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS | ΰ 03B0 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS |
| | ΰ 1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA | ΰ 1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA |
| φ 03C6 GREEK SMALL LETTER PHI | Φ 03A6 GREEK CAPITAL LETTER PHI | φ 03C6 GREEK SMALL LETTER PHI |
| | φ 03C6 GREEK SMALL LETTER PHI | φ 03C6 GREEK SMALL LETTER PHI |
| | ϕ 03D5 GREEK PHI SYMBOL | ϕ 03D5 GREEK PHI SYMBOL |
| в 0432 CYRILLIC SMALL LETTER VE | В 0412 CYRILLIC CAPITAL LETTER VE | в 0432 CYRILLIC SMALL LETTER VE |
| | в 0432 CYRILLIC SMALL LETTER VE | в 0432 CYRILLIC SMALL LETTER VE |
| | ᲀ 1C80 CYRILLIC SMALL LETTER ROUNDED VE | ᲀ 1C80 CYRILLIC SMALL LETTER ROUNDED VE |
| д 0434 CYRILLIC SMALL LETTER DE | Д 0414 CYRILLIC CAPITAL LETTER DE | д 0434 CYRILLIC SMALL LETTER DE |
| | д 0434 CYRILLIC SMALL LETTER DE | д 0434 CYRILLIC SMALL LETTER DE |
| | ᲁ 1C81 CYRILLIC SMALL LETTER LONG-LEGGED DE | ᲁ 1C81 CYRILLIC SMALL LETTER LONG-LEGGED DE |
| о 043E CYRILLIC SMALL LETTER O | О 041E CYRILLIC CAPITAL LETTER O | о 043E CYRILLIC SMALL LETTER O |
| | о 043E CYRILLIC SMALL LETTER O | о 043E CYRILLIC SMALL LETTER O |
| | ᲂ 1C82 CYRILLIC SMALL LETTER NARROW O | ᲂ 1C82 CYRILLIC SMALL LETTER NARROW O |
| с 0441 CYRILLIC SMALL LETTER ES | С 0421 CYRILLIC CAPITAL LETTER ES | с 0441 CYRILLIC SMALL LETTER ES |
| | с 0441 CYRILLIC SMALL LETTER ES | с 0441 CYRILLIC SMALL LETTER ES |
| | ᲃ 1C83 CYRILLIC SMALL LETTER WIDE ES | ᲃ 1C83 CYRILLIC SMALL LETTER WIDE ES |
| т 0442 CYRILLIC SMALL LETTER TE | Т 0422 CYRILLIC CAPITAL LETTER TE | т 0442 CYRILLIC SMALL LETTER TE |
| | т 0442 CYRILLIC SMALL LETTER TE | т 0442 CYRILLIC SMALL LETTER TE |
| | ᲄ 1C84 CYRILLIC SMALL LETTER TALL TE | ᲄ 1C84 CYRILLIC SMALL LETTER TALL TE |
| | ᲅ 1C85 CYRILLIC SMALL LETTER THREE-LEGGED TE | ᲅ 1C85 CYRILLIC SMALL LETTER THREE-LEGGED TE |
| ъ 044A CYRILLIC SMALL LETTER HARD SIGN | Ъ 042A CYRILLIC CAPITAL LETTER HARD SIGN | ъ 044A CYRILLIC SMALL LETTER HARD SIGN |
| | ъ 044A CYRILLIC SMALL LETTER HARD SIGN | ъ 044A CYRILLIC SMALL LETTER HARD SIGN |
| | ᲆ 1C86 CYRILLIC SMALL LETTER TALL HARD SIGN | ᲆ 1C86 CYRILLIC SMALL LETTER TALL HARD SIGN |
| ѣ 0463 CYRILLIC SMALL LETTER YAT | Ѣ 0462 CYRILLIC CAPITAL LETTER YAT | ѣ 0463 CYRILLIC SMALL LETTER YAT |
| | ѣ 0463 CYRILLIC SMALL LETTER YAT | ѣ 0463 CYRILLIC SMALL LETTER YAT |
| | ᲇ 1C87 CYRILLIC SMALL LETTER TALL YAT | ᲇ 1C87 CYRILLIC SMALL LETTER TALL YAT |
| ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE | Ṡ 1E60 LATIN CAPITAL LETTER S WITH DOT ABOVE | ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE |
| | ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE | ṡ 1E61 LATIN SMALL LETTER S WITH DOT ABOVE |
| | ẛ 1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE | ẛ 1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE |
| ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK | ᲈ 1C88 CYRILLIC SMALL LETTER UNBLENDED UK | ᲈ 1C88 CYRILLIC SMALL LETTER UNBLENDED UK |
| | Ꙋ A64A CYRILLIC CAPITAL LETTER MONOGRAPH UK | ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK |
| | ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK | ꙋ A64B CYRILLIC SMALL LETTER MONOGRAPH UK |

And for lowercase(uppercase(X)).

| case fold | original | lower case of upper case |
| --- | --- | --- |
| ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S | ß 00DF LATIN SMALL LETTER SHARP S | ss 0073 LATIN SMALL LETTER S; 0073 LATIN SMALL LETTER S |
| | ẞ 1E9E LATIN CAPITAL LETTER SHARP S | ß 00DF LATIN SMALL LETTER SHARP S |

For uppercase(lowercase(s)), no equivalence group has multiple results.

Grave answered 15/4, 2021 at 10:29 Comment(0)
