Should I use Python casefold?

I've recently been reading about casefold and string comparisons that ignore case. I've read that the MSDN guidance is to use InvariantCulture and to avoid toLowercase. From what I have read, though, casefold is like a more aggressive toLowercase. Should I use casefold in Python, or is there a more Pythonic standard to use instead? Also, does casefold pass the Turkey Test?

Grow answered 31/10, 2016 at 18:21 Comment(5)
1. What casefold does is explained in the docs. 2. What does "better" mean in this case? 3. What's the Turkish Test (and have you tried running it to find out)? Elbaelbart
@Elbaelbart Sorry, I meant "more Pythonic", and I also meant the Turkey Test. I just want to know what good programmers use when they want to do caseless comparisons in Python. Grow
@Elbaelbart The Turkey Test is described in more detail here: https://mcmap.net/q/41774/-what-is-the-turkey-test Snitch
Have you tried casefold to see for yourself whether it passes the Turkey Test? Elbaelbart
@Elbaelbart Honestly, I haven't had the time to try it, and I haven't yet encountered a situation where I would need casefold. This was just a question I had after some idle research. I'll be sure to post my results if I do get to testing it. In the end, my biggest question is still: is casefold the most Pythonic way to ignore case? Grow

1) In Python 3, casefold() should be used to implement caseless string matching.

Starting with Python 3.0, strings are stored as Unicode. Section 3.13 of the Unicode Standard defines default caseless matching as follows:

A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)

Python's casefold() implements Unicode's toCasefold(), so it should be used to implement caseless string matching. Case folding alone, however, does not cover some corner cases and does not pass the Turkey Test (see points 2 and 3).
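As a small illustration of how casefold() differs from lower(), here is a sketch using the German sharp s. The variable names are mine and are only for illustration.

```python
# Compare lower() and casefold() on a pair of strings where they differ.
upper_word = "STRASSE"
sharp_word = "stra\u00dfe"  # "straße", written with an escape for clarity

# lower() leaves U+00DF (ß) unchanged, so the two strings do not match.
lower_match = upper_word.lower() == sharp_word.lower()

# casefold() maps ß to "ss", so both strings fold to "strasse".
casefold_match = upper_word.casefold() == sharp_word.casefold()

print(lower_match)     # False
print(casefold_match)  # True
```

This is the kind of case where casefold() is "more aggressive" than lowercasing.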

2) As of Python 3.6, casefold() cannot pass the Turkey Test.

For two characters, uppercase I and dotted uppercase I, the Unicode Standard defines two different casefolding mappings.

The default (for non-Turkic languages):
I → i (U+0049 → U+0069)
İ → i̇ (U+0130 → U+0069 U+0307)

The alternative (for Turkic languages):
I → ı (U+0049 → U+0131)
İ → i (U+0130 → U+0069)

Python's casefold() applies only the default mapping, so it fails the Turkey Test. For example, the Turkish words "LİMANI" and "limanı" are caseless equivalents, but "LİMANI".casefold() == "limanı".casefold() returns False. There is no option to enable the alternative mapping.
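To see the default mapping at work, the following sketch inspects the code points that casefold() produces for the two characters in question. The strings use escapes so that the exact code points are unambiguous; the variable names are illustrative only.

```python
# Default (non-Turkic) case folding, as applied by str.casefold().
dotted_capital_i = "\u0130"  # İ
plain_capital_i = "\u0049"   # I

folded_dotted = dotted_capital_i.casefold()
folded_plain = plain_capital_i.casefold()

print([hex(ord(c)) for c in folded_dotted])  # ['0x69', '0x307']
print([hex(ord(c)) for c in folded_plain])   # ['0x69']

# "LİMANI" vs "limanı": the folded forms differ, so the comparison fails.
turkey_test = "L\u0130MANI".casefold() == "liman\u0131".casefold()
print(turkey_test)  # False
```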

3) How to do caseless string matching in Python 3.

Section 3.13 of the Unicode Standard describes several caseless matching algorithms. Canonical caseless matching would probably suit most use cases: it compares NFD(toCasefold(NFD(X))) with NFD(toCasefold(NFD(Y))), so canonically equivalent sequences such as precomposed and decomposed characters are handled. We only need to add an option to switch between non-Turkic and Turkic case folding.

import unicodedata

def normalize_NFD(string):
    # Canonical decomposition: precomposed characters become base + combining marks.
    return unicodedata.normalize('NFD', string)

def casefold_(string, include_special_i=False):
    if include_special_i:
        # Recompose first so that I + U+0307 becomes a single U+0130 (İ).
        string = unicodedata.normalize('NFC', string)
        # Turkic mappings: I (U+0049) -> ı (U+0131), İ (U+0130) -> i (U+0069).
        string = string.replace('\u0049', '\u0131')
        string = string.replace('\u0130', '\u0069')
    return string.casefold()

def casefold_NFD(string, include_special_i=False):
    # NFD(toCasefold(NFD(X))), as in canonical caseless matching.
    return normalize_NFD(casefold_(normalize_NFD(string), include_special_i))

def caseless_match(string1, string2, include_special_i=False):
    return casefold_NFD(string1, include_special_i) == casefold_NFD(string2, include_special_i)

casefold_() is a wrapper for Python's casefold(). If its parameter include_special_i is set to True, then it applies the Turkic mapping, and if it is set to False the default mapping is used.

caseless_match() performs canonical caseless matching of string1 and string2. If the strings are in a Turkic language, the include_special_i parameter must be set to True.

Examples:

>>> caseless_match('LİMANI', 'limanı', include_special_i=True)
True
>>> caseless_match('LİMANI', 'limanı')
False
>>> caseless_match('INTENSIVE', 'intensive', include_special_i=True)
False
>>> caseless_match('INTENSIVE', 'intensive')
True
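
The NFD steps matter even outside Turkic text. Here is a minimal sketch, assuming a precomposed and a decomposed spelling of the same letter; canonical_fold is a hypothetical helper written only for this example and uses the default (non-Turkic) mapping.

```python
import unicodedata

precomposed = "\u00c5"   # Å as a single code point
decomposed = "A\u030a"   # A followed by COMBINING RING ABOVE

# Plain casefold() keeps the two spellings distinct.
plain_match = precomposed.casefold() == decomposed.casefold()

def canonical_fold(text):
    # Hypothetical helper: NFD(toCasefold(NFD(text))) with the default mapping.
    decomposed_text = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed_text.casefold())

canonical_match = canonical_fold(precomposed) == canonical_fold(decomposed)

print(plain_match)      # False
print(canonical_match)  # True
```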
Pretorius answered 24/12, 2016 at 18:51 Comment(2)
Nice work. The casefold_ function doesn't need to end in a _ since it is not shadowing a builtin or keyword. Lepp
Alternatively, there is PyICU's implementation of case folding, which has an optional parameter to add the Turkic case-folding rules to Unicode full case folding. Gardenia
