How is the built-in function str.lower() implemented?

int _PyUnicode_ToLowerFull(Py_UCS4 ch, Py_UCS4 *res)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);

    if (ctype->flags & EXTENDED_CASE_MASK) {
        int index = ctype->lower & 0xFFFF;
        int n = ctype->lower >> 24;
        int i;
        for (i = 0; i < n; i++)
            res[i] = _PyUnicode_ExtendedCase[index + i];
        return n;
    }
    res[0] = ch + ctype->lower;
    return 1;
}

There are two branches in the function you show. Which branch runs depends on the flags field of the _PyUnicode_TypeRecord record for the character in question. If the EXTENDED_CASE_MASK bit is set, the more complicated branch runs; otherwise the simpler one is used.

Let's look at the simple part first:

res[0] = ch + ctype->lower;
return 1;

This simply adds the value of the lower field, as an offset, to the input code point, stores the result in the first slot of the res output argument, and returns 1 (since it has written one character).
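As a rough illustration of that offset arithmetic, here is a minimal Python sketch. The function simple_lower and the delta value are assumptions made for this example (the delta is just the distance between 'A' and 'a'), not names or data taken from CPython.

```python
def simple_lower(code_point, delta):
    """Model of the simple branch: one output code point, input plus an offset."""
    # Mirrors `res[0] = ch + ctype->lower; return 1;` by returning a 1-element list.
    return [code_point + delta]

# Assumed offset for 'A': the distance from U+0041 to U+0061.
delta_for_A = ord("a") - ord("A")

result = simple_lower(ord("A"), delta_for_A)
lowered = "".join(chr(cp) for cp in result)
print(lowered)  # a
```

The offset can be negative as well, which is how the same kind of field can describe an uppercase mapping.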

Now for the more complicated version:

int index = ctype->lower & 0xFFFF;
int n = ctype->lower >> 24;
int i;
for (i = 0; i < n; i++)
    res[i] = _PyUnicode_ExtendedCase[index + i];
return n;

In this version, the lower field is interpreted as two different numbers packed into one integer. The lowest 16 bits are index, while the bits from bit 24 upward become n (the number of characters to output). The code then loops over n entries of the _PyUnicode_ExtendedCase array, starting at index, and copies them into the res array. Finally it returns the number of characters written.
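The unpacking can be modelled in a few lines of Python. This is only a sketch: EXTENDED_CASE is a made-up stand-in for _PyUnicode_ExtendedCase, and pack is a hypothetical helper that exists here just to build a test value; neither reflects the real table contents.

```python
# Hypothetical stand-in for _PyUnicode_ExtendedCase; values chosen for illustration.
EXTENDED_CASE = [0x41, 0x42, 0x66, 0x6C, 0x43]

def pack(index, n):
    """Hypothetical helper: put the count in the top bits, the index in the low 16."""
    return (n << 24) | index

def unpack_and_copy(lower_field, table):
    """Model of the extended branch: split the field, then copy n entries."""
    index = lower_field & 0xFFFF   # lowest 16 bits: where to start in the table
    n = lower_field >> 24          # bits from bit 24 upward: how many entries
    return [table[index + i] for i in range(n)]

packed = pack(index=2, n=2)
out = unpack_and_copy(packed, EXTENDED_CASE)
print("".join(chr(cp) for cp in out))  # fl
```

Packing both numbers into the existing field means a record needs no extra storage for the multi-character case; only the flag bit says how the field should be read.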

This more complicated code is needed for code points whose case mapping is more than one character long. A common source is ligatures: single code points that represent two characters joined together (they exist in Unicode largely for historical reasons, such as having been cast on a single type block in movable-type printing). Such a ligature may exist in only one case if the letters in the other case don't overlap in the same way. For example, the character 'ﬂ' is a ligature of the lowercase characters 'f' and 'l'. No uppercase version of the ligature exists, so 'ﬂ'.upper() has to return a two-character string ('FL').
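The effect is easy to observe from Python itself. The snippet below checks the ligature example, and, since the question is about lower(), also shows one lowercase mapping that expands to two code points ('İ', U+0130). It assumes a Python 3 interpreter with full case mapping (3.3 or later).

```python
ligature = "\ufb02"   # 'ﬂ', LATIN SMALL LIGATURE FL
upper = ligature.upper()
print(len(ligature), len(upper), upper)  # 1 2 FL

dotted_i = "\u0130"   # 'İ', LATIN CAPITAL LETTER I WITH DOT ABOVE
lower = dotted_i.lower()
# The result is 'i' followed by U+0307 COMBINING DOT ABOVE.
print([hex(ord(c)) for c in lower])  # ['0x69', '0x307']
```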
