How is the built-in function str.lower() implemented?

I wondered how str.lower() is implemented in Python, so I cloned the cpython repository and searched it with grep. After a few jumps starting from unicode_lower in Objects/unicodeobject.c, I came across this in Objects/unicodetype.c:

int _PyUnicode_ToLowerFull(Py_UCS4 ch, Py_UCS4 *res)
{
    const _PyUnicode_TypeRecord *ctype = gettyperecord(ch);

    if (ctype->flags & EXTENDED_CASE_MASK) {
        int index = ctype->lower & 0xFFFF;
        int n = ctype->lower >> 24;
        int i;
        for (i = 0; i < n; i++)
            res[i] = _PyUnicode_ExtendedCase[index + i];
        return n;
    }
    res[0] = ch + ctype->lower;
    return 1;
}

I am familiar with C, but pretty unfamiliar with how Python is implemented (and I want to change that!). I don't really understand what is going on here, so I'm looking for a clear explanation.

Confluent answered 1/2, 2017 at 6:52 Comment(1)
@Recondition Thanks for bringing me into the present :D – Tragicomedy

There are two branches in the function you show. Which one runs depends on the flags field of the _PyUnicode_TypeRecord for the character in question. If the EXTENDED_CASE_MASK bit is set, a more complicated bit of code runs; otherwise a simpler version is used.

Let's look at the simple part first:

res[0] = ch + ctype->lower;
return 1;

This adds the value of the lower field, used as an offset, to the input code point, stores the result in the first slot of the res output argument, and returns 1 (since it wrote one character).
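
To make the offset idea concrete, here is a rough Python sketch of the one-to-one branch. It does not read CPython's internal table; the helper names are made up for illustration, and the offset is derived from str.lower() itself, which is an assumption standing in for ctype->lower.

```python
# Sketch of the one-to-one branch: lowercase = code point + per-character offset.
# ASSUMPTION: the offset is recovered from Python's own str.lower(), not read
# from CPython's internal type-record table. Both helpers are hypothetical.

def simple_lower_offset(ch: str) -> int:
    """Return the offset that maps ch to its lowercase form (one-to-one only)."""
    lowered = ch.lower()
    if len(lowered) != 1:
        raise ValueError("not a one-to-one mapping")
    return ord(lowered) - ord(ch)


def apply_offset(ch: str, offset: int) -> str:
    """Mimic `res[0] = ch + ctype->lower` for a single code point."""
    return chr(ord(ch) + offset)


offset_A = simple_lower_offset("A")  # 32, because ord("a") - ord("A") == 32
offset_a = simple_lower_offset("a")  # 0, already lowercase
lowered_A = apply_offset("A", offset_A)  # "a"
```

A character that is already lowercase, or that has no case at all, just gets an offset of 0, so the same addition works for every one-to-one mapping.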

Now for the more complicated version:

int index = ctype->lower & 0xFFFF;
int n = ctype->lower >> 24;
int i;
for (i = 0; i < n; i++)
    res[i] = _PyUnicode_ExtendedCase[index + i];
return n;

In this version, the lower field is interpreted as two packed numbers. The lowest 16 bits are index, while the bits from 24 upward become n (the number of characters to output). The code then copies n entries from the _PyUnicode_ExtendedCase array, starting at index, into the res array. Finally it returns the number of characters written.
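
The bit packing can be sketched in Python as follows. EXTENDED_CASE, pack, and unpack_and_copy are hypothetical names, and the table contents are illustrative values, not a copy of the real _PyUnicode_ExtendedCase data.

```python
from typing import List

# Hypothetical stand-in for _PyUnicode_ExtendedCase: a flat list of code points.
# The values are illustrative only, not taken from CPython's generated table.
EXTENDED_CASE = [0x0069, 0x0307, 0x0046, 0x004C]


def pack(index: int, n: int) -> int:
    """Pack an index (low 16 bits) and a length (bits 24 and up) into one int."""
    return (n << 24) | (index & 0xFFFF)


def unpack_and_copy(lower_field: int) -> List[int]:
    """Mirror the extended branch: split the field, then copy n entries."""
    index = lower_field & 0xFFFF  # low 16 bits: where the run starts
    n = lower_field >> 24         # high bits: how many code points to copy
    return [EXTENDED_CASE[index + i] for i in range(n)]


packed = pack(index=0, n=2)
result = "".join(chr(cp) for cp in unpack_and_copy(packed))  # "i" + U+0307
```

Note that the code shown in the question never looks at bits 16 to 23 of the field; only the low 16 bits and the bits from 24 upward matter to this function.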

This more complicated code is needed for Unicode code points whose case mapping is not one-to-one. A common source is ligatures: single code points that represent two characters (generally for obscure historical reasons, such as having been cast on a single type block in movable-type printing). A ligature may exist in only one case if the characters in the other case don't overlap as much. As an example, the character 'ﬂ' (U+FB02) is a ligature of the lowercase characters 'f' and 'l'. No uppercase version of the ligature exists, so 'ﬂ'.upper() needs to return a two-character string ('FL').
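
You can see these one-to-many mappings from Python itself. The snippet below only uses str.upper() and str.lower(); the variable names are just for illustration. The last line shows a case that goes through the multi-character path for lower(), which is the function you asked about.

```python
# One code point in, more than one code point out.
ligature = "\ufb02"            # LATIN SMALL LIGATURE FL
upper_fl = ligature.upper()    # two code points: "F" and "L"

sharp_s = "\u00df".upper()     # LATIN SMALL LETTER SHARP S uppercases to "SS"

# For lower(): LATIN CAPITAL LETTER I WITH DOT ABOVE becomes
# "i" followed by U+0307 COMBINING DOT ABOVE.
dotted_i = "\u0130".lower()
```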

Recondition answered 1/2, 2017 at 9:7 Comment(2)
+1 for bringing up the case of ligatures. Just for my understanding: does _PyUnicode_ExtendedCase basically act as a mapping from lower/upper case to lower case? Where does _PyUnicode_ExtendedCase come from, and what does it look like? I am also still unclear about the 'simple part'. Why is only the first element in the array modified? Is it only considering capitalized words? – Confluent
The _PyUnicode_ExtendedCase array maps between all kinds of cases, but it's only used where one character in one case maps to multiple characters in another case. One-to-one mappings are handled in what I was calling the "simple case", which is why that part of the code only assigns to one slot in the output array. The data (including both _PyUnicode_ExtendedCase and a big table of _PyUnicode_TypeRecords, among other things) is defined in the Objects/unicodetype_db.h file. – Recondition
