This is an old question, but I will add an answer for future reference that fills in the gaps of existing discussions of Turkish case mapping.
Python's string methods are language- and locale-insensitive, so languages such as Turkish, Azeri, Tatar, Kazakh, Lithuanian and polytonic Greek require special handling, as does Dutch titlecasing.
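As a small illustration of what "language insensitive" means in practice, the sketch below uses only built-in str methods. The sample words are my own choice and are not taken from the question.

```python
# Python applies the same language-neutral mapping whatever the language of the text.
dutch = "ijsselmeer".title()
turkish = "DIYARBAKIR".lower()

print(dutch)    # Dutch orthography expects the digraph capitalised: "IJsselmeer"
print(turkish)  # Turkish orthography expects a dotless ı: "diyarbakır"
```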
The locale module offers no solutions, since libc and POSIX provide no support for case mapping and case folding within their locale model. Case mapping and case folding are supported in CLDR based locale solutions.
Unicode defines simple and full case mapping and simple and full case folding. Simple case mapping operates on single characters and maps single characters to single characters, whereas full case mapping can map a single character to a sequence of characters. For instance, case insensitivity in some regular expression engines is based on simple case folding.
Full case mapping can be language and locale insensitive. This is what Python uses, and what the root locale in CLDR uses.
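A quick way to see that Python's str methods use full rather than simple mappings is to look at characters whose mapping expands to more than one character. This is only a sketch, and the ligature is an arbitrary sample character:

```python
ligature_fi = "\ufb01"  # ﬁ LATIN SMALL LIGATURE FI

# Full case mapping: one character can map to several characters.
upper_fi = ligature_fi.upper()

# Full case folding behaves the same way.
folded_fi = ligature_fi.casefold()

print(upper_fi, len(upper_fi))
print(folded_fi, len(folded_fi))
```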
But there is also language- and locale-sensitive case mapping. To access this in Python, you need to either write your own function or class to handle language-sensitive case mapping, or use PyICU. Other questions illustrate functions and classes for handling casing in languages that use the Common Turkic Alphabet, although they would need to be generalised to support other locale-sensitive case mappings.
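For the "write your own function" route, a minimal sketch for the dotted and dotless i might look like the following. The names turkic_lower and turkic_upper are hypothetical helpers of mine, not part of any library, and the sketch only covers the I/ı/İ/i mappings. It does not attempt the other locale-sensitive rules (Lithuanian, Dutch and so on).

```python
def turkic_lower(text: str) -> str:
    """Lowercase text using the Turkic dotted/dotless i rules (sketch only)."""
    # I followed by COMBINING DOT ABOVE is a decomposed İ, so it lowers to a plain i.
    text = text.replace("I\u0307", "i")
    # Precomposed İ <U+0130> lowers to i <U+0069>.
    text = text.replace("\u0130", "i")
    # I <U+0049> lowers to dotless ı <U+0131>.
    text = text.replace("I", "\u0131")
    # Everything else follows the language-insensitive mapping.
    return text.lower()


def turkic_upper(text: str) -> str:
    """Uppercase text using the Turkic dotted/dotless i rules (sketch only)."""
    # i <U+0069> uppers to İ <U+0130>, and ı <U+0131> uppers to I <U+0049>.
    text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()


print(turkic_lower("ISPARTA"))
print(turkic_upper("izmir"))
```

Replacing the special letters before calling the built-in method keeps the helper short, at the cost of hard-coding a single language family.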
PyICU exposes three classes that can be used for case mapping and case folding:
icu.UnicodeString supports locale-insensitive and locale-sensitive casing on UnicodeString objects.
icu.CaseMap supports locale-insensitive and locale-sensitive casing on string objects.
icu.Char supports simple casing on single-character strings.
These three classes are available in icu4c and icu4j, while icu4x uses icu::CaseMap
to access both full and simple casing (including locale insensitive and sensitive mappings).
To illustrate lowercasing of I <U+0049> and İ <U+0130>:
# <U+0049> lowers to <U+0069>
# This will roundtrip
print("\u0049".lower())
# i
# <U+0130> lowers to <U+0069, U+0307>
# "\u0130".lower().upper() will round trip, but Unicode normalisation form will change
print("\u0130".lower())
# i̇
It is important to note that Unicode case mapping is asymmetric, especially when using language or locale insensitive casing operations.
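The asymmetry can be checked with the standard library alone. The sketch below shows that the round trip of <U+0130> only holds under canonical equivalence, and that some characters do not round-trip at all. The sharp s is an extra sample character I have added for illustration.

```python
import unicodedata

dotted = "\u0130"             # İ
lowered = dotted.lower()      # i followed by U+0307 COMBINING DOT ABOVE
round_trip = lowered.upper()  # I followed by U+0307

# The code points differ after the round trip...
same_code_points = round_trip == dotted
# ...but the result is canonically equivalent to the original.
canonically_equal = unicodedata.normalize("NFC", round_trip) == dotted

# Some mappings cannot be reversed: ß uppercases to SS, which lowercases to ss.
sharp_s_round_trip = "\u00df".upper().lower()

print(same_code_points, canonically_equal, sharp_s_round_trip)
```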
If we look at icu.CaseMap:
import icu
tr = icu.Locale('tr')
root = icu.Locale('und')
# root = icu.Locale.getRoot()
cm = icu.CaseMap
# Using the root locale: <U+0049> lowers to <U+0069>
print(cm.toLower(root, "\u0049"))
# i
# Using the Turkish locale: <U+0049> lowers to <U+0131>
print(cm.toLower(tr, "\u0049"))
# ı
As can be seen, icu.CaseMap supports both language- and locale-sensitive and insensitive case mappings.
A partial function can help here, as a variation on the other answers, where a custom function is used for Turkish casing:
from functools import partial
import icu
tr_lower = partial(icu.CaseMap.toLower, icu.Locale('tr'))
print(tr_lower('\u0049'))
# ı
Likewise with icu.UnicodeString:
print(str(icu.UnicodeString('\u0049').toLower(root)))
# i
print(str(icu.UnicodeString('\u0049').toLower(tr)))
# ı
With icu.UnicodeString you need to create a UnicodeString instance from the Python string, and the method returns a UnicodeString object that must be converted back to a Python string, which makes the icu.CaseMap alternative more desirable.
icu.Char is locale- and language-insensitive, and does not operate on strings longer than one character. It uses simple case mapping:
# <U+0049> lowers to <U+0069> as expected.
icu.Char.tolower('\u0049')
# 'i'
# And <U+0130> also lowers to <U+0069>.
print(icu.Char.tolower('\u0130'))
# i
In simple case mapping both <U+0049> and <U+0130> lowercase to <U+0069>.
You may encounter this behaviour in embedded systems that opt for the lower overheads of simple case mapping and simple case folding.
One important thing to note is that Unicode has three cases: upper, lower and title.
In Python, str.title() isn't a case mapping operation but a string transformation, and its behaviour differs dramatically from Unicode's titlecasing. Assume we have a city name all in uppercase, and at some point our code does a language- and locale-insensitive lowercasing of the string, and later we need to title case it:
city = "İZMİR"
city = city.lower()
city.title()
# 'İZmi̇R'
İZMİR lowercases to i̇zmi̇r, which in turn title cases to İZmi̇R. Python does not consider combining diacritics (non-spacing marks) to be word-forming characters, so str.title() treats the letter after each combining dot as the start of a new word. You'll see similar behaviour with the metacharacter \w in the re module.
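The \w behaviour is easy to confirm with the standard library. This is just a sketch that reuses the city name from above:

```python
import re
import unicodedata

lowered = "\u0130ZM\u0130R".lower()  # each İ becomes i + U+0307

# \w does not match the combining dot, so the "word" is split at each one.
words = re.findall(r"\w+", lowered)

# U+0307 is a non-spacing mark (general category Mn).
category = unicodedata.category("\u0307")

print(words, category)
```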
Python has a str.capitalize method, but there is no corresponding method in PyICU. Instead, you add an option to the toTitle method:
s = "türk dilleri"
print(s.capitalize())
# Türk dilleri
# The toTitle() method without the option:
print(cm.toTitle(tr, s))
# Türk Dilleri
# Adding the option:
print(cm.toTitle(tr, icu.U_TITLECASE_SENTENCES, s))
# Türk dilleri
print(cm.toTitle(tr, 64, s))
# Türk dilleri
The option is an enumerated constant for a bit mask: you can use either the constant icu.U_TITLECASE_SENTENCES or the integer 64.
Again, a partial function simplifies things:
tr_capitalize = partial(icu.CaseMap.toTitle, icu.Locale('tr'), icu.U_TITLECASE_SENTENCES)
print(tr_capitalize(s))
# Türk dilleri