Python UTF-8 Lowercase Turkish Specific Letter

Asked 26/9, 2013 at 14:25 Answered 23/5, 2024 at 1:26

Using Python 2.7:

>myCity = 'Isparta'
>myCity.lower()
>'isparta'
#-should be-
>'ısparta'

I tried some decoding (like myCity.decode("utf-8").lower()) but could not find how to do it.

How can I lowercase these kinds of letters? ('I' > 'ı', 'İ' > 'i', etc.)

EDIT: In Turkish, the lower case of 'I' is 'ı'. The upper case of 'i' is 'İ'.

Homeopathy answered 26/9, 2013 at 14:25 Comment(6)

Is that an ASCII capital letter Eye? If it's some non-ASCII character that looks like an ASCII character, it would be wise to name it unambiguously (for example, by including the code point). – Kadiyevka 26/9, 2013 at 14:28

it is the ASCII capital letter, I. – Homeopathy 26/9, 2013 at 14:29

Is there a language wherein the lowercase version of ASCII capital I ("I") is something other than ASCII lowercase I ("i")? otherwise, I'm horribly confused by this question, because what you are showing is exactly the proper behavior. – Pimentel 26/9, 2013 at 14:37

@KenB: Turkish for example. Which is why that culture is a common test for i18n-proofing code that compares user input with string literals. – Krantz 26/9, 2013 at 15:3

@Jeff Atwood once wrote about that; it is worth reading this article. Also, this is the best article written about the Turkish locale, I guess. – Mcfarlane 26/9, 2013 at 15:14

Well that is just darn interesting. I learned something new today. @FallenAngel, great link – Pimentel 26/9, 2013 at 15:47

Some have suggested using the tr_TR.utf8 locale. At least on Ubuntu, perhaps related to this bug, setting this locale does not produce the desired result:

import locale
locale.setlocale(locale.LC_ALL, 'tr_TR.utf8')

myCity = u'Isparta İsparta'
print(myCity.lower())
# isparta isparta

So if this bug affects you, as a workaround you could perform this translation yourself:

lower_map = {
    ord(u'I'): u'ı',
    ord(u'İ'): u'i',
    }

myCity = u'Isparta İsparta'
lowerCity = myCity.translate(lower_map)
print(lowerCity)
# ısparta isparta

prints

ısparta isparta
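The map above only translates the two special capitals, so any other uppercase letters stay uppercase. A minimal Python 3 sketch of the same idea, assuming the rest of the string can be handled by the default str.lower() and str.upper() (the names TR_LOWER, TR_UPPER, tr_lower and tr_upper are hypothetical, chosen for illustration):

```python
# Translate the Turkish-specific letters first, then let the default
# (locale-insensitive) case mapping handle every other character.
TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})
TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})


def tr_lower(text):
    """Lowercase text using the Turkish I/ı and İ/i pairs."""
    return text.translate(TR_LOWER).lower()


def tr_upper(text):
    """Uppercase text using the Turkish I/ı and İ/i pairs."""
    return text.translate(TR_UPPER).upper()


print(tr_lower("ISPARTA İSPARTA"))
print(tr_upper("ısparta isparta"))
```

Translating before calling lower() matters: once the default mapping has turned 'I' into 'i', the two Turkish letters can no longer be told apart.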

Torin answered 26/9, 2013 at 14:54 Comment(2)

There is one, actually. – Krantz 26/9, 2013 at 15:8

Obviously this is a late comment, but at least for python 3.7, there is no locale sensitive case comparison. See here on the locale page. – Nonsuit 29/4, 2019 at 0:44

You should use a new class derived from unicode, based on Emre's solution:

class unicode_tr(unicode):
    CHARMAP = {
        "to_upper": {
            u"ı": u"I",
            u"i": u"İ",
        },
        "to_lower": {
            u"I": u"ı",
            u"İ": u"i",
        }
    }

    def lower(self):
        for key, value in self.CHARMAP.get("to_lower").items():
            self = self.replace(key, value)
        return self.lower()

    def upper(self):
        for key, value in self.CHARMAP.get("to_upper").items():
            self = self.replace(key, value)
        return self.upper()

if __name__ == '__main__':
    print unicode_tr(u"kitap").upper()
    print unicode_tr(u"KİTAP").lower()

Gives

KİTAP
kitap

This should solve your problem.

Tramp answered 2/1, 2014 at 15:45 Comment(1)

Note that link-only answers are discouraged, SO answers should be the end-point of a search for a solution (vs. yet another stopover of references, which tend to get stale over time). Please consider adding a stand-alone synopsis here, keeping the link as a reference. – Fibroin 2/1, 2014 at 15:50

You can just use the .replace() method before changing to upper or lower case. In your case:

    myCity.replace('I', 'ı').lower()

Gandhi answered 11/6, 2019 at 8:30 Comment(0)

I forked and redesigned Emre's solution by monkey-patching the methods of the built-in unicode type. The advantage of this approach is that there is no need to use a subclass of unicode or to redefine unicode strings with my_unicode_string = unicode_tr(u'bla bla bla'). Just importing this module integrates seamlessly with native unicode strings.

https://github.com/technic-programming/unicode_tr

# -*- coding: utf8 -*-
# Redesigned by @guneysus

import __builtin__
from forbiddenfruit import curse

lcase_table = tuple(u'abcçdefgğhıijklmnoöprsştuüvyz')
ucase_table = tuple(u'ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ')

def upper(data):
    data = data.replace('i',u'İ')
    data = data.replace(u'ı',u'I')
    result = ''
    for char in data:
        try:
            char_index = lcase_table.index(char)
            ucase_char = ucase_table[char_index]
        except ValueError:  # char is not in the Turkish lowercase table
            ucase_char = char
        result += ucase_char
    return result

def lower(data):
    data = data.replace(u'İ',u'i')
    data = data.replace(u'I',u'ı')
    result = ''
    for char in data:
        try:
            char_index = ucase_table.index(char)
            lcase_char = lcase_table[char_index]
        except ValueError:  # char is not in the Turkish uppercase table
            lcase_char = char
        result += lcase_char
    return result

def capitalize(data):
    return data[0].upper() + data[1:].lower()

def title(data):
    return " ".join(map(lambda x: x.capitalize(), data.split()))

curse(__builtin__.unicode, 'upper', upper)
curse(__builtin__.unicode, 'lower', lower)
curse(__builtin__.unicode, 'capitalize', capitalize)
curse(__builtin__.unicode, 'title', title)

if __name__ == '__main__':
    print u'istanbul'.upper()
    print u'İSTANBUL'.lower()

Tramp answered 12/12, 2014 at 22:36 Comment(0)

You need to set the proper locale (I'm guessing tr-TR) with locale.setlocale(). Otherwise the default upper-lower mappings will be used, and if that default is en-US, the lowercase version of I is i.

Mitran answered 26/9, 2013 at 14:31 Comment(1)

I downvoted this answer because setting the locale to tr_TR does not change the behavior of str.upper/str.lower on the letters i/I. – Zitella 21/8, 2020 at 3:20

This is an old question, but I will add an answer for future reference that fills in the gaps of existing discussions of Turkish case mapping.

Python's string methods are language and locale insensitive, so languages like Turkish, Azeri, Tatar, Kazakh, Lithuanian and Polytonic Greek require special handling, as does Dutch titlecasing.

The locale module offers no solutions, since libc and POSIX provide no support for case mapping and case folding within their locale model. Case mapping and case folding are supported in CLDR based locale solutions.

Unicode defines simple and full case mapping and simple and full case folding. Simple case mapping operates on single characters and maps single characters to single characters, while full case mapping may map a single character to a sequence of characters. Case insensitivity in some regular expression engines is based on simple case folding, for instance.
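A short sketch of what full case mapping looks like from Python, using only built-in string methods. The characters chosen here are illustrative examples, not taken from the question:

```python
# Full case mapping can turn one character into several.
sharp_s = "\u00df"          # ß LATIN SMALL LETTER SHARP S
fi_ligature = "\ufb01"      # ﬁ LATIN SMALL LIGATURE FI
dotted_capital_i = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

print(sharp_s.upper())                # SS
print(fi_ligature.upper())            # FI
print(sharp_s.casefold())             # ss (full case folding)
print(len(dotted_capital_i.lower()))  # 2: U+0069 followed by U+0307
```

A simple case mapping, by contrast, always returns exactly one character per input character.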

Full case mapping can be language and locale insensitive. This is what Python uses, and what the root locale in CLDR uses.

But there is also language/locale sensitive case mapping. To access this in Python, you need to either write your own function or class to handle language sensitive case mapping, or you can use PyICU. Other answers illustrate functions and classes for handling casing with languages that use the Common Turkic Alphabet, although they would need to be generalised to support other locale sensitive case mappings.

PyICU exposes three classes that can be used for case mapping and case folding.

icu.UnicodeString supports locale in-/sensitive casing on UnicodeString objects.
icu.CaseMap supports locale in-/sensitive casing on string objects.
icu.Char supports simple casing on single character strings.

These three classes are available in icu4c and icu4j, while icu4x uses icu::CaseMap to access both full and simple casing (including locale insensitive and sensitive mappings).

To illustrate lowercasing of I <U+0049> and İ <U+0130>:

# <U+0049> lowers to <U+0069>
# This will roundtrip
print("\u0049".lower())
# i

# <U+0130> lowers to <U+0069, U+0307>
# "\u0130".lower().upper() will round trip, but Unicode normalisation form will change
print("\u0130".lower())
# i̇

It is important to note that Unicode case mapping is asymmetric, especially when using language or locale insensitive casing operations.
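The asymmetry can be checked with the standard library alone. This is a small sketch using unicodedata; the variable names are illustrative:

```python
import unicodedata

dotted_capital_i = "\u0130"  # İ
dotless_small_i = "\u0131"   # ı

# İ survives lower().upper(), but comes back decomposed as I + U+0307.
round_trip = dotted_capital_i.lower().upper()
print(round_trip == dotted_capital_i)                                # False
print(unicodedata.normalize("NFC", round_trip) == dotted_capital_i)  # True

# ı does not survive upper().lower(): it comes back as a dotted i.
print(dotless_small_i.upper())          # I
print(dotless_small_i.upper().lower())  # i
```

Normalising to NFC repairs the first case, but nothing recovers the dotless ı in the second, because the information was lost at the upper() step.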

If we look at icu.CaseMap:

import icu
tr = icu.Locale('tr')
root = icu.Locale('und')
# root = icu.Locale.getRoot()

cm = icu.CaseMap
# Using the root locale: <U+0049> lowers to <U+0069>
print(cm.toLower(root, "\u0049"))
# i
# Using the Turkish locale: <U+0049> lowers to <U+0131>
print(cm.toLower(tr, "\u0049"))
# ı

As can be seen, icu.CaseMap supports both locale/language sensitive and insensitive case mappings.

It is possible to write a partial function to assist, a variation of the other answers where a custom function is used for Turkish casing:

from functools import partial
import icu
tr_lower = partial(icu.CaseMap.toLower, icu.Locale('tr'))
print(tr_lower('\u0049'))
# ı

Likewise with icu.UnicodeString:

print(str(icu.UnicodeString('\u0049').toLower(root)))
# i
print(str(icu.UnicodeString('\u0049').toLower(tr)))
# ı

With icu.UnicodeString you need to create a UnicodeString instance from the Python string, and it returns a UnicodeString object that needs to be converted back to a Python string. This makes the icu.CaseMap alternative more desirable.

icu.Char is locale and language insensitive, and does not operate on a string of length greater than one. It uses simple case mapping:

# <U+0049> lowers to <U+0069> as expected.
icu.Char.tolower('\u0049')
# 'i'

# And <U+0130> also lowers to <U+0069>.
print(icu.Char.tolower('\u0130'))
# i

In simple case mapping both <U+0049> and <U+0130> lowercase to <U+0069>.

You may encounter this behaviour in embedded systems that opt for the lower overheads of simple case mapping and simple case folding.

One important thing to note is that Unicode has three cases: upper, lower and title.

In Python str.title() isn't a case mapping operation, it is a string transformation, and its behaviour differs dramatically from Unicode's. Assume we have a city name all in uppercase, and at some point our code does a language/locale insensitive lower-casing of the string, and later we need to title case it:

city = "İZMİR"
city = city.lower()
city.title()
# 'İZmi̇R'

İZMİR lower cases to i̇zmi̇r which in turn title cases to İZmi̇R. Python does not consider combining diacritics (non-spacing marks) as word forming characters. You'll see similar behaviour in the meta-character \w in the re module.
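Without PyICU, one way to sidestep the stray combining dot is to apply the Turkish mappings before any default casing, word by word. The following is a rough sketch rather than a full titlecasing implementation: it assumes words are separated by single spaces, and the names tr_lower, tr_upper and tr_title are hypothetical.

```python
TR_LOWER = str.maketrans({"I": "ı", "İ": "i"})
TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})


def tr_lower(text):
    return text.translate(TR_LOWER).lower()


def tr_upper(text):
    return text.translate(TR_UPPER).upper()


def tr_title(text):
    """Uppercase the first letter of each space-separated word, lowercase the rest."""
    return " ".join(tr_upper(word[:1]) + tr_lower(word[1:]) for word in text.split(" "))


print(tr_title("İZMİR"))
print(tr_title("istanbul"))

# The default route leaves U+0307 in the string and splits the word on it.
print("İZMİR".lower().title() == "I\u0307Zmi\u0307R")  # True
```

Because 'İ' is mapped straight to 'i', no U+0307 is ever produced and the word is not split.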

Python has a str.capitalize method, but there is no corresponding method in PyICU. Instead, you'd add an option to the toTitle method:

s = "türk dilleri"
print(s.capitalize())
# Türk dilleri

# The toTitle() method without the option:
print(cm.toTitle(tr, s))
# Türk Dilleri

# Adding the option:
print(cm.toTitle(tr, icu.U_TITLECASE_SENTENCES, s))
# Türk dilleri
print(cm.toTitle(tr, 64, s))
# Türk dilleri

The option is an enumerated constant used as a bit mask; you can pass either the constant icu.U_TITLECASE_SENTENCES or the integer 64.

So it is possible to simplify things by using a partial function:

tr_capitalize = partial(icu.CaseMap.toTitle, icu.Locale('tr'), icu.U_TITLECASE_SENTENCES)
print(tr_capitalize(s))
# Türk dilleri

Alejoa answered 23/5, 2024 at 1:26 Comment(0)
