How to use string.translate in Python to transliterate Cyrillic?

I'm getting a UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128) exception when I try to use string.maketrans in Python. The error is raised by translit2 in the following code (gist):

# -*- coding: utf-8 -*-

import string

def translit1(string):
    """ This function works just fine """
    capital_letters = {
        u'А': u'A',
        u'Б': u'B',
        u'В': u'V',
        u'Г': u'G',
        u'Д': u'D',
        u'Е': u'E',
        u'Ё': u'E',
        u'Ж': u'Zh',
        u'З': u'Z',
        u'И': u'I',
        u'Й': u'Y',
        u'К': u'K',
        u'Л': u'L',
        u'М': u'M',
        u'Н': u'N',
        u'О': u'O',
        u'П': u'P',
        u'Р': u'R',
        u'С': u'S',
        u'Т': u'T',
        u'У': u'U',
        u'Ф': u'F',
        u'Х': u'H',
        u'Ц': u'Ts',
        u'Ч': u'Ch',
        u'Ш': u'Sh',
        u'Щ': u'Sch',
        u'Ъ': u'',
        u'Ы': u'Y',
        u'Ь': u'',
        u'Э': u'E',
        u'Ю': u'Yu',
        u'Я': u'Ya'
    }

    lower_case_letters = {
        u'а': u'a',
        u'б': u'b',
        u'в': u'v',
        u'г': u'g',
        u'д': u'd',
        u'е': u'e',
        u'ё': u'e',
        u'ж': u'zh',
        u'з': u'z',
        u'и': u'i',
        u'й': u'y',
        u'к': u'k',
        u'л': u'l',
        u'м': u'm',
        u'н': u'n',
        u'о': u'o',
        u'п': u'p',
        u'р': u'r',
        u'с': u's',
        u'т': u't',
        u'у': u'u',
        u'ф': u'f',
        u'х': u'h',
        u'ц': u'ts',
        u'ч': u'ch',
        u'ш': u'sh',
        u'щ': u'sch',
        u'ъ': u'',
        u'ы': u'y',
        u'ь': u'',
        u'э': u'e',
        u'ю': u'yu',
        u'я': u'ya'
    }

    translit_string = ""

    for index, char in enumerate(string):
        if char in lower_case_letters.keys():
            char = lower_case_letters[char]
        elif char in capital_letters.keys():
            char = capital_letters[char]
            if len(string) > index+1:
                if string[index+1] not in lower_case_letters.keys():
                    char = char.upper()
            else:
                char = char.upper()
        translit_string += char

    return translit_string


def translit2(text):
    """ This method should be more easy to grasp, 
    but throws exception:
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)
    """

    symbols = string.maketrans(u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
                               u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
    sequence = {
        u'ж':'zh',
        u'ц':'ts',
        u'ч':'ch',
        u'ш':'sh',
        u'щ':'sch',
        u'ю':'ju',
        u'я':'ja',
        u'Ж':'Zh',
        u'Ц':'Ts',
        u'Ч':'Ch'
    }

    for char in sequence.keys():
        text = text.replace(char, sequence[char])

    return text.translate(symbols)

if __name__ == "__main__":
    print translit1(u"Привет") # prints Privet as expected
    print translit2(u"Привет") # throws exception: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)

Original trace:

Traceback (most recent call last):
  File "translit_error.py", line 124, in <module>
    print translit2(u"Привет") # throws exception: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)
  File "translit_error.py", line 103, in translit2
    u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-51: ordinal not in range(128)

Why does Python's string.maketrans try to use the ASCII codec at all? And how come English alphabet letters are reported as being outside the 0-128 range?

$ python -c "print ord(u'A')"
65
$ python -c "print ord(u'z')"
122
$ python -c "print ord(u\"'\")"
39

After several hours of trying to solve this issue I'm completely exhausted.

Can someone explain what is happening and how to fix it?

Gaul answered 5/1, 2013 at 15:49 Comment(5)
What version of Python are you using? IIRC, Python 2 maketrans cannot handle non-ASCII characters. (But Python 3 should be fine.)Cryptogram
From what I remember, the unicode version of maketrans requires you to map unicode characters to ordinals (I don't know why).Backrest
Python 2.7.3 - sorry, I didn't specify it. It's sad that this isn't mentioned in the string.maketrans documentation.Gaul
Take a look at the unidecode module. It transliterates pretty well.Backrest
Thank you guys, you saved me a lot of time. @Blender, unfortunately unidecode is not an option for me (explained below in @thg345's answer), though it's handy to use. @kojiro, the code above works fine with Python 3, just as you said.Gaul
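
A note on the cause hinted at in the comments: in Python 2, string.maketrans builds a byte-oriented table, so unicode arguments appear to be implicitly encoded with the default ASCII codec first. The first argument holds 52 Cyrillic characters, which matches "position 0-51" in the traceback, so the complaint is about the Cyrillic string and not the Latin one. A minimal sketch that reproduces the same failure explicitly, written for Python 3 as an assumption (try_ascii is a hypothetical helper, not part of any library):

```python
# Python 3 sketch: make the implicit ASCII encode explicit.
# try_ascii is an illustrative helper, not a library function.
cyrillic = "абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ"
latin = "abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E"


def try_ascii(text):
    """Return None if text encodes to ASCII, else the UnicodeEncodeError."""
    try:
        text.encode("ascii")
        return None
    except UnicodeEncodeError as exc:
        return exc


latin_error = try_ascii(latin)        # the Latin side encodes without error
cyrillic_error = try_ascii(cyrillic)  # the Cyrillic side is what fails

print(latin_error)
print(cyrillic_error)
```

The error's start and end attributes cover the whole Cyrillic string, which is why the message names a range rather than a single position.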

translate behaves differently when used with unicode strings. Instead of a maketrans table, you have to provide a dictionary ord(search)->ord(replace):

symbols = (u"абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ",
           u"abvgdeejzijklmnoprstufhzcss_y_euaABVGDEEJZIJKLMNOPRSTUFHZCSS_Y_EUA")

tr = {ord(a):ord(b) for a, b in zip(*symbols)}

# for Python 2.6 and earlier (no dict comprehensions):
# tr = dict( [ (ord(a), ord(b)) for (a, b) in zip(*symbols) ] )

text = u'Добрый Ден'
print text.translate(tr)  # looks good

That said, I'd second the suggestion not to reinvent the wheel and to use an established library: http://pypi.python.org/pypi/Unidecode
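
The values in a translate table are not limited to single code points. In Python 3, str.translate accepts a mapping whose values are ordinals, strings of any length, or None, so digraphs such as ж -> zh can go through the same table. A minimal sketch, assuming Python 3; the mapping is a small lowercase-only subset chosen for illustration, not a complete or standard transliteration scheme:

```python
# Python 3 sketch; the mapping is an illustrative lowercase subset,
# not a standard transliteration scheme.
single = str.maketrans("абвгдезийклмнопрстуфхыэ",
                       "abvgdezijklmnoprstufhye")
multi = {
    "ж": "zh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "sch",
    "ю": "yu", "я": "ya", "ё": "e",
    "ъ": "", "ь": "",  # an empty string drops the character
}

table = dict(single)
table.update(str.maketrans(multi))  # one-argument form: dict of char -> str


def translit(text):
    """Transliterate lowercase Russian text; unmapped characters pass through."""
    return text.translate(table)


print(translit("привет, щука!"))
```

Characters missing from the table (punctuation, digits, Latin letters, and here also uppercase Cyrillic) are left unchanged by translate.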

Ivie answered 5/1, 2013 at 16:1 Comment(5)
Thanks, this works just fine. But I'm still mad that it's not pointed out in the documentation for string.maketrans :) Unfortunately, unidecode handles some Russian letters rather poorly. My goal is to make URL slugs for (guess who? Google, of course) from titles written in Russian, so I need a transliteration that Google would "understand". I tried unidecode on one word in Russian and fed the result to Google, and wasn't satisfied: Google said "maybe you meant <another word>".Gaul
Some letters were missing, so the C-V, C-P approach caused an error :) The full Russian-alphabet version should look like this: ` symbols = (u"абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ", u"abvgdeejzijklmnoprstufhzcss_y_euaABVGDEEJZIJKLMNOPRSTUFHZCSS_Y_EUA")` .Freeman
@MInner: this is not going to work with translate, because it can only do one to one replacement.Ivie
@thg435 that is one-to-one. Russian letters just seem to be somehow wider :) That wasn't meant as criticism, just something I spent a few minutes on, wondering why everything fails when the letter "ч" is in the string.Freeman
@MInner: ok, looks like I misunderstood your comment. Would you mind editing this info in, so that others with the same problem have an immediate copy-paste solution?Ivie

You can use the transliterate package (https://pypi.python.org/pypi/transliterate).

Example #1:

from transliterate import translit
print translit("Lorem ipsum dolor sit amet", "ru")
# Лорем ипсум долор сит амет

Example #2:

print translit(u"Лорем ипсум долор сит амет", "ru", reversed=True)
# Lorem ipsum dolor sit amet
Croquette answered 11/7, 2013 at 7:22 Comment(1)
Transliterate 1.7.3 has some issues with the Greek language: github.com/barseghyanartur/transliterate/issues/8Carleencarlen

Check out the CyrTranslit package, it's specifically made to transliterate from and to Cyrillic script text. It currently supports Serbian, Montenegrin, Macedonian, and Russian.

Example usage:

>>> import cyrtranslit
>>> cyrtranslit.supported()
['me', 'sr', 'mk', 'ru']

>>> cyrtranslit.to_latin('Моё судно на воздушной подушке полно угрей', 'ru')
'Moyo sudno na vozdushnoj podushke polno ugrej'

>>> cyrtranslit.to_cyrillic('Moyo sudno na vozdushnoj podushke polno ugrej')
'Моё судно на воздушной подушке полно угрей'
Incorporation answered 18/2, 2017 at 3:57 Comment(1)
Thank you, I found this very useful.Mechanism

Here is another short solution with more accurate transliteration:

symbols = (u"абвгдеёжзийклмнопрстуфхцчшщъыьэюяАБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ ",
    (*list(u'abvgdee'), 'zh', *list(u'zijklmnoprstuf'), 'kh', 'z', 'ch', 'sh', 'sh', '',
    'y', '', 'e', 'yu','ya', *list(u'ABVGDEE'), 'ZH', 
    *list(u'ZIJKLMNOPRSTUF'), 'KH', 'Z', 'CH', 'SH', 'SH', *list(u'_Y_E'), 'YU', 'YA', ' '))

coding_dict = {source: dest for source, dest in zip(*symbols)}
translate = lambda x: ''.join([coding_dict[i] for i in x])

text = u'Добро пожаловать'
translate(text)
# 'Dobro pozhalovat'
Arad answered 3/2, 2021 at 11:1 Comment(1)
It still has an issue: it fails if the text contains any character other than Cyrillic ones.Photon
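
The failure described in the comment is a KeyError from indexing coding_dict directly. One possible workaround, sketched here for Python 3 with a deliberately tiny stand-in dictionary (an assumption for brevity, not the full table from the answer), is to fall back to the original character via dict.get:

```python
# Python 3 sketch; coding_dict is a small stand-in for the full table
# built in the answer, and translate_safe is a hypothetical name.
coding_dict = {
    "д": "d", "о": "o", "б": "b", "р": "r",
    "Д": "D", "ж": "zh", "ь": "",
}


def translate_safe(text):
    """Map known characters; leave digits, Latin letters, punctuation as is."""
    return "".join(coding_dict.get(ch, ch) for ch in text)


print(translate_safe("Добро 2021, ok!"))
```

The same fallback can be applied to the full dictionary from the answer by swapping coding_dict[i] for coding_dict.get(i, i).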
