Python 2 maketrans() function doesn't work with Unicode: "maketrans arguments must have same length" when they appear to be the same length

[Python 2] SUB = string.maketrans("0123456789","₀₁₂₃₄₅₆₇₈₉")

this code produces the error:

ValueError: maketrans arguments must have same length

I am unsure why this occurs, because the two strings appear to be the same length. My only idea is that the subscript characters somehow have a different length than standard characters, but I don't know how to get around this.

Deckard answered 7/5, 2015 at 18:23 Comment(3)
Works fine in Python 3 (which does have much nicer Unicode support anyway); is that an option for you? - Assignat
Currently I'm running Python 2.7, but I will be sure to take a look at Python 3. - Deckard
That Python 3 code is from @ZeroPiraeus' neat answer to "Printing subscript in python". - Depicture

No, the arguments are not the same length:

>>> len("0123456789")
10
>>> len("₀₁₂₃₄₅₆₇₈₉")
30

You are passing in encoded byte strings, not Unicode text; my terminal uses UTF-8, where each subscript digit is encoded as 3 bytes, so the second argument is 30 bytes long.
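The 10-versus-30 difference can be reproduced without depending on terminal settings. A minimal sketch, assuming UTF-8 as the encoding (the variable names are only illustrative), written so that it should run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# Illustrative names; UTF-8 is assumed as the byte encoding.
subscripts = u"₀₁₂₃₄₅₆₇₈₉"            # Unicode text: 10 code points
encoded = subscripts.encode("utf-8")   # byte string: 3 bytes per subscript digit

print(len(subscripts))  # 10
print(len(encoded))     # 30
```

In Python 2, a plain "..." literal typed in a UTF-8 terminal is the encoded form, which is why string.maketrans() sees 30 bytes on one side and 10 on the other.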

You cannot use str.translate() to map ASCII bytes to UTF-8 byte sequences. Decode your string to unicode and use the slightly different unicode.translate() method; it takes a dictionary instead:

nummap = {ord(c): ord(t) for c, t in zip(u"0123456789", u"₀₁₂₃₄₅₆₇₈₉")}

This creates a dictionary mapping Unicode codepoints (integers), which you can then use on a Unicode string:

>>> nummap = {ord(c): ord(t) for c, t in zip(u"0123456789", u"₀₁₂₃₄₅₆₇₈₉")}
>>> u'99 bottles of beer on the wall'.translate(nummap)
u'\u2089\u2089 bottles of beer on the wall'
>>> print u'99 bottles of beer on the wall'.translate(nummap)
₉₉ bottles of beer on the wall

You can then encode the output to UTF-8 again if you so wish.
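If your data arrives as bytes, the whole round trip (decode, translate, encode) could look roughly like this. This is a sketch, not the only way to do it: subscript_digits is a hypothetical helper name, and UTF-8 is assumed for both input and output.

```python
# -*- coding: utf-8 -*-
# Map the code point of each ASCII digit to the code point of its subscript form.
NUMMAP = {ord(c): ord(t) for c, t in zip(u"0123456789", u"₀₁₂₃₄₅₆₇₈₉")}

def subscript_digits(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: decode bytes, translate digits, encode again."""
    text = raw_bytes.decode(encoding)        # bytes -> Unicode text
    translated = text.translate(NUMMAP)      # dictionary-based translate
    return translated.encode(encoding)       # Unicode text -> bytes

result = subscript_digits(b"H2O and CO2")
```

Doing the translation on decoded text keeps the one-to-one mapping between characters; only the final encode step deals with multi-byte sequences.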

From the method documentation:

For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted.
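The three kinds of values the documentation mentions (ordinals, Unicode strings, None) can be mixed in one table. A small sketch with a made-up table, just to illustrate the behaviour:

```python
# -*- coding: utf-8 -*-
# Made-up table for illustration; keys are always Unicode ordinals.
table = {
    ord(u"2"): 0x2082,    # ordinal -> ordinal (SUBSCRIPT TWO)
    ord(u"&"): u"and",    # ordinal -> Unicode string (may be longer than one character)
    ord(u"-"): None,      # ordinal -> None deletes the character
}

translated = u"H2O & CO-2".translate(table)
```

Here translated is u'H\u2082O and CO\u2082'; characters without an entry in the table, such as the letters and spaces, pass through unchanged.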

Discrepancy answered 7/5, 2015 at 18:25 Comment(6)
Is there any other way to get subscript characters in Python, or even a way to overcome this length difference? - Deckard
Aaron: this is not a limitation of Python, but rather a consequence of the differences between ASCII and Unicode. There are no "subscript characters" in ASCII. The implication of using Unicode characters is that Python cannot treat them as if they were ASCII; any attempt to do so may work in some cases but will break in others. - Bombazine
@Martijn Where did you get 30? I get either 10 or "Unsupported characters in input", depending on where I try it. - Assignat
@StefanPochmann: using the interactive interpreter in a terminal configured for UTF-8 use. - Discrepancy
Only in Python 2. The length is 30 in Python 2 and 10 in Python 3. OP's code works fine in Python 3. - Depicture
@Depicture exactly; you'll only see this specific error in Python 2 because these are byte strings. That's why the question is tagged with the python-2.x tag. - Discrepancy
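For reference, a sketch of the Python 3 variant the comments refer to. This one is Python 3 only: there, string literals are Unicode text, and str.maketrans() takes two equal-length strings directly. The name water is just an example.

```python
# Python 3 only: str literals are Unicode text, so both arguments have length 10.
SUB = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

water = "H2O".translate(SUB)   # digits replaced by their subscript forms
```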
