Platform-specific Unicode semantics in Python 2.7

Ubuntu 11.10:

$ python
Python 2.7.2+ (default, Oct  4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\U0001f44d'
>>> len(x)
1
>>> ord(x[0])
128077

Windows 7:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\U0001f44d'
>>> len(x)
2
>>> ord(x[0])
55357
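
Incidentally, sys.maxunicode makes the difference visible directly: it reports 1114111 (0x10FFFF) on a wide build like the Ubuntu one, and 65535 (0xFFFF) on a narrow build like this Windows one:

>>> import sys
>>> sys.maxunicode
65535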

My Ubuntu experience is with the default interpreter in the distribution. For Windows 7 I downloaded and installed the recommended version linked from python.org. I did not compile either of them myself.

The nature of the difference is clear to me. (On Ubuntu the string is a sequence of code points; on Windows 7 a sequence of UTF-16 code units.) My questions are:

  • Why am I observing this difference in behavior? Is it due to how the interpreter is built, or a difference in dependent system libraries?
  • Is there any way to configure the Windows 7 interpreter to behave like the Ubuntu one, ideally from within Eclipse PyDev (my goal)?
  • If I have to rebuild, are there any prebuilt Windows 7 interpreters from a reliable source that behave like the Ubuntu one above?
  • Are there any workarounds to this issue besides manually counting surrogates in unicode strings on Windows only (blech)?
  • Does this justify a bug report? Is there any chance such a bug report would be addressed in 2.7?
Haloid asked 29/3, 2012 at 22:56. Comments (2):
Trucking: We talked about Unicode support at Python WM on Saturday. Folks said that the 3.3 I/O system has been back-ported to Python 2.7. It is midnight here, but someone might respond tomorrow. Is it very urgent?
Haloid: I've already started coding a workaround, so no, not urgent. It is disappointing that as late as 3.2 the Unicode string type in Python has wrinkles like this...

On Ubuntu, you have a "wide" Python build where strings are UTF-32/UCS-4. Unfortunately, this isn't (yet) available for Windows.

Windows builds will remain narrow for a while: there have been few requests for wide characters, those requests come mostly from hard-core programmers able to build their own Python, and Windows itself is strongly biased towards 16-bit characters.

Python 3.3 will have a flexible string representation (PEP 393), in which you will not need to care about whether Unicode strings use 16-bit or 32-bit code units.
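
For example, on CPython 3.3 or later the opening transcripts agree on every platform:

>>> x = '\U0001f44d'
>>> len(x)
1
>>> ord(x[0])
128077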

Until then, you can get the code points from a UTF-16 string with:

import struct

def code_points(text):
    # encode to UTF-32LE (little-endian, so no BOM is prepended),
    # then unpack one unsigned 32-bit unit per code point
    utf32 = text.encode('UTF-32LE')
    return struct.unpack('<{}I'.format(len(utf32) // 4), utf32)
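
For example, this should yield the single code point on either a narrow or a wide build:

>>> code_points(u'\U0001f44d')
(128077,)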
Campbellbannerman answered 29/3, 2012 at 23:12

Great question! I fell down this rabbit hole recently myself.

@dan04's answer inspired me to expand it into a unicode subclass that provides consistent indexing, slicing, and len() on both narrow and wide Python 2 builds:

class WideUnicode(unicode):
  """String class with consistent indexing, slicing, len() on both narrow and wide Python."""
  def __init__(self, *args, **kwargs):
    super(WideUnicode, self).__init__(*args, **kwargs)
    # use UTF-32LE to avoid a byte order mark at the beginning of the string
    self.__utf32le = unicode(self).encode('utf-32le')

  def __len__(self):
    # four bytes per code point
    return len(self.__utf32le) // 4

  def __getitem__(self, key):
    length = len(self)

    if isinstance(key, (int, long)):
      if key < 0:
        key += length
      if not 0 <= key < length:
        raise IndexError('string index out of range')
      key = slice(key, key + 1)

    # slice objects are immutable, so normalize the bounds (None, negative,
    # or out-of-range values) with slice.indices() instead of assigning to key.stop
    start, stop, step = key.indices(length)
    assert step == 1, 'extended slices with a step are not supported'

    return WideUnicode(self.__utf32le[start * 4:stop * 4].decode('utf-32le'))

  def __getslice__(self, i, j):
    return self.__getitem__(slice(i, j))

Open sourced here, public domain. Example usage:

text = WideUnicode(obj.text)
for tag in obj.tags:
  # start/end are code-point offsets for tag, from the original usage
  text = WideUnicode(text[:start] + tag.text + text[end:])

(simplified from this usage.)
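
A quick sanity check of the class above, which should pass on both narrow and wide builds:

>>> s = WideUnicode(u'\U0001f44d!')
>>> len(s)
2
>>> s[0] == u'\U0001f44d'
True
>>> s[1:]
u'!'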

Thanks @dan04!

Multimillionaire answered 23/2, 2017 at 17:20

I primarily needed an accurate length test, hence this function, which returns the code-point length of any unicode string whether the interpreter is a narrow or a wide build. If the data uses two surrogate literals instead of a single \U-style code point on a wide build, the returned length still counts them as one code point, as long as the surrogates are paired "correctly", i.e. as a narrow build would pair them.

invoke = lambda f: f()  # immediately-invoked function; trick borrowed from Node.js

@invoke
def ulen():
  # detect the build: a non-BMP literal has len 1 on wide builds, 2 on narrow ones
  testlength = len(u'\U00010000')
  assert (testlength == 1) or (testlength == 2)
  if testlength == 1:  # "wide" interpreters
    def closure(data):
      u'returns the number of Unicode code points in a unicode string'
      # round-tripping through UTF-16 merges any literal surrogate pairs
      # into single code points before len() counts them
      return len(data.encode('UTF-16BE').decode('UTF-16BE'))
  else:  # "narrow" interpreters
    def filt(c):
      # True for high (lead) surrogates, U+D800 through U+DBFF
      ordc = ord(c)
      return (ordc >= 0xD800) and (ordc < 0xDC00)
    def closure(data):
      u'returns the number of Unicode code points in a unicode string'
      # each surrogate pair contributes one extra UTF-16 unit; subtract it
      return len(data) - len(filter(filt, data))
  return closure  # ulen() body is therefore different on narrow vs wide builds

Test case, passes on narrow and wide builds:

from unittest import TestCase

class TestUlen(TestCase):

  def test_ulen(self):
    self.assertEqual(ulen(u'\ud83d\udc4d'), 1)
    self.assertEqual(ulen(u'\U0001F44D'), 1)
Haloid answered 19/4, 2012 at 18:59
