Ubuntu 11.10:
$ python
Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\U0001f44d'
>>> len(x)
1
>>> ord(x[0])
128077
Windows 7:
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\U0001f44d'
>>> len(x)
2
>>> ord(x[0])
55357
On Ubuntu I am using the distribution's default interpreter. On Windows 7 I downloaded and installed the recommended version linked from python.org. I did not compile either of them myself.
The nature of the difference is clear to me: on Ubuntu the string is a sequence of code points, while on Windows 7 it is a sequence of UTF-16 code units.
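For reference, the two flavors can be told apart by inspecting sys.maxunicode, which is 0x10FFFF on a build whose strings hold code points and 0xFFFF on a build whose strings hold UTF-16 code units. A minimal check (the labels in the print calls are just my own wording):

import sys

# 1114111 (0x10FFFF) when unicode strings hold code points,
# 65535 (0xFFFF) when they hold UTF-16 code units
print(sys.maxunicode)

if sys.maxunicode > 0xFFFF:
    print("code points: len(u'\\U0001f44d') == 1")
else:
    print("UTF-16 code units: len(u'\\U0001f44d') == 2")

My questions are: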
- Why am I observing this difference in behavior? Is it due to how the interpreter is built, or a difference in dependent system libraries?
- Is there any way to configure the Windows 7 interpreter to agree with the Ubuntu one, ideally something I can do from within Eclipse PyDev (my goal)?
- If rebuilding the interpreter is the only option, are there prebuilt Windows 7 interpreters from a reliable source that behave like the Ubuntu one above?
- Are there any workarounds to this issue besides manually counting surrogates in unicode strings on Windows only (blech)? (The kind of helper I would rather avoid is sketched after this list.)
- Does this justify a bug report? Is there any chance such a bug report would be addressed in 2.7?
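To make the workaround question concrete, this is roughly the sort of surrogate-counting helper I am hoping not to have to maintain (code_point_len is just a name I made up, not anything from the standard library). It counts a high surrogate followed by a low surrogate as one code point, and falls back to plain len() on an interpreter whose strings already hold code points:

import sys

def code_point_len(u):
    """Length of a unicode string in code points rather than UTF-16 code units."""
    if sys.maxunicode > 0xFFFF:
        return len(u)  # strings already hold code points
    count = 0
    i = 0
    while i < len(u):
        # A high surrogate (U+D800..U+DBFF) followed by a low surrogate
        # (U+DC00..U+DFFF) encodes a single code point outside the BMP.
        if (0xD800 <= ord(u[i]) <= 0xDBFF and i + 1 < len(u)
                and 0xDC00 <= ord(u[i + 1]) <= 0xDFFF):
            i += 2
        else:
            i += 1
        count += 1
    return count

print(code_point_len(u'\U0001f44d'))  # 1 on both interpreters

Even with such a helper, indexing and slicing still operate on code units on the Windows build, which is why this feels like a partial workaround at best.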