Endianness only applies to multi-byte words, but UTF-8 uses units of 8 bits to encode information (that's what the 8 in the name stands for), so there is never any question of byte ordering.
Sometimes UTF-8 needs more than one of those units to encode a character, but the units remain distinct. The letter A is one byte, 0x41, for example. When it has to encode a character with more bytes, it uses a leading indicator byte, followed by extra continuation bytes to capture all the information needed for that character. Logically, these are distinct single-byte units.
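To make that concrete, here is a small sketch in the same interactive style (using U+4F60, the character from your file, as the multi-byte example):

>>> u'A'.encode('utf-8')        # one unit: a single byte
'A'
>>> u'\u4f60'.encode('utf-8')   # three units: a lead byte plus 2 continuation bytes
'\xe4\xbd\xa0'

The lead byte 0xE4 starts with the bits 1110 (announcing a 3-byte sequence) and both continuation bytes start with 10; every byte is still read on its own, so nothing needs re-ordering.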
GBK uses a similar scheme; characters use units of 1 byte, and, just as in UTF-8, a second byte is used for some of the characters.
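The same sketch for GBK; 'A' is still a single byte, while U+4F60 becomes the two-byte sequence 0xC4 0xE3:

>>> u'A'.encode('gbk')
'A'
>>> u'\u4f60'.encode('gbk')     # a two-byte GBK character, still read byte by byte
'\xc4\xe3'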
UCS-2 (and its successor, UTF-16), on the other hand, is a 2-byte format. It encodes information in units of 16 bits, and those 16 bits always go together. The 2 bytes in that unit belong together logically, and modern architectures treat them as one unit, and so have had to decide in what order to store them. That's where endianness comes in: the order of the 2 bytes within a unit is architecture dependent. On your architecture, the bytes are ordered using little-endianness, meaning that the 'smaller' (least significant) byte goes first. This is why the 0x60 byte comes before the 0x4F byte in your file.
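Encoding the same code point with both byte orders makes the difference visible; a quick sketch:

>>> u'\u4f60'.encode('utf-16-le')   # least significant byte first: 0x60, then 0x4F
'`O'
>>> u'\u4f60'.encode('utf-16-be')   # most significant byte first: 0x4F, then 0x60
'O`'

The content is identical; only the order of the two bytes within each 16-bit unit changes.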
Note that Python can read either big- or little-endian UTF-16 just fine; you can pick the endianness explicitly if there is no indicator character at the start (the Byte Order Mark, or BOM):
>>> '`O\n\x00'.decode('utf-16')
u'\u4f60\n'
>>> '`O\n\x00'.decode('utf-16-le')
u'\u4f60\n'
>>> 'O`\x00\n'.decode('utf-16-be')
u'\u4f60\n'
In the last example the bytes have been reversed, and decoded as big-endian.
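When a BOM is present, the plain utf-16 codec uses it to pick the byte order and strips it from the decoded result; a small sketch (the encoded bytes shown here assume a little-endian machine, which writes the 0xFF 0xFE mark):

>>> u'\u4f60\n'.encode('utf-16')    # BOM followed by little-endian data on this machine
'\xff\xfe`O\n\x00'
>>> '\xff\xfe`O\n\x00'.decode('utf-16')
u'\u4f60\n'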