Python 3: reading UCS-2 (BE) file

I can't decode UCS-2 BE files (legacy data) under Python 3.3 using the built-in open() function. The stack trace shows a UnicodeDecodeError raised from my readline() call, and I wasn't able to find an encoding name that specifies this format.

I'm on Windows 8; the terminal is set to code page 65001 and uses the 'Lucida Console' font.

Code snippet won't be of too much help, I guess:

def display_resource():
    f = open(r'D:\workspace\resources\JP.res', encoding=<??tried_several??>)
    while True:
        line = f.readline()
        if len(line) == 0:
            break

I'd appreciate any insight into this issue.

Piston asked 23/1, 2013 at 20:2 Comment(0)

UCS-2 is effectively UTF-16, at least for any code point that was assigned while the encoding was still called UCS-2.

Open it with encoding='utf16'. If there is no BOM (the byte order mark: 2 bytes at the start of the file, which for BE would be \xfe\xff), use encoding='utf_16_be' to force the byte order.
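As a rough sketch of the advice above, the BOM check and the fallback to an explicit byte order could be wrapped in a small helper. The function name open_utf16 and its default parameter are hypothetical, made up for this illustration; only open() and the codecs constants come from the standard library.

```python
import codecs


def open_utf16(path, default="utf_16_be"):
    """Open a UTF-16/UCS-2 text file for reading (hypothetical helper).

    If the file starts with a BOM, the generic 'utf_16' codec is used, which
    reads the byte order from the BOM and strips it. Otherwise the explicit
    byte order given in `default` is used.
    """
    with open(path, "rb") as raw:
        head = raw.read(2)  # a UTF-16 BOM is exactly two bytes
    has_bom = head in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
    return open(path, encoding="utf_16" if has_bom else default)
```

With this in place, the loop from the question could become `with open_utf16(r'D:\workspace\resources\JP.res') as f:` followed by `for line in f:`, which also closes the file when the block ends.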

Thorma answered 23/1, 2013 at 20:10 Comment(4)
Hello Martijn, I also thought UTF-16 should work (based on the same article you linked). It does work, but, just as with utf_16_be, every Japanese character shows up on screen as the same glyph: for example, "ブラウザー" becomes a row of identical, unreadable squares. I should have distinguished between reading the line and printing it. Is this also a limitation of the terminal? Going forward, if reading works and I can work with the strings, can I write them back to another UCS-2 file and get the right output in a UCS-2-capable editor? - Piston
It's a limitation of the terminal, I am afraid. Your font does not support those characters; you'll have to find a different font that does. Just because the terminal cannot display them doesn't mean that the data itself has been damaged, so yes, if you encode back to UTF-16 when you write to the file you can open it again with other tools. - Thorma
Just wanted to add that I found another limitation of Lucida Console, in case it helps someone in the future: when displaying Japanese, Chinese, Arabic, Russian, or Romanian characters, it sometimes repeats the last characters of a line, sometimes only the newline, other times as many as 7 to 8 characters. The behavior seems random. When the same lines are written to a file with the proper encoding (UTF-16 in my case), they show up correctly. - Piston
@elderelder: That'd be a Windows console or font problem indeed. - Thorma
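The point made in the comments above, that a console which cannot render the characters leaves the data itself intact, can be checked without relying on the font at all. The following is a minimal sketch under stated assumptions: the scratch file name is hypothetical, and the sample string is the one quoted in the comments. It prints an ASCII-escaped view of the string and then round-trips it through a big-endian UTF-16 file.

```python
import os
import tempfile

text = "ブラウザー\n"

# ascii() escapes every non-ASCII code point, so this prints on any console font.
print(ascii(text))

# Hypothetical scratch file, used only for this round-trip check.
path = os.path.join(tempfile.mkdtemp(), "roundtrip.res")

# The utf_16_be codec does not write a BOM by itself, so one is added explicitly.
with open(path, "w", encoding="utf_16_be", newline="") as out:
    out.write("\ufeff" + text)

# The generic utf_16 codec reads the byte order from the BOM and strips it.
with open(path, encoding="utf_16", newline="") as src:
    restored = src.read()

assert restored == text  # the data survived, whatever the console displayed
```

Whether a given editor needs the BOM is an assumption to verify against that tool; dropping the "\ufeff" prefix writes the same text without one.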
