decoding shift-jis: "illegal multibyte sequence"

2

8

I'm trying to decode a shift-jis encoded string, like this:

string.decode('shift-jis').encode('utf-8')

to be able to view it in my program.

When I come across 2 shift-jis characters, in hex "0x87 0x54" and "0x87 0x55", I get this error:

UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 12-13: illegal multibyte sequence

But I'm sure they are valid shift-jis characters: http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml

I've also noticed that those characters appear as black boxes in my shift-jis text editor, which means they are not recognized. So there's something special about these two chars that made my editor and Python decoder fail. Help?

(sorry, I couldn't post an example string because when those characters are present, it doesn't get added to the clipboard from there onward and also gets converted to unicode automatically. I posted the hex values for them though.)

Bircher answered 18/7, 2011 at 5:44 Comment(0)
10

Multiple versions of Shift JIS exist. The shift_jis codec is JIS X 0208, whereas that table is JIS X 0213, corresponding to the shift_jisx0213 codec.

>>> u'⑲⑳Ⅰ'.encode('shift_jisx0213')
'\x87R\x87S\x87T'
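
In Python 3, where bytes and text are separate types, the same check can be run directly on the two byte pairs from the question. This is only a minimal sketch: the helper name `try_decode` is made up for illustration, and `cp932` is included because the comments and the other answer suggest it for Windows-generated text.

```python
# The two byte pairs from the question: 0x87 0x54 and 0x87 0x55.
raw = b"\x87\x54\x87\x55"


def try_decode(data, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# Try each candidate codec on the same bytes.
results = {
    codec: try_decode(raw, codec)
    for codec in ("shift_jis", "shift_jisx0213", "cp932")
}

for codec, text in results.items():
    # ascii() keeps the output printable on any console.
    print(codec, ascii(text))
```

The plain `shift_jis` codec rejects these bytes, while the other two accept them.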
Cellulosic answered 18/7, 2011 at 6:2 Comment(3)
Add some information on Shift-JIS encoding. - Necrophilia
It basically works. However, if the source text is Shift-JIS generated by Windows and includes 0x80, try 'cp932' instead. - Sadiras
Actually, very little Japanese text uses 'Shift_JISX0213'. It is not used in Windows (ja-JP), which is the last popular environment that reads and writes Shift_JIS. See details in my answer. - Aerometry
4

You should never use shift_jisx0213. It has never been used for actual production purposes. Windows cannot handle it. The character set JIS X 0213 is used with Unicode in most cases but not with Shift_JIS encoding.

Use 'cp932' (in Python 3).

./sjis.txt contains

5c  7e  87  52  87  53  87  54  87  8a  fa  b1  fb  50  fb  fc

(These bytes are the characters \~⑲⑳Ⅰ㈱﨑瀨髙, saved on Windows 10.)

>>> import codecs
>>> codecs.open('sjis.txt',"rb",'shift_jis').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/codecs.py", line 700, in read
    return self.reader.read(size)
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0x87 in position 2: illegal multibyte sequence
>>> codecs.open('sjis.txt',"rb",'shift_jisx0213').read()
'¥‾⑲⑳Ⅰ㈱郫鍚騠'
>>> codecs.open('sjis.txt',"rb",'cp932').read()
'\\~⑲⑳Ⅰ㈱﨑瀨髙'

shift_jisx0213 decodes the first two symbols incorrectly (\ and ~ come out as ¥ and ‾), as well as the last three kanji.
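
For the original goal of getting UTF-8 out of such a file, a rough sketch with the built-in `open()` and `encoding='cp932'` could look like the following. It assumes the file holds exactly the bytes listed above; the temporary directory and the variable names are only there to make the example self-contained.

```python
import os
import tempfile

# The bytes listed above for sjis.txt (fromhex ignores the spaces).
SJIS_BYTES = bytes.fromhex("5c 7e 87 52 87 53 87 54 87 8a fa b1 fb 50 fb fc")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sjis.txt")

    # Recreate the sample file as raw bytes.
    with open(path, "wb") as f:
        f.write(SJIS_BYTES)

    # Read it back as text, decoding with the Windows code page 932 codec.
    with open(path, encoding="cp932") as f:
        text = f.read()

# Re-encode as UTF-8, which is what the question was after.
utf8 = text.encode("utf-8")

# ascii() keeps the output printable on any console.
print(ascii(text))
print(len(text), "characters,", len(utf8), "UTF-8 bytes")
```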

Aerometry answered 22/3, 2019 at 11:59 Comment(0)
