How to read Chinese files?

I'm stuck with all this confusing encoding stuff. I have a file containing Chinese subtitles. I believe it is UTF-8, because opening it as UTF-8 in Notepad++ gives me a very good result. If I set GB2312, the Chinese part is still fine, but I see some UTF-8 sequences that are not converted.

The goal is to loop through the text in the file and count how many times the different chars come up.

import codecs
import os
import re

character_dict = {}
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        if "srt" in filename:
            f = codecs.open(filename, 'r', 'gb2312', errors='ignore')
            s = f.read()

            # deleting {}
            s = re.sub('{[^}]+}', '', s)
            # deleting every line that does not start with a chinese char
            s = re.sub(r'(?m)^[A-Z0-9a-z].*\n?', '', s)
            # delete non chinese chars
            s = re.sub(r'[\s\.A-Za-z0-9\?\!\\/\-\"\,\*]', '', s)
            #print s
            s = s.encode('gb2312')
            print s
            for c in s:
                #print c
                pass

This actually gives me the complete Chinese text. But when I print inside the loop at the bottom, I just get question marks instead of the single characters.

Also note that I said it is UTF-8, but I have to use gb2312 for the encoding and as the setting in my gnome-terminal. If I set it to UTF-8 in the code, I just get garbage, no matter whether I set my terminal to UTF-8 or gb2312. So maybe this file is not UTF-8 after all!?

In any case s contains the full Chinese text. Why can't I loop it?

Please help me to understand this. It is very confusing for me and the docs are getting me nowhere. And google just leads me to similar problems that somebody solves, but there is no explanation so far that helped me understand this.

Andes answered 7/4, 2016 at 20:51 Comment(6)
So is it gb2312 or UTF-8? If it's UTF-8, why don't you set that encoding in open() rather than gb2312? As it stands, this question doesn't make much sense. – Morava
It is not UTF-8, it is GB2312. Both UTF-8 and GB2312 are encodings, a way to encode characters to bytes. Are you perhaps confusing UTF-8 with the Unicode standard? – Ethben
Yes, @MartijnPieters – I know the difference, and I also know that they're encodings for different character sets. The OP seems to be using the two encodings/character sets interchangeably: "I actually believe it is UTF-8 because using this in Notepad++ gives me a very good result." – Morava
@JasonTS, you shouldn't change the encoding of your terminal without changing your locale. Python will use your locale to work out the encoding to use when calling print. – Morava
@AlastairMcCormack: My comment wasn't directed at you. :-) – Ethben
@MartijnPieters ooops, sorry :$ – Morava

gb2312 is a multi-byte encoding. If you iterate over a bytestring encoded with it, you will be iterating over the bytes, not over the characters you want to be counting (or printing). You probably want to do your iteration on the unicode string before encoding it. If necessary, you can encode the individual codepoints (characters) to their own bytestrings for output:

# don't do s = s.encode('gb2312')
for c in s:      # iterate over the unicode codepoints
    print c.encode('gb2312')  # encode them individually for output, if necessary
Dolores answered 7/4, 2016 at 21:5 Comment(2)
Thank you, it finally works! Can I use those as dict keys now? – Andes
Yes, you can use the c codepoints you get in the loop as dictionary keys. I suppose you could use encoded versions of them too, but I don't think there's ever going to be a good reason to do so. It's much better to use unicode objects for text everywhere in your program, except when you must encode things to make I/O work properly (e.g. when reading or writing files or network data, or printing to the console). – Dolores
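To make the dict-key idea from this comment concrete, here is a minimal counting sketch (written for Python 3, where str is already Unicode, so no codecs dance is needed; the count_characters name, collections.Counter, and the sample string are my additions, not from the original post):

```python
from collections import Counter

def count_characters(text):
    """Count how often each character (Unicode codepoint) occurs."""
    counts = Counter()
    for c in text:       # iterates over codepoints, not bytes
        counts[c] += 1   # codepoints work fine as dict keys
    return counts

counts = count_characters(u'\u4f60\u597d\u4f60')  # "ni hao ni"
print(counts[u'\u4f60'])  # -> 2
```

A plain dict with counts.get(c, 0) + 1 would work just as well; Counter only saves the boilerplate.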

You are printing individual bytes. GB2312 is a multi-byte encoding, and each codepoint uses 2 bytes. Printing those bytes individually won't produce valid output, no.
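To see the two-bytes-per-codepoint point, compare a one-character Unicode string with its GB2312 encoding (Python 3 shown; u'\u73b0' is the character that appears in the traceback quoted in the comments below):

```python
s = u'\u73b0'                 # a common Chinese character
encoded = s.encode('gb2312')  # GB2312 spends 2 bytes on every Chinese character

print(len(s))        # -> 1 (one codepoint)
print(len(encoded))  # -> 2 (two bytes)

# Iterating the bytestring yields the two halves separately; neither half
# is a valid character on its own, hence the question marks when printed.
for b in encoded:
    print(b)  # in Python 3 each b is an int; in Python 2, a 1-byte str
```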

The solution is to not encode from Unicode to bytes when printing. Loop over the Unicode string instead:

# deleting {}
s = re.sub('{[^}]+}', '', s)
# deleting every line that does not start with a chinese char
s = re.sub(r'(?m)^[A-Z0-9a-z].*\n?', '', s)
# delete non chinese chars
s = re.sub(r'[\s\.A-Za-z0-9\?\!\\/\-\"\,\*]', '', s)
#print s

# No `s.encode()`!
for char in s:
    print char

You could encode each character individually:

for char in s:
    print char.encode('gb2312')

But if you have your console / IDE / terminal correctly configured, you should be able to print directly without errors, especially since your print s.encode('gb2312') produces correct output.
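On the "correctly configured" point: print encodes Unicode text with whatever sys.stdout.encoding is, which Python derives from your locale. A small diagnostic sketch (Python 3 shown; the output depends entirely on your environment, so none is given here):

```python
import locale
import sys

# The encoding print() will use for text going to the terminal
print(sys.stdout.encoding)

# The encoding your locale advertises; the two should normally agree
print(locale.getpreferredencoding(False))
```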

You also appear to be confusing UTF-8 (an encoding) with the Unicode standard. UTF-8 can be used to represent Unicode data in bytes. GB2312 is an encoding too, and can be used to represent a (subset of) Unicode text in bytes.
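Both encodings round-trip the same Unicode text, just to different byte sequences; a quick Python 3 illustration (the sample string is my own):

```python
text = u'\u4e2d\u6587'  # "Chinese" (two characters)

utf8_bytes = text.encode('utf-8')  # UTF-8 uses 3 bytes per character for CJK
gb_bytes = text.encode('gb2312')   # GB2312 uses 2 bytes per character

print(len(utf8_bytes))  # -> 6
print(len(gb_bytes))    # -> 4

# Decoding each with its matching codec recovers the identical string
assert utf8_bytes.decode('utf-8') == gb_bytes.decode('gb2312') == text
```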

You may want to read up on Python and Unicode.

Ethben answered 7/4, 2016 at 21:1 Comment(3)
Thank you for your help. However, this results in "UnicodeEncodeError: 'ascii' codec can't encode character u'\u73b0' in position 0: ordinal not in range(128)". Any idea what the problem could be? – Andes
@JasonTS: What are you printing to? Your terminal appears to be incorrectly configured as only accepting ASCII, yet your print s.encode('gb2312') working suggests it accepts GB2312 instead. – Ethben
@JasonTS: you could encode manually by using print char.encode('gb2312'), but you'd be better off fixing your terminal locale. Or is this an IDE console, or Windows? – Ethben

© 2022 - 2024 — McMap. All rights reserved.