UnicodeEncodeError in python3 when redirection is used

What I want to do: extract text information from a pdf file and redirect that to a txt file.

What I did:

pip install pdfminer

pdf2txt.py file.pdf > output.txt

What I got:

UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 0: illegal multibyte sequence

My observation:

\u2022 is the bullet point character (•).

pdf2txt.py works well without redirection: the bullet point character is written to stdout without any error.

My question:

Why does redirection cause a Python error? As far as I know, redirection is the operating system's job, and it simply copies the output after the program has finished.

How can I fix this error? I cannot make any modifications to pdf2txt.py, as it's not my code.

Reikoreilly answered 17/1, 2020 at 0:14 Comment(6)
Python needs to know what encoding to use for output. It can choose a different encoding depending on whether the output is going to a terminal or a file. - Iluminadailwain
OK, thank you Mark. Any suggestions on how to fix it? - Reikoreilly
I think there's an environment variable that affects it, but I don't have time now to look it up. - Iluminadailwain
It's fine, I can wait for other people to help me. Thanks a lot for answering me. - Reikoreilly
Normally Python uses the terminal's encoding to encode text before sending it to the terminal, but when you redirect the output it can't get the encoding from the terminal. You would have to set the encoding manually in the Python script, probably in every print(). - Caraway
BTW: searching Google for "python redirect utf-8", I found "UnicodeDecodeError when redirecting to file" on Stack Overflow. Use Google to find more. - Caraway

Redirection causes an error because the default encoding used by Python does not support one of the characters you're trying to output. In your case you're trying to output the bullet character using the GBK codec. This probably means you're using a Chinese version of Windows.

Python 3.6 or later works fine when writing to the terminal window on Windows, because it sends Unicode text to the console directly and bypasses the code-page encoding entirely. Only when the output is redirected to a file must the Unicode text be encoded to a byte stream, and that is where the default codec (GBK in your case) comes in.
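To see the encoding step in isolation, here is a minimal sketch that tries to encode the bullet character with both codecs. The helper name `can_encode` is made up for illustration. Going by the error message in the question, the GBK attempt should fail while UTF-8 succeeds.

```python
BULLET = "\u2022"  # the character named in the error message


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be represented in `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Print only ASCII so the demo itself cannot hit the same error.
    for name in ("gbk", "utf-8"):
        print(name, can_encode(BULLET, name))
```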

You can set the environment variable PYTHONIOENCODING to change the encoding used for stdin, stdout, and stderr. With UTF-8, any Unicode character can be encoded. In a Windows Command Prompt:

set PYTHONIOENCODING=utf-8
pdf2txt.py file.pdf > output.txt
Iluminadailwain answered 20/1, 2020 at 20:11 Comment(1)
Yeah, true, I'm using a Chinese version of Windows. Your suggestion works perfectly for me, thank you very much. - Reikoreilly
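If the command is launched from another Python program rather than typed at the prompt, the same variable can be passed through the child's environment. The following is only a sketch under that assumption: `run_with_utf8_stdio` is a hypothetical helper, and the child here is a one-line Python command standing in for pdf2txt.py.

```python
import os
import subprocess
import sys


def run_with_utf8_stdio(args):
    """Run a command with PYTHONIOENCODING=utf-8 and return its raw stdout bytes.

    Hypothetical helper: the child's stdout is a pipe, so the child must
    encode its text, just as it does when redirected to a file.
    """
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    result = subprocess.run(args, env=env, stdout=subprocess.PIPE, check=True)
    return result.stdout


if __name__ == "__main__":
    # Stand-in for pdf2txt.py: a child process that prints the bullet character.
    child = [sys.executable, "-c", "print('\\u2022')"]
    data = run_with_utf8_stdio(child)
    # Compare against the UTF-8 bytes of the bullet, ignoring the line ending.
    print(data.strip() == "\u2022".encode("utf-8"))
```

The returned bytes can then be written to output.txt in binary mode, which avoids a second encoding step in the parent.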

You seem to have obtained Unicode characters from the raw bytes, but you still need to encode them. I recommend using UTF-8 encoding for txt files.

Making the encoding parameter explicit is probably what you want.

def gbk_to_utf8(source, target):
    """Re-encode a GBK text file as UTF-8."""
    with open(source, "r", encoding="gbk") as src, \
            open(target, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
Jacinda answered 17/1, 2020 at 3:4 Comment(2)
Thank you, but as I said, I can't modify that Python file... - Reikoreilly
@NevilleZong according to the question, you are running the Python source code directly. Not sure what prevents you from making a copy of pdf2txt.py and changing it. - Iluminadailwain

I'm not sure if this is a good idea, but I prompted an AI to fix this problem for me with the comment "set stdout encoding to utf-8", and it produced the following code, which seems to work if I run it near the beginning of my script:

import sys
sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf-8', buffering=1)
Provisory answered 19/3 at 20:29 Comment(0)
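For scripts you do control, Python 3.7 and later also offer `reconfigure()` on text streams, which changes the encoding in place without reopening the file descriptor. The sketch below is an alternative to the answer above, not a description of it. `force_utf8` is a hypothetical helper, and the demo uses an in-memory stream so it runs the same way on any system.

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 in place, if it supports reconfigure().

    Hypothetical helper; streams without reconfigure() are returned unchanged.
    """
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return stream


if __name__ == "__main__":
    # In-memory stand-in for a redirected stdout that defaults to GBK.
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding="gbk")
    force_utf8(stream)
    stream.write("\u2022")
    stream.flush()
    print(buf.getvalue())
```

In a real script the call would be `force_utf8(sys.stdout)` near the top, after `import sys`.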
