UnicodeDecodeError when redirecting to file

I run this snippet twice, in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py >out.txt:

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what's going on behind the curtain in both cases?

Argue answered 28/12, 2010 at 11:24 Comment(3)
This answer might be of help too.Harless
When I try to replicate your finding, I get a UnicodeEncodeError, not a UnicodeDecodeError. gist.github.com/jaraco/12abfc05872c65a4f3f6cd58b6f9be4dJosiejosler
Try this answer,three solutions: https://mcmap.net/q/25733/-error-occurs-when-trying-to-redirect-python-utf-8-stdout-to-a-file-on-windowsCrepe
O
258

The key to such encoding problems is to understand that there are, in principle, two distinct concepts of "string": (1) a string of characters, and (2) a string/array of bytes. This distinction was mostly ignored for a long time because of the historical ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes). The relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable: most programs could ignore the existence of multiple encodings as long as the text they produced remained on the same operating system, and they simply treated text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:

  1. Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. "Characters" for machines also include "drawing instructions" like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.

  2. On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly "understood" (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression "Unicode encoding" as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).

In summary, computers need to internally represent characters with bytes, and they do so through two operations:

Encoding: characters → bytes

Decoding: bytes → characters

Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
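The two operations can be tried directly in Python 3. This is only a minimal sketch: the sample word and the variable names are illustrative choices, not something from the question.

```python
import unicodedata

text = "caf\u00e9"  # "café", with é stored as a single code point

# Encoding: characters -> bytes. The bytes depend on the encoding chosen.
utf8_bytes = text.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'

# Decoding: bytes -> characters. It must use the encoding that produced the bytes.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text

# ASCII cannot represent é, so encoding to ASCII fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The same text can also be written as a base letter plus a combining accent.
decomposed = unicodedata.normalize("NFD", text)  # "cafe" followed by U+0301

print(ascii_ok, len(text), len(decomposed))  # False 4 5
```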

Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python's universal newline file reading mode).


Some more information on Unicode, characters and code points, if you are interested:

Now, what I have called "character" above is what Unicode calls a "user-perceived character". A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called "code points"; these code points can be combined together to form a "grapheme cluster". Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them "Unicode strings" (like in Python 2).

While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python's \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).

This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
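The Korean example can be checked with the standard unicodedata module, which can compose the three code points into the single-code-point form (Python 3; the variable names are illustrative):

```python
import unicodedata

s = "\u1100\u1161\u11a8"  # three conjoining jamo code points, displayed as one syllable
composed = unicodedata.normalize("NFC", s)  # canonical composition

# len() counts code points, not user-perceived characters.
print(len(s), len(composed), composed == "\uac01")  # 3 1 True
```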

In Python 2, Unicode strings are called… "Unicode strings" (unicode type, literal form u"…"), while byte arrays are "strings" (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called "strings" (str type, literal form "…"), while byte arrays are "bytes" (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
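A short Python 3 sketch of the same distinction, using the snake character written as an escape so that the source file encoding does not matter. Note that indexing a bytes object in Python 3 yields an integer rather than a one-byte string.

```python
snake = "\U0001F40D"             # one code point (the snake emoji)
encoded = snake.encode("utf-8")  # its UTF-8 representation: four bytes

print(len(snake), len(encoded))  # 1 4
print(snake[0] == snake)         # True: the first and only code point
print(hex(encoded[0]))           # 0xf0: the first byte, as an integer
```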

With these few key points, you should be able to understand most encoding related questions!


Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:

% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8

If your input characters can be encoded with the terminal's encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).

If your input characters cannot be encoded with the terminal's encoding, then it means that the terminal is not configured for displaying these characters. Python will complain (in Python with a UnicodeEncodeError since the character string cannot be encoded in a way that suits your terminal). The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user's terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
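This failure can be reproduced without a real terminal by writing to an in-memory stream that has a fixed encoding. The sketch below is Python 3, and try_print is a hypothetical helper introduced only for illustration.

```python
import io


def try_print(text, encoding):
    """Write text to an in-memory text stream that uses the given encoding.

    Returns the bytes that were written, or None if the text cannot be
    encoded with that encoding.
    """
    raw = io.BytesIO()
    # newline="" disables newline translation so the bytes are predictable.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return None
    return raw.getvalue()


sample = "\u0bc3\u1451"  # two of the characters from the question

print(try_print(sample, "utf-8"))  # the UTF-8 bytes
print(try_print(sample, "ascii"))  # None: ASCII cannot represent them
```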

However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):

% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8

The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:

% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8
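The same check can be scripted from Python 3 by starting a child interpreter whose stdout is a pipe. This is a sketch: child_stdout_encoding is a hypothetical helper, and the exact spelling of the reported name (for example "latin-1" versus "iso8859-1") may vary, which is why the last line normalizes it with codecs.lookup.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Return sys.stdout.encoding as reported by a child Python writing to a pipe."""
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # start from a clean setting
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


reported = child_stdout_encoding("latin-1")
print(codecs.lookup(reported).name)  # normalized codec name
```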

If printing to a terminal does not produce what you expect, you can check that the code points you entered manually are correct; for instance, your first character (\u001A) is a control character and is not printable.

At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:

import codecs
import locale
import sys

# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout) 

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

For Python 3, you can check one of the questions asked previously on StackOverflow.
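As a rough Python 3 counterpart of the wrapper above, a binary stream can be wrapped in an io.TextIOWrapper with an explicit encoding. This is a sketch under assumptions: utf8_writer is a hypothetical helper, UTF-8 is hardcoded instead of the locale's preferred encoding, and an in-memory buffer stands in for sys.stdout.buffer so that the result can be inspected.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that str objects written to it are encoded as UTF-8."""
    # newline="" writes "\n" unchanged instead of translating it per platform.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")


buffer = io.BytesIO()  # stand-in for sys.stdout.buffer
out = utf8_writer(buffer)
print("\u0bc3\u1451\U0001d10c", file=out)
out.flush()
written = buffer.getvalue()

print(written == "\u0bc3\u1451\U0001d10c\n".encode("utf-8"))  # True
```

In Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") changes the encoding of the existing stream without wrapping it.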

Oligosaccharide answered 28/12, 2010 at 12:44 Comment(11)
@singularity: Thanks! I added some info for Python 3.Oligosaccharide
Thank you, man! I needed this explanation for such a long time... It's a pity that I can give you only one upvote.Seemly
I am glad to have been of help, @m01! One of the motivations for writing this answer was that there were many pages on the web about Unicode and Python, but I found that despite being interesting, they never completely allowed me to solve concrete encoding problems… I truly believe that by keeping in mind the principles found in this answer and taking the time to use them when solving concrete encoding problems helps a lot.Oligosaccharide
This is hands down the best explanation of unicode and python ever. The Python Unicode HOWTO should be replaced with this.Tedtedd
Here, let me draw the “right-to-left override” character on this chalkboard…Siloa
@icktoofay: Interesting point, thank you. This Unicode character is nonetheless an instruction about how to draw characters, though. I amended my answer to reflect the subtlety that you described better than with the "etc." that was used instead before.Oligosaccharide
it is very good explanation but it seems you've mixed user-perceived characters (grapheme clusters in Unicode) that you call just "characters" and Unicode codepoints (a single user-perceived character may be represented using multiple Unicode codepoints). str type in Python 3 represents an immutable sequence of Unicode codepoints, not user-perceived characters. Unrelated: for people who landed here due to the question title, you could put PYTHONIOENCODING example near the top of your answer. Also, OS may provide Unicode API e.g., WriteConsoleW() on Windows (no encoding is necessary).Vaunt
@J.F.Sebastian Good points. I will include the distinction between user-perceived characters and Unicode codepoints.Oligosaccharide
this tip saved me just when I was about to lose my sanity. I thought, it was my newly installed font !Obsess
For python3 and windows command line the trick was using setting the encoding before. i.e set PYTHONIOENCODING=utf-8:surrogateescape and then run the program. taken from https://mcmap.net/q/25728/-how-to-set-sys-stdout-encoding-in-python-3Cosmonautics
Except for the surrogateescape option, this is precisely illustrated at the end of the second part, right?Oligosaccharide
A
21

Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal, Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe, Python 2 defaults to the 'ascii' encoding unless explicitly told otherwise (Python 3 uses a different default, such as UTF-8). Python can be told which encoding to use for redirected output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so that the correct encoding is known.

In your case you've printed 4 uncommon characters that your terminal didn't support in its font. Here are some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).

Example 1

Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could support characters in the source that my terminal could not. The encoding is printed to stderr so that it can still be seen when stdout is redirected to a file.

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

cp437
αßΓπΣσµτΦΘΩδ∞φ

Python correctly determined the encoding of the terminal.

Output (redirected to file)

None
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)

Python could not determine the encoding (None), so it used the 'ascii' default. ASCII can only encode the first 128 code points of Unicode.

Output (redirected to file, PYTHONIOENCODING=cp437)

cp437

and my output file was correct:

C:\>type out.txt
αßΓπΣσµτΦΘΩδ∞φ

Example 2

Now I'll throw in a character in the source that isn't supported by my terminal:

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

cp437
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>

My terminal didn't understand that last Chinese character.

Output (run directly, PYTHONIOENCODING=437:replace)

cp437
αßΓπΣσµτΦΘΩδ∞φ?

Error handlers can be specified with the encoding. In this case unknown characters were replaced with ?. ignore and xmlcharrefreplace are some other options. When using UTF8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
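The same error handlers can be passed directly to str.encode, which makes their effect easy to compare. The sketch below is Python 3 and uses a shortened, illustrative sample string written with escapes.

```python
text = "\u03b1\u03c0\u9a6c"  # α and π (both in cp437), then 马 (not in cp437)

replaced = text.encode("cp437", "replace")           # unknown -> ?
ignored = text.encode("cp437", "ignore")             # unknown -> dropped
xmlref = text.encode("cp437", "xmlcharrefreplace")   # unknown -> &#NNNNN;

# Decode again to show what a cp437 terminal would display.
print(replaced.decode("cp437"))  # απ?
print(ignored.decode("cp437"))   # απ
print(xmlref.decode("cp437"))    # απ&#39532;
```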

Ardath answered 29/12, 2010 at 2:24 Comment(5)
It is not exactly true that "When writing to a file or pipe Python defaults to the 'ascii' encoding unless explicitly told otherwise.". In fact, Python 3 uses UTF-8, on Mac OS X/Fink.Oligosaccharide
Yes, Python 3 defaults to 'utf8', but based on the OP's sample, he's using Python 2.X, which defaults to 'ascii'.Ardath
I could not get correct output by manipulating PYTHONIOENCODING. Doing print string.encode("UTF-8") as suggested by @Ismail worked for me.Crowd
you can see Chinese characters if your font supports them even if chcp codepage does not support them. To avoid UnicodeEncodeError: 'charmap', you could install win-unicode-console package.Vaunt
My problem is that python-gitlab CLI prints Chinese characters well in cmd but the characters are garbage after being redirected into files. PYTHONIOENCODING=utf-8 solves the problem.Unapt
M
12

Encode it while printing

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")

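This is because when you run the script directly in a terminal, Python encodes the string with the terminal's encoding before writing it. When you pipe or redirect the output, Python 2 cannot detect an encoding and falls back to ASCII, which fails for these characters, so you have to encode manually when doing I/O.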
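Python can tell the two situations apart: a stream reports whether it is attached to a terminal and which encoding (if any) it was given. The Python 3 sketch below uses a hypothetical describe helper; falling back to the locale's preferred encoding is an assumption made here, as one alternative to hardcoding UTF-8.

```python
import io
import locale


def describe(stream):
    """Return (is_terminal, encoding) for a text stream.

    Falls back to the locale's preferred encoding when the stream
    does not report one.
    """
    is_terminal = stream.isatty()
    encoding = getattr(stream, "encoding", None) or locale.getpreferredencoding(False)
    return is_terminal, encoding


# An in-memory stream with an explicit encoding: not a terminal.
wrapped = io.TextIOWrapper(io.BytesIO(), encoding="cp437")
print(describe(wrapped))  # (False, 'cp437')
```

Calling describe(sys.stdout) gives a different result depending on whether the script runs in a terminal or has its output redirected.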

Messaline answered 28/12, 2010 at 11:30 Comment(10)
It still does not answer the question WTH is going on here. Why, out of the blue it decides to encode only when redirected, when this is supposed to be completely transparent to the process.Meuse
Why doesn't python encode it when performing redirection? Does python explicitly check and decide that it'll do things differently just to be difficult?Hutton
Shell intercepts the pipe, Python would have to check if stdout is a pipe.Messaline
Does Python even have a way to distinguish the two situations? I thought (until now...) that there's no way it can know.Argue
Python can check if the output is a terminal; if it's outputting to a pipe, then the terminal type will be "dumb". I guess "dumb" should tell you why Python doesn't try to do anything automatically in this case: it can fail.Messaline
@Ismail If I understand it correctly, quite the opposite is going on here: it tries to do something (and fails) when trying to output to the pipe.Meuse
@maksymko, no, it doesn't do anything when you pipe, so it's trying to interpret data it can't because it's not encoded. The problem here is that when it's outputting to a terminal it does the work for you.Messaline
@Ismail, ah, I think I understand it now, thanks. Still, pretty strange behavior, if you ask me.Meuse
@maksymko the rule of thumb is, always use UTF-8 internally and encode it when doing I/O.Messaline
it produces mojibake if the environment uses a character encoding that is incompatible with utf-8 (e.g., it is common on Windows). Don't hardcode the character encoding of your environment inside your script. Configure your locale, or PYTHONIOENCODING, or install win-unicode-console (Windows), or accept a command-line parameter (if you must).Vaunt
