Setting the correct encoding when piping stdout in Python
A

12

374

When piping the output of a Python program, the Python interpreter gets confused about encoding and sets it to None. This means a program like this:

# -*- coding: utf-8 -*-
print u"åäö"

will work fine when run normally, but fail with:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)

when used in a pipe sequence.

What is the best way to make this work when piping? Can I just tell it to use whatever encoding the shell/filesystem/whatever is using?

The suggestions I have seen so far are to modify your site.py directly, or to hardcode the default encoding using this hack:

# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
print u"åäö"

Is there a better way to make piping work?

Amin answered 29/1, 2009 at 16:57 Comment(7)
See also #4546161Pontonier
If you have this problem on windows, you can also run chcp 65001 before executing your script. This can have issues, but it often helps, and doesn't require a lot of typing (less than set PYTHONIOENCODING=utf_8).Pearsall
The chcp command is not the same as setting PYTHONIOENCODING. I think chcp is just configuration for the terminal itself and has nothing to do with writing to a file (which is what you are doing when piping stdout). Try setx PYTHONIOENCODING utf-8 to make it permanent if you want to save typing.Connolly
I faced a somewhat related issue, and found a solution here --> #48783029Sharpset
@Tomasz, Great! Your environment variable, is the simplest and thus the best solution to overcoming this annoying thing!Agility
For Python 3, see Printing to stdout with encoding in Python 3 - Stack OverflowLoggins
B
171

Your code works when run in a script because Python encodes the output to whatever encoding your terminal application is using. If you are piping, you must encode it yourself.

A rule of thumb is: Always use Unicode internally. Decode what you receive, and encode what you send.

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

Another didactic example is a Python program to convert between ISO-8859-1 and UTF-8, making everything uppercase in between.

import sys
for line in sys.stdin:
    # Decode what you receive:
    line = line.decode('iso8859-1')

    # Work with Unicode internally:
    line = line.upper()

    # Encode what you send:
    line = line.encode('utf-8')
    sys.stdout.write(line)
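
For readers on Python 3, the same decode/process/encode pattern could look roughly like the sketch below. It assumes Python 3, where sys.stdin.buffer and sys.stdout.buffer give byte-level access to the standard streams; the names transcode_upper and main are made up for illustration.

```python
import sys


def transcode_upper(raw_line, source="iso8859-1", target="utf-8"):
    """Decode bytes, uppercase the text, and re-encode it (hypothetical helper)."""
    text = raw_line.decode(source)   # decode what you receive
    text = text.upper()              # work with Unicode internally
    return text.encode(target)       # encode what you send


def main():
    # Iterate over raw byte lines and write raw bytes back out.
    for raw_line in sys.stdin.buffer:
        sys.stdout.buffer.write(transcode_upper(raw_line))
```

Calling main() at the end of a script would turn this into a filter that can sit in the middle of a pipe.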

Setting the system default encoding is a bad idea, because some modules and libraries you use can rely on the fact it is ASCII. Don't do it.

Bloodstained answered 29/1, 2009 at 18:3 Comment(9)
The problem is that the user doesn't want to specify the encoding explicitly. He just wants to use Unicode for IO. And the encoding he uses should be the one specified in the locale settings, not in the terminal application settings. AFAIK, Python 3 uses the locale encoding in this case. Changing sys.stdout seems like a more pleasant way.Passerine
Encoding / decoding every string explicitly is bound to cause bugs when an encode or decode call is missing or added once too often somewhere. The output encoding can be set when output is a terminal, so it can be set when output is not a terminal. There is even a standard LC_CTYPE environment variable to specify it. It is a bug in Python that it doesn't respect this.Forbis
@Rasmus Kaj: If you consistently use a defined function for output you can be sure that it won't be missing or duplicated. Output encoding can't be "set". Accepting only unicode on sys.stdout (by replacing it with codecs.getwriter) breaks a lot of libraries in practice.Bloodstained
This answer is wrong. You should not be manually converting on each input and output of your program; that's brittle and completely unmaintainable.Aweinspiring
@Glenn Maynard : so what is IYO the right answer? It's more helpful to tell us than just say 'This answer is wrong'Stigmatize
What libraries rely on stdout only accepting ASCII? Considering the amount of data that is not 7-bit ASCII, that seems to be a very bad idea.Haematopoiesis
@ErikJohansson: it is not about stdout accepting whatever encoding. sys.getdefaultencoding() is used in many places e.g., "а" + u"a" expression uses it. Changing sys.getdefaultencoding() may introduce data-dependent bugs that might corrupt your data silently.March
@smci: the answer is don't modify your script, set PYTHONIOENCODING if you are redirecting script's stdout in Python 2.March
@Glenn Maynard Actually decoding and encoding is a good practice, from the python doc: "Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end."Epaulet
L
168

First, regarding this solution:

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

It's not practical to explicitly print with a given encoding every time. That would be repetitive and error-prone.

A better solution is to change sys.stdout at the start of your program, to encode with a selected encoding. Here is one solution I found on Python: How is sys.stdout.encoding chosen?, in particular a comment by "toka":

import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
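
On Python 3, a rough counterpart of this idea is to wrap the underlying binary buffer in io.TextIOWrapper rather than using codecs.getwriter. This is only a sketch under that assumption; the helper name utf8_text_stream is hypothetical.

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


# Possible use at program start (Python 3 only):
#     import sys
#     sys.stdout = utf8_text_stream(sys.stdout.buffer)
```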
Loggins answered 23/7, 2009 at 2:5 Comment(6)
unfortunately, changing sys.stdout to accept only unicode breaks a lot of libraries that expect it to accept encoded bytestrings.Bloodstained
nosklo: Then how can it work reliably and automatically when output is a terminal?Forbis
@Rasmus Kaj: just define your own unicode printing function and use it every time you want to print unicode: def myprint(unicodeobj): print unicodeobj.encode('utf-8') -- you automatically detect terminal encoding by inspecting sys.stdout.encoding, but you should consider the case where it is None (i.e. when redirecting output to a file) so you need a separate function anyway.Bloodstained
@nosklo: This does not make sys.stdout accept only Unicode. You can pass both str and unicode to a StreamWriter.Aweinspiring
And it'll screw any readline capabilities of pdb or I guess IPython as @JohnChain stated it.Zedekiah
I assume this answer was intended for python2. Be careful with this in code which is intended to support both python2 and python3. For me it's breaking stuff when run under python3.Ada
I
141

You may want to try changing the environment variable "PYTHONIOENCODING" to "utf_8". I have written a page on my ordeal with this problem.

Tl;dr of the blog post:

import sys, locale, os
print(sys.stdout.encoding)
print(sys.stdout.isatty())
print(locale.getpreferredencoding())
print(sys.getfilesystemencoding())
print(os.environ["PYTHONIOENCODING"])
print(chr(246), chr(9786), chr(9787))

gives you

utf_8
False
ANSI_X3.4-1968
ascii
utf_8
ö ☺ ☻
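
To see the effect of the variable on piped output without touching the shell, one option is to start a child interpreter with PYTHONIOENCODING set and capture its stdout through a pipe. The sketch below assumes Python 3; run_piped and CHILD_CODE are illustrative names, not part of any library.

```python
import os
import subprocess
import sys

# The child prints three non-ASCII characters; the escapes keep the command line ASCII-only.
CHILD_CODE = r"print('\u00e5\u00e4\u00f6')"


def run_piped(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and return its stdout bytes."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        stdout=subprocess.PIPE,  # stdout is a pipe here, not a terminal
        check=True,
    )
    return result.stdout.strip()
```

Comparing run_piped("utf-8") with run_piped("latin-1") shows the same text arriving as different byte sequences.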
Impenetrability answered 26/10, 2010 at 20:30 Comment(7)
Changing sys.stdout.encoding maybe does not work, but changing sys.stdout does work: sys.stdout = codecs.getwriter(encoding)(sys.stdout). This can be done from within the python program, so the user is not forced to set an env variable.Glycol
@jeckyll2hide: PYTHONIOENCODING does work. How bytes are interpreted as text is defined by the user environment. Your script shouldn't make assumptions about, or dictate, which character encoding the user environment uses. If Python doesn't pick up the settings automatically, then PYTHONIOENCODING can be set for your script. You shouldn't need it unless the output is redirected to a file/pipe.March
+1. Honestly I think it's a Python bug. When I redirect output I want those same bytes that would be on the terminal, but in a file. Maybe it's not for everyone but it's a good default. Crashing hard with no explanation on a trivial operation that usually "just works" is a bad default.Calvinism
@SnakE: the only way I can rationalize why Python's implementation intentionally would enforce an iron-clad and permanent choice of encoding on stdout at startup time, might be in order to prevent any badly encoded stuff coming out later on. Or changing it is just an unimplemented feature, in which case allowing the user to change it later on would be a reasonable Python feature request.Impenetrability
@Impenetrability My point is, behavior of my program should not depend on whether it is redirected or not---unless I really want it, in which case I implement it myself. Python behaves contrary to my experience with any other console tools. This violates the least surprise principle. I consider this a design flaw unless there is a very strong rationale.Calvinism
@SnakE: Yeah you have a good point. I looked at stackoverflow.com/questions/4545661 and an informative example is that I get different outputs for python -c "import sys; print(sys.stdout.encoding)" and python -c "import sys; print(sys.stdout.encoding)" I read about the isatty function and that was clarifying too; I guess some programs benefit a lot from knowing what kind of output they have, but the flipside is that there's more state than ideal sometimes in there.Impenetrability
This answer solved it for me. Thanks to it, I noticed that it's the caller script/shell of the Python script which should set UTF8. In my case it was a shell_exec() from PHP and putenv('LANG=en_US.UTF-8'); solved it.Quianaquibble
A
64
export PYTHONIOENCODING=utf-8

does the job, but it can't be set from within Python itself.

What we can do is check whether an encoding is set and, if not, tell the user to set the variable before calling the script:

import sys

if __name__ == '__main__':
    if sys.stdout.encoding is None:
        print >> sys.stderr, "Please set PYTHONIOENCODING=UTF-8 (for example: export PYTHONIOENCODING=UTF-8) when writing to stdout."
        sys.exit(1)
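
Instead of exiting, a script could also pick a fallback encoding whenever the stream reports none. The following is a minimal sketch of that idea, not something taken from this answer; stream_encoding is a hypothetical helper and the UTF-8 fallback is an assumption.

```python
import sys


def stream_encoding(stream, fallback="utf-8"):
    """Return the stream's encoding, or `fallback` when none is reported."""
    return getattr(stream, "encoding", None) or fallback


# Example: decide how to encode output for the current stdout.
chosen = stream_encoding(sys.stdout)
```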

Update in reply to the comment: the problem only exists when piping stdout. I tested on Fedora 25 with Python 2.7.13:

python --version
Python 2.7.13

cat b.py

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import sys

print sys.stdout.encoding

running ./b.py

UTF-8

running ./b.py | less

None
Annelleannemarie answered 15/6, 2011 at 18:40 Comment(3)
That check doesn't work in Python 2.7.13. sys.stdout.encoding is automatically set based on the LC_CTYPE locale value.Jansenism
mail.python.org/pipermail/python-list/2011-June/605938.html the example there still works, i.e. when you use ./a.py > out.txt, sys.stdout.encoding is NoneStringboard
I had a similar problem with a sync script from Backblaze B2 and export PYTHONIOENCODING=utf-8 solved my problem. Python 2.7 on Debian Stretch.Table
B
8

I'm surprised this answer has not been posted here yet

Since Python 3.7 you can change the encoding of standard streams with reconfigure():

sys.stdout.reconfigure(encoding='utf-8')

You can also modify how encoding errors are handled by adding an errors parameter.
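
As a small illustration of both parameters, the sketch below applies reconfigure() to an in-memory io.TextIOWrapper instead of the real sys.stdout, so the resulting bytes can be inspected. It assumes Python 3.7 or later; demo_reconfigure is an illustrative name.

```python
import io


def demo_reconfigure():
    """Write the same text under two stream configurations and return the raw bytes."""
    raw = io.BytesIO()
    # newline="\n" disables newline translation so the bytes are platform-independent.
    stream = io.TextIOWrapper(raw, encoding="ascii", newline="\n")

    # Switch the encoding so non-ASCII text can be written.
    stream.reconfigure(encoding="utf-8")
    stream.write("åäö\n")

    # Switch back to ASCII, but escape unencodable characters instead of raising.
    stream.reconfigure(encoding="ascii", errors="backslashreplace")
    stream.write("åäö\n")

    stream.flush()
    return raw.getvalue()
```

sys.stdout accepts the same call, since it is normally an io.TextIOWrapper as well.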

https://mcmap.net/q/25728/-how-to-set-sys-stdout-encoding-in-python-3

Byrd answered 21/3, 2022 at 21:27 Comment(0)
T
7

Since Python 3.7, we can use Python UTF-8 Mode, by using command line option -X utf8:

 python -X utf8 testzh.py

The script testzh.py contains

print("Content-type: text/html; charset=UTF-8\n") 
print("地球你好!")

To use Python as a CGI script handler in IIS (Internet Information Services) on Windows 10, we set the Executable as follows:

"C:\Program Files\Python39\python.exe" -X utf8 %s

With this setting, Chinese ideograms display as expected in the Microsoft Edge browser. Otherwise, an error occurs.

Please see https://docs.python.org/3/library/os.html#utf8-mode
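
One way to check whether UTF-8 Mode is actually in effect is to look at sys.flags.utf8_mode in a child interpreter whose stdout is a pipe. This is a sketch assuming Python 3.7+; probe_utf8_mode is a made-up helper name, and the environment variables PYTHONIOENCODING and PYTHONUTF8 are removed so that they do not interfere.

```python
import os
import subprocess
import sys

CHILD_CODE = "import sys; print(sys.flags.utf8_mode, sys.stdout.encoding)"


def probe_utf8_mode(enabled):
    """Return [utf8_mode flag, stdout encoding] as reported by a child interpreter."""
    # Drop variables that would override the command-line option.
    env = {k: v for k, v in os.environ.items()
           if k not in ("PYTHONIOENCODING", "PYTHONUTF8")}
    flag = "utf8" if enabled else "utf8=0"
    result = subprocess.run(
        [sys.executable, "-X", flag, "-c", CHILD_CODE],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").split()
```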

Trimerous answered 11/10, 2021 at 11:48 Comment(0)
S
5

I had a similar issue last week. It was easy to fix in my IDE (PyCharm).

Here was my fix:

Starting from PyCharm menu bar: File -> Settings... -> Editor -> File Encodings, then set: "IDE Encoding", "Project Encoding" and "Default encoding for properties files" ALL to UTF-8 and she now works like a charm.

Hope this helps!

Shelly answered 21/6, 2015 at 2:54 Comment(0)
T
4

An arguably sanitized version of Craig McQueen's answer.

import sys, codecs
class EncodedOut:
    def __init__(self, enc):
        self.enc = enc
        self.stdout = sys.stdout
    def __enter__(self):
        if sys.stdout.encoding is None:
            w = codecs.getwriter(self.enc)
            sys.stdout = w(sys.stdout)
    def __exit__(self, exc_ty, exc_val, tb):
        sys.stdout = self.stdout

Usage:

with EncodedOut('utf-8'):
    print u'ÅÄÖåäö'
Tie answered 13/4, 2015 at 10:24 Comment(0)
C
3

I just thought I'd mention something here which I had to spend a long time experimenting with before I finally realised what was going on. This may be so obvious to everyone here that they haven't bothered mentioning it. But it would've helped me if they had, so on that principle...!

NB: I am using Jython specifically, v 2.7, so just possibly this may not apply to CPython...

NB2: the first two lines of my .py file here are:

# -*- coding: utf-8 -*-
from __future__ import print_function

The "%" (AKA "interpolation operator") string construction mechanism causes ADDITIONAL problems too... If the default encoding of the "environment" is ASCII and you try to do something like

print( "bonjour, %s" % "fréd" )  # Call this "print A"

You will have no difficulty running in Eclipse... In a Windows CLI (DOS window) you will find that the encoding is code page 850 (my Windows 7 OS) or something similar, which can handle European accented characters at least, so it'll work.

print( u"bonjour, %s" % "fréd" ) # Call this "print B"

will also work.

If, OTOH, you direct to a file from the CLI, the stdout encoding will be None, which will default to ASCII (on my OS anyway), which will not be able to handle either of the above prints... (dreaded encoding error).

So then you might think of redirecting your stdout by using

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

and try running in the CLI piping to a file... Very oddly, print A above will work... But print B above will throw the encoding error! The following will however work OK:

print( u"bonjour, " + "fréd" ) # Call this "print C"

The conclusion I have come to (provisionally) is that if a string which is specified to be a Unicode string using the "u" prefix is submitted to the %-handling mechanism it appears to involve the use of the default environment encoding, regardless of whether you have set stdout to redirect!
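
For comparison, in CPython 2 mixing a Unicode template with a byte string triggers an implicit decode of the bytes using the default encoding (normally ASCII). The Python 3 sketch below makes that hidden step explicit. It models CPython 2 only, so it may not match Jython exactly, and python2_style_mix is a hypothetical helper.

```python
def python2_style_mix(template, raw_bytes, default_encoding="ascii"):
    """Mimic CPython 2: decode the byte string with the default encoding, then format."""
    return template % raw_bytes.decode(default_encoding)


fred = "fréd".encode("utf-8")  # what a UTF-8 source file stores for "fréd"

failure = None
try:
    python2_style_mix("bonjour, %s", fred)
except UnicodeDecodeError as exc:
    # The implicit ASCII decode fails before anything reaches stdout.
    failure = exc
```

Passing default_encoding="utf-8" makes the same call succeed, which is why the failure is independent of how stdout was wrapped.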

How people deal with this is a matter of choice. I would welcome a Unicode expert to say why this happens, whether I've got it wrong in some way, what the preferred solution to this is, whether it also applies to CPython, whether it happens in Python 3, etc., etc.

Carlie answered 7/3, 2014 at 20:44 Comment(2)
That's not odd, that's because "fréd" is a byte sequence and not a Unicode string, so the codecs.getwriter wrapper will leave it alone. You need a leading u, or from __future__ import unicode_literals.Lynea
@MatthiasUrlichs OK... thanks... But I just find encoding one of the most infuriating aspects of IT. Where do you get your understanding from? For example, I just posted another question about encoding here: #44483567: this is about Java, Eclipse, Cygwin & Gradle. If your expertise goes this far, please help... above all I'd like to know where to learn more!Carlie
H
3

I ran into this problem in a legacy application, and it was difficult to identify what was printed where. I helped myself with this hack:

# encoding_utf8.py
import builtins


def print_utf8(fn):
    """Wrap a print-like function so that it prints UTF-8 encoded bytes."""
    def print_fn(*args, **kwargs):
        # Join the arguments the way print() does by default, then encode.
        text = ' '.join(str(arg) for arg in args)
        return fn(text.encode('utf-8'), **kwargs)
    return print_fn


builtins.print = print_utf8(print)

On top of my script, test.py:

import encoding_utf8
string = 'Axwell Λ Ingrosso'
print(string)

Note that this changes ALL calls to print to use an encoding, so your console will print this:

$ python test.py
b'Axwell \xce\x9b Ingrosso'
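
If the b'...' representation is not wanted, a possible variation is to write the encoded bytes to the stream's underlying buffer instead of passing a bytes object to print. This is a sketch assuming Python 3 and a stream that exposes a .buffer attribute (as sys.stdout normally does); this print_utf8 is a standalone illustration, not the decorator above.

```python
import sys


def print_utf8(*args, sep=" ", end="\n", file=None):
    """Write the arguments to the stream's byte buffer, encoded as UTF-8."""
    stream = sys.stdout if file is None else file
    text = sep.join(str(arg) for arg in args) + end
    stream.flush()  # keep ordering with text written earlier
    stream.buffer.write(text.encode("utf-8"))
    stream.buffer.flush()
```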
Harlin answered 22/2, 2018 at 12:55 Comment(0)
P
2

I could "automate" it with a call to:

def __fix_io_encoding(last_resort_default='UTF-8'):
  import sys
  if [x for x in (sys.stdin,sys.stdout,sys.stderr) if x.encoding is None] :
      import os
      defEnc = None
      if defEnc is None :
        try:
          import locale
          defEnc = locale.getpreferredencoding()
        except: pass
      if defEnc is None :
        try: defEnc = sys.getfilesystemencoding()
        except: pass
      if defEnc is None :
        try: defEnc = sys.stdin.encoding
        except: pass
      if defEnc is None :
        defEnc = last_resort_default
      os.environ['PYTHONIOENCODING'] = os.environ.get("PYTHONIOENCODING",defEnc)
      os.execvpe(sys.argv[0],sys.argv,os.environ)
__fix_io_encoding() ; del __fix_io_encoding

Yes, it's possible to get an infinite loop here if this "setenv" fails.

Primatology answered 15/3, 2012 at 9:59 Comment(1)
interesting, but a pipe doesn't seem to be happy about thisWorthwhile
Q
2

On Windows, I had this problem very often when running Python code from an editor (like Sublime Text), but not when running it from the command line.

In this case, check your editor's parameters. In the case of SublimeText, this Python.sublime-build solved it:

{
  "cmd": ["python", "-u", "$file"],
  "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
  "selector": "source.python",
  "encoding": "utf8",
  "env": {"PYTHONIOENCODING": "utf-8", "LANG": "en_US.UTF-8"}
}
Quianaquibble answered 15/11, 2019 at 12:50 Comment(0)
