UTF-8 In Python logging, how?

I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example:

import logging

def logging_test():
    handler = logging.FileHandler("/home/ted/logfile.txt", "w",
                                  encoding = "UTF-8")
    formatter = logging.Formatter("%(message)s")
    handler.setFormatter(formatter)
    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # This is an o with a hat on it.
    byte_string = '\xc3\xb4'
    unicode_string = unicode("\xc3\xb4", "utf-8")

    print "printed unicode object: %s" % unicode_string

    # Explode
    root_logger.info(unicode_string)

if __name__ == "__main__":
    logging_test()

This explodes with UnicodeDecodeError on the logging.info() call.

At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this:

file_handler.write(unicode_string.encode("UTF-8"))

When it should be doing this:

file_handler.write(unicode_string)

Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation.

Gratiana answered 9/10, 2009 at 18:6 Comment(3)
Your code works perfectly fine here. I tried hard to make it fail, but I did not succeed. - Hyo
And you are right, Python is encoding it with UTF-8, because it asks the output file what encoding to use, and you specified UTF-8, so that's all well and good. - Hyo
I had to hit the Wayback Machine to find the example you mentioned. Interesting. - Inkblot
D
16

Check that you have the latest Python 2.6 - some Unicode bugs were found and fixed since 2.6 came out. For example, on my Ubuntu Jaunty system, I ran your script copied and pasted, removing only the '/home/ted/' prefix from the log file name. Result (copied and pasted from a terminal window):

vinay@eta-jaunty:~/projects/scratch$ python --version
Python 2.6.2
vinay@eta-jaunty:~/projects/scratch$ python utest.py 
printed unicode object: ô
vinay@eta-jaunty:~/projects/scratch$ cat logfile.txt 
ô
vinay@eta-jaunty:~/projects/scratch$ 

On a Windows box:

C:\temp>python --version
Python 2.6.2

C:\temp>python utest.py
printed unicode object: ô

And the contents of the file:

[Image: screenshot of the log file contents]

This might also explain why Lennart Regebro couldn't reproduce it either.

Duo answered 9/10, 2009 at 19:14 Comment(3)
Yes, this was it. There was a bug in the Python logging package that was fixed in a later version. - Gratiana
I am running Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29) [GCC 4.2.1 (Apple Inc. build 5646)] on darwin on my iMac, and I still get the same error. Was the bug really fixed? - Balikpapan
Yes, it was. It happened between 2.6.1 and 2.6.2, at revision 69448: svn.python.org/view?view=rev&revision=69448, so you need to upgrade to a later revision. - Duo
H
35

Having code like:

raise Exception(u'щ')

Caused:

  File "/usr/lib/python2.7/logging/__init__.py", line 467, in format
    s = self._fmt % record.__dict__
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

This happens because the format string is a byte string, while some of the format string arguments are unicode strings with non-ASCII characters:

>>> "%(message)s" % {'message': Exception(u'\u0449')}
*** UnicodeEncodeError: 'ascii' codec can't encode character u'\u0449' in position 0: ordinal not in range(128)

Making the format string unicode fixes the issue:

>>> u"%(message)s" % {'message': Exception(u'\u0449')}
u'\u0449'
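
The failing step is an implicit ASCII encode. The exact traceback needs Python 2, but as a rough sketch (assuming Python 3, which this answer does not target) the same error can be triggered by doing the ASCII encode explicitly. The helper name ascii_encode_error is made up for illustration:

```python
def ascii_encode_error(text):
    """Return the error message raised when text is encoded as ASCII, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        # Same error class and wording as in the Python 2 traceback above.
        return str(exc)
    return None


if __name__ == "__main__":
    print(ascii_encode_error("\u0449"))  # non-ASCII: an error message is returned
    print(ascii_encode_error("plain"))   # pure ASCII: None
```

With a unicode format string, Python 2 never attempts that ASCII step, which is why the u"..." variant succeeds.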

So, in your logging configuration, make all format strings unicode:

'formatters': {
    'simple': {
        'format': u'%(asctime)-s %(levelname)s [%(name)s]: %(message)s',
        'datefmt': '%Y-%m-%d %H:%M:%S',
    },
 ...

And patch the default logging formatter to use a unicode format string:

logging._defaultFormatter = logging.Formatter(u"%(message)s")
Huba answered 11/3, 2014 at 8:31 Comment(5)
What about Python 3.5? Shouldn't all strings be unicode by default? - Pilot
@JanuszSkonieczny do you have the same problem with Python 3? - Huba
Yes, I did, in a Docker container. I solved it by setting a bunch of environment variables related to the OS encoding. For anyone stumbling here with the same problem, see https://mcmap.net/q/339393/-encoding-problems-when-running-an-app-in-docker-python-java-ruby-with-ubuntu-containers-ascii-utf-8. - Pilot
@JanuszSkonieczny In my code I do: import locale; if locale.getpreferredencoding().upper() != 'UTF-8': locale.setlocale(locale.LC_ALL, 'en_US.UTF-8') - Huba
On Windows 10 (ntdll.dll, ver: 10.0.18362.1171) this can cause an exception (code 0xc0000374) with Python x64 versions 3.8.2, 3.9, 3.8.6, 3.7.1 (and potentially others) when the system encoding is set to cp1250 (and potentially others). Beware! - Paste
C
15

I'm a little late, but I just came across this post, which enabled me to set up logging in UTF-8 very easily.

Here is the link to the post

or here is the code:

root_logger= logging.getLogger()
root_logger.setLevel(logging.DEBUG) # or whatever
handler = logging.FileHandler('test.log', 'w', 'utf-8') # or whatever
formatter = logging.Formatter('%(name)s %(message)s') # or whatever
handler.setFormatter(formatter) # Pass handler as a parameter, not assign
root_logger.addHandler(handler)
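
To check that such a handler really writes UTF-8, one option is to log a non-ASCII message and read the raw bytes back. This is only a sketch, assuming Python 3; the logger name "utf8_demo", the temporary file and the helper name are made up for the demonstration:

```python
import logging
import os
import tempfile


def write_and_read_back(message):
    """Log message through a UTF-8 FileHandler and return the raw bytes written."""
    # Hypothetical temporary path, used only for this demonstration.
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    logger = logging.getLogger("utf8_demo")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the demo away from the root logger
    handler = logging.FileHandler(path, "w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    try:
        logger.info(message)
    finally:
        logger.removeHandler(handler)
        handler.close()
    with open(path, "rb") as f:
        data = f.read()
    os.remove(path)
    return data


if __name__ == "__main__":
    # The o with a hat from the question should come back as the bytes c3 b4.
    print(write_and_read_back("\u00f4").strip())
```

Reading in binary mode avoids a second decoding step, so what is printed is exactly what the handler put on disk (plus the line terminator, stripped here).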
Copyist answered 27/11, 2019 at 14:12 Comment(0)
I
9

I had a similar problem running Django on Python 3: my logger died upon encountering some umlauts (äöüß) but was otherwise fine. I looked through a lot of results and found none that worked. I tried

import locale
if locale.getpreferredencoding().upper() != 'UTF-8':
    locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

which I got from the comment above. It did not work. Looking at the current locale gave me some crazy ANSI thing, which turned out to mean basically just "ASCII". That sent me in totally the wrong direction.

Changing the logging format strings to Unicode did not help. Setting a magic encoding comment at the beginning of the script did not help. Setting the charset on the sender's message (the text came from an HTTP request) did not help.

What DID work was setting the encoding on the file handler to UTF-8 in settings.py. Because I had nothing set, the encoding defaulted to None, in which case Python falls back to the locale's preferred encoding, and in my environment that apparently meant ASCII (or, as I like to think of it, ASS-KEY).

    'handlers': {
        'file': {
            'level': 'DEBUG',
            'class': 'logging.handlers.TimedRotatingFileHandler',
            'encoding': 'UTF-8', # <-- That was missing.
            ....
        },
    },
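
Outside Django the same setting can be tried with logging.config.dictConfig. The following is a minimal sketch, not the actual settings.py from my project: the logger name "umlaut_demo", the formatter and the temporary log path are all made up for the example.

```python
import logging
import logging.config
import os
import shutil
import tempfile


def build_config(log_path):
    """Return a dictConfig dictionary whose file handler writes UTF-8."""
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {"plain": {"format": "%(levelname)s %(message)s"}},
        "handlers": {
            "file": {
                "level": "DEBUG",
                "class": "logging.handlers.TimedRotatingFileHandler",
                "filename": log_path,
                "encoding": "utf-8",  # without this, the locale default is used
                "formatter": "plain",
            },
        },
        "loggers": {
            # Hypothetical logger name, chosen only for this sketch.
            "umlaut_demo": {"handlers": ["file"], "level": "DEBUG",
                            "propagate": False},
        },
    }


def log_umlauts():
    """Log the umlauts through the configured handler and return the file text."""
    tmp_dir = tempfile.mkdtemp()
    log_path = os.path.join(tmp_dir, "demo.log")
    logging.config.dictConfig(build_config(log_path))
    logger = logging.getLogger("umlaut_demo")
    logger.info("äöüß")
    # Detach and close the handler so the file can be read and removed.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    with open(log_path, encoding="utf-8") as f:
        text = f.read()
    shutil.rmtree(tmp_dir)
    return text
```

dictConfig passes the extra keys of a handler entry (here filename and encoding) to the handler's constructor, which is how the 'encoding' entry reaches TimedRotatingFileHandler.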
Inez answered 12/7, 2019 at 9:28 Comment(3)
Thanks @Inez, it saved me :) Just wanted to check: is this somehow equivalent to [supervisord] environment=LC_ALL='en_US.UTF-8',LANG='en_US.UTF-8'? I found that in another thread, but it doesn't seem to work for me. - Shading
I honestly have no idea, both code-wise (see above) and OS-wise (LC..., LANG...). However, my educated guess would be that LC and LANG affect the system at the operating-system level, which may or may not propagate down to an individual file, while encoding directly affects that single byte stream. - Inez
Thanks @Inez. After hours of searching for why my API was not logging some stuff, I finally stumbled on your comment here. Works perfectly now. - Soberminded
N
2

Try this:

import logging

def logging_test():
    log = open("./logfile.txt", "w")
    handler = logging.StreamHandler(log)
    formatter = logging.Formatter("%(message)s")
    handler.setFormatter(formatter)
    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # This is an o with a hat on it.
    byte_string = '\xc3\xb4'
    unicode_string = unicode("\xc3\xb4", "utf-8")

    print "printed unicode object: %s" % unicode_string

    # Encode to UTF-8 bytes first, so the handler only ever sees a byte string
    root_logger.info(unicode_string.encode("utf8", "replace"))


if __name__ == "__main__":
    logging_test()

For what it's worth, I was expecting to have to use codecs.open to open the file with UTF-8 encoding, but it works as is. Most likely that is because the message is already a UTF-8 byte string by the time it reaches the file, so the plain file object has nothing left to encode.
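
Note that this workaround is specific to Python 2. As a small sketch, assuming Python 3 (the logger name "bytes_demo" and the helper are hypothetical), logging the encoded bytes there writes their repr instead of the character:

```python
import io
import logging


def log_to_string(message):
    """Log message through a StreamHandler and return the text it wrote."""
    stream = io.StringIO()
    logger = logging.getLogger("bytes_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    try:
        logger.info(message)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().strip()


if __name__ == "__main__":
    encoded = "\u00f4".encode("utf-8")
    # On Python 3 the bytes object is converted with str(), so the log
    # contains the literal text b'\xc3\xb4' rather than the character.
    print(log_to_string(encoded))
```

So on Python 3 it is better to pass the str itself to the logger and let the handler's encoding do the work.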

Nutriment answered 9/10, 2009 at 18:17 Comment(1)
@Gank you are using Python 3, I guess. - Huba
C
1

If I understood your problem correctly, the same issue should arise on your system when you do just:

str(u'ô')

I guess automatic encoding to the locale encoding on Unix will not work until you enable the locale-aware if branch in the setencoding function of your site module. This file usually resides in /usr/lib/python2.x; it is worth inspecting anyway. AFAIK, locale-aware setencoding is disabled by default (that is true for my Python 2.6 installation).

The choices are:

  • Let the system figure out the right way to encode Unicode strings to bytes (some configuration in the site-specific site.py is needed)
  • Encode Unicode strings in your code and output just bytes

See also The Illusive setdefaultencoding by Ian Bicking and related links.
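
Before changing anything, it can help to see which encodings the interpreter would pick on its own. This is a small diagnostic sketch, assuming Python 3 (the site.py details above are Python 2 specific); the function name encoding_report is made up:

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python would use when none is given explicitly."""
    return {
        "default": sys.getdefaultencoding(),              # str <-> bytes default
        "filesystem": sys.getfilesystemencoding(),        # file names
        "preferred": locale.getpreferredencoding(False),  # open() without encoding=
        "stdout": getattr(sys.stdout, "encoding", None),  # print()
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print("%-10s %s" % (name, value))
```

If "preferred" comes back as something like ANSI_X3.4-1968 (an alias for ASCII), any file opened without an explicit encoding will reject non-ASCII text.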

Capriola answered 9/10, 2009 at 20:24 Comment(0)
S
1

If you use Python 3.7 or later, set the environment variable PYTHONUTF8 to 1 before running your Python script.

For example, on Linux:

export PYTHONUTF8=1

PowerShell:

$env:PYTHONUTF8 = "1"

Windows command line:

set PYTHONUTF8=1

Then execute your Python script.
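
To confirm that the variable actually took effect, sys.flags.utf8_mode (available since Python 3.7) can be inspected. The sketch below starts a child interpreter with the variable set; the helper name utf8_mode_in_child is made up for illustration:

```python
import os
import subprocess
import sys


def utf8_mode_in_child(value):
    """Run a child interpreter with PYTHONUTF8=value and report its utf8_mode flag."""
    env = dict(os.environ, PYTHONUTF8=value)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


if __name__ == "__main__":
    print("PYTHONUTF8=1 ->", utf8_mode_in_child("1"))
    print("PYTHONUTF8=0 ->", utf8_mode_in_child("0"))
```

In UTF-8 mode, open() and therefore logging.FileHandler default to UTF-8 when no encoding is given, regardless of the locale.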

Staffard answered 13/10, 2023 at 3:44 Comment(0)
O
0

In Python 3.10, I managed to log Unicode characters (Greek letters in my case) by adding encoding='utf-8'.

Small example:

import logging
import sys

if __name__ == "__main__":
    logging.basicConfig(filename="log.log", filemode="w", level=logging.DEBUG, encoding="utf-8")
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(sys.stdout)
    handler.setLevel(logging.DEBUG)
    formatter = logging.Formatter(" %(levelname)s - %(message)s")  # %(asctime)s - %(name)s -
    handler.setFormatter(formatter)
    root.addHandler(handler)
    logging.debug("Γεια σου μαρία")
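
The encoding argument of logging.basicConfig only exists from Python 3.9 on. For 3.8, a possible alternative (a sketch, not something I have needed myself) is to build the FileHandler yourself and pass it through handlers=. The function names and the temporary file below are made up for the example:

```python
import logging
import os
import tempfile


def configure_root_utf8(path):
    """Configure the root logger to write UTF-8 without basicConfig(encoding=...)."""
    handler = logging.FileHandler(path, "w", encoding="utf-8")
    # force=True (Python 3.8+) replaces any handlers already on the root logger.
    logging.basicConfig(handlers=[handler], level=logging.DEBUG,
                        format="%(levelname)s - %(message)s", force=True)
    return handler


def demo():
    """Log a Greek greeting to a temporary file and return what was written."""
    fd, path = tempfile.mkstemp(suffix=".log")  # hypothetical temporary log file
    os.close(fd)
    handler = configure_root_utf8(path)
    logging.debug("Γεια σου")
    # Detach and close the handler so the file can be read and deleted.
    logging.getLogger().removeHandler(handler)
    handler.close()
    with open(path, encoding="utf-8") as f:
        text = f.read()
    os.remove(path)
    return text.strip()
```

Without force=True, basicConfig silently does nothing when the root logger already has handlers, which is an easy way to end up with a log file in the wrong encoding.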
Olag answered 2/8, 2023 at 12:6 Comment(0)
C
0

Python 3.11.8, this works for me.
https://gist.github.com/jtatum/5311955

import logging

# Add a file handler with utf-8 encoding
handler = logging.FileHandler('output.log', 'w',
                              encoding = 'utf-8')
root_logger = logging.getLogger()
root_logger.addHandler(handler)
Caparison answered 20/2 at 2:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.