Best output type and encoding practices for __repr__() functions?

Lately, I've had lots of trouble with __repr__(), format(), and encodings. Should the output of __repr__() be encoded or be a unicode string? Is there a best encoding for the result of __repr__() in Python? What I want to output does have non-ASCII characters.

I use Python 2.x, and want to write code that can easily be adapted to Python 3. The program thus uses

# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function  # The 'Hello' literal represents a Unicode object

Here are some additional problems that have been bothering me, and I'm looking for a solution that solves them:

  1. Printing to a UTF-8 terminal should work (I have sys.stdout.encoding set to UTF-8, but it would be best if other cases worked too).
  2. Piping the output to a file (encoded in UTF-8) should work (in this case, sys.stdout.encoding is None).
  3. My code for many __repr__() functions currently has many return ….encode('utf-8'), and that's heavy. Is there anything robust and lighter?
  4. In some cases, I even have ugly beasts like return ('<{}>'.format(repr(x).decode('utf-8'))).encode('utf-8'), i.e., the representation of objects is decoded, put into a formatting string, and then re-encoded. I would like to avoid such convoluted transformations.

What would you recommend in order to write simple __repr__() functions that behave well with respect to these encoding questions?

Stuffy answered 2/9, 2010 at 13:57 Comment(0)

In Python 2, __repr__ (and __str__) must return a byte string (a str object), not a unicode object. In Python 3 the situation is reversed: __repr__ and __str__ must return Unicode strings (str), not bytes (née string) objects. For example, in Python 2:

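class Foo(object):
    def __repr__(self):
        # Returns a unicode object: not allowed in Python 2
        return u'\N{WHITE SMILING FACE}'

class Bar(object):
    def __repr__(self):
        # Returns a UTF-8 encoded byte string: allowed in Python 2
        return u'\N{WHITE SMILING FACE}'.encode('utf8')

print(repr(Bar()))
# ☺  (on a UTF-8 terminal; the returned byte string is '\xe2\x98\xba')
repr(Foo())
# UnicodeEncodeError: 'ascii' codec can't encode character u'\u263a' in position 0: ordinal not in range(128)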

In Python2, you don't really have a choice. You have to pick an encoding for the return value of __repr__.
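
If the encoding choice has to be made anyway, it can at least be made in one place. The following is a minimal sketch, not something from the documentation: the names UnicodeReprMixin, _repr_text and repr_encoding are hypothetical, and UTF-8 is assumed as the chosen encoding. Subclasses only build Unicode text, and the mixin converts it to whatever type __repr__ must return on the running Python version.

```python
import sys


class UnicodeReprMixin(object):
    """Hypothetical helper: subclasses implement _repr_text() returning text."""

    repr_encoding = 'utf-8'  # assumption: UTF-8 is the encoding you picked

    def __repr__(self):
        text = self._repr_text()
        if sys.version_info[0] < 3:
            # Python 2: __repr__ must return a byte string
            return text.encode(self.repr_encoding)
        # Python 3: __repr__ must return a text string
        return text


class Smiley(UnicodeReprMixin):
    def _repr_text(self):
        return u'<Smiley \N{WHITE SMILING FACE}>'


print(type(repr(Smiley())))
```

With this layout, no individual class needs its own .encode('utf-8') call.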

By the way, have you read the PrintFails wiki? It may not directly answer your other questions, but I did find it helpful in illuminating why certain errors occur.
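
For the piped-output case from the question (where sys.stdout.encoding is None in Python 2), one common approach is to wrap the byte stream in an encoding writer instead of encoding every string by hand. A minimal sketch, assuming UTF-8 is the desired file encoding; utf8_writer is a hypothetical helper name, and an in-memory io.BytesIO stands in for the real output stream:

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Hypothetical helper: wrap a byte stream so it accepts Unicode text."""
    return codecs.getwriter('utf-8')(byte_stream)


buf = io.BytesIO()        # stand-in for a pipe or a file opened in binary mode
out = utf8_writer(buf)
out.write(u'\N{WHITE SMILING FACE}\n')

raw = buf.getvalue()      # the UTF-8 bytes that reached the underlying stream
```

In Python 2 the same wrapper could be applied to sys.stdout itself; in Python 3 the underlying byte stream is sys.stdout.buffer.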


When using from __future__ import unicode_literals,

('<{}>'.format(repr(x).decode('utf-8'))).encode('utf-8')

can be more simply written as

str('<{}>').format(repr(x))

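This works because the format string is pure ASCII, so str() can convert it with the default (ASCII) encoding; the bytes returned by repr(x) pass through unchanged.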

Without from __future__ import unicode_literals, the expression can be written as:

'<{}>'.format(repr(x))
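
As a rough illustration of the str('<{}>').format(repr(x)) form, here is a small sketch that runs on both Python 2 and Python 3. Inner and Wrapper are hypothetical classes, u'' literals stand in for unicode_literals, and UTF-8 is assumed for the Python 2 byte strings:

```python
class Inner(object):
    def __repr__(self):
        text = u'Inner(\N{WHITE SMILING FACE})'
        # str is bytes only on Python 2, where __repr__ must return bytes
        return text.encode('utf-8') if str is bytes else text


class Wrapper(object):
    def __init__(self, x):
        self.x = x

    def __repr__(self):
        # str() turns the ASCII-only template into the native string type,
        # so formatting never mixes unicode and byte strings
        return str(u'<{}>').format(repr(self.x))


wrapped = repr(Wrapper(Inner()))
```

Because the template and repr(self.x) have the same type, no decode/encode round trip is needed.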
Solent answered 2/9, 2010 at 14:1 Comment(13)
It would be nice if the documentation mentioned this :) (docs.python.org/reference/datamodel.html#basic-customization does not)… Anyway… you would say that the approach in point 4 in the question is cumbersome but necessary, right? - Stuffy
EOL: Assuming Python2, repr(x) must return an encoded string. If it was encoded with utf-8, then repr(x).decode('utf8').encode('utf8') should not be necessary. If repr(x) is encoded with some other encoding, repr(x).decode('utf8') will either fail (with UnicodeDecodeError) or produce bogus results, or maybe decode correctly by lucky happenstance. So, AFAIK, repr(x).decode('utf8').encode('utf8') should never be necessary. Can you provide an example? - Solent
@EOL, The return value must be a string object. is how the reference manual page you point to expresses the constraint that the return value must be an instance of str (a unicode object would not be "a string object"). repr is normally expected to return ascii only (think of repr(uo) for all unicode objects, for example: even that returns ascii only -- I think no built-in or standard library type behaves otherwise) but strictly speaking that is not a language constraint, so it's not the reference manual's business. Proposed docs patches are always welcome, btw!-) - Illdisposed
@Alex: Thank you for the comments. I guess that my confusion comes from the fact that one also says "Unicode string", in Python 2.x: that's why I was wondering whether __repr__() could also return a Unicode string… I have been thinking of submitting doc patches. :) - Stuffy
@~unutbu: I should have put parentheses in the example, which differs from what you put in the comment: the decoded object is put into a formatting string before encoding. I updated the original question. - Stuffy
@EOL, yes, I find string-related terminology ("string", "unicode string", "raw string", ...) unfortunately at risk of ambiguity in common discourse -- I try to always use rigorously non-ambiguous terms such as "str instance", "unicode object", "rawstring literal", and so forth, but sometimes such rigorous terminology feels stilted in non-formal contexts. In the Language Reference, the only occurrences of the unfortunate "unicode string" are in a single paragraph in 2.4.1 (literals): s/string/object/ there and "string" becomes unambiguous in the Language Reference (where it matters). - Illdisposed
It's also possible that the Language Reference is deliberately ambiguous because it's not supposed to be a Reference for CPython only, but for all conforming Python implementations: in Jython and IronPython, which we're very keen to consider fully conforming implementations, all strings are Unicode (and it would be costly and totally against their respective platforms to make things otherwise). Maybe we do need a supplemental CPython implementation-specific reference, as an addition to the implementation-neutral Language one. - Illdisposed
@~unutbu: since from __future__ import unicode_literals is in force, '<{}>' is a Unicode string. So, it looks again like you're confirming that what I'm doing is correct; it's good to get such a confirmation. I'll mark your answer as accepted if you can remove the part that assumes that '<{}>' is a str. - Stuffy
@EOL: Ah, I forgot about unicode_literals. Yes, I agree with you then. If you didn't have unicode_literals turned on, however, you could write '<{}>'.format(repr(x)) instead of ('<{}>'.format(repr(x).decode('utf-8'))).encode('utf-8'). Are you sure that from __future__ import unicode_literals is worth it? - Solent
Of course, str('<{}>').format(repr(x)) would also work... See #810296 - Solent
@~unutbu: Unicode with Python 2.x is tricky: '<{}>'.format(repr(x)) does not work when you have bytes with value > 127 in the representation (because the literal creates a Unicode object)! Thank you for the str(…).format() suggestion. As for the from __future__, I like the fact that string literals are Unicode objects, because these objects correspond to Python 3's strings (one of the goals is to prepare the transition to Python 3). - Stuffy
@EOL: I'm not sure that from __future__ import unicode_literals is helping you prepare for Python3. Think about what your code should look like in Python3. It would just be '<{}>'.format(repr(x)). Anything you write that deviates from that, even str('<{}>').format(repr(x)), is just cruft that will have to be fixed during the transition. Are you sure that '<{}>'.format(repr(x)) does not work if you turn off unicode_literals? - Solent
@~unutbu: good point, about the simpler code when not using unicode_literals. I'll turn it off (in which case the simpler code does indeed work). If you can remove the part with "may be incorrect" (which refers to a different situation than that of the question, which assumed Unicode literals), I'll mark your answer as accepted. - Stuffy

I think a decorator can manage __repr__ incompatibilities in a sane way. Here's what I use:

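from __future__ import unicode_literals, print_function
import functools
import sys

def force_encoded_string_output(func):
    """On Python 2, encode the unicode result of func into a byte string."""
    if sys.version_info.major >= 3:
        return func  # Python 3: __repr__ must return text, so leave it alone

    @functools.wraps(func)  # keep the name and docstring of the wrapped method
    def _func(*args, **kwargs):
        # Fall back to UTF-8 when stdout has no encoding (e.g. when piped)
        return func(*args, **kwargs).encode(sys.stdout.encoding or 'utf-8')

    return _func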


class MyDummyClass(object):

    @force_encoded_string_output
    def __repr__(self):
        return 'My Dummy Class! \N{WHITE SMILING FACE}'
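
One caveat with encoding to sys.stdout.encoding: .encode() uses strict error handling by default, so a terminal encoding that cannot represent a character (plain ASCII, say) raises UnicodeEncodeError. A possible mitigation, sketched here under the assumption that an escaped fallback is acceptable in a repr, is the standard 'backslashreplace' error handler; encode_for_repr is a hypothetical helper name:

```python
def encode_for_repr(text, encoding):
    """Hypothetical helper: encode text, escaping what the codec cannot represent."""
    return text.encode(encoding, 'backslashreplace')


smiley = u'My Dummy Class! \N{WHITE SMILING FACE}'

utf8_bytes = encode_for_repr(smiley, 'utf-8')    # smiley kept as UTF-8 bytes
ascii_bytes = encode_for_repr(smiley, 'ascii')   # smiley replaced by an escape
```

The ASCII result stays readable and never raises, at the cost of showing an escape sequence instead of the character.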
Whittle answered 12/12, 2012 at 21:10 Comment(3)
Nice decorator; I modified it, though, so that _func is not defined when it is not needed. So, __repr__ in Python 2 can apparently return a Unicode string, according to your code (maybe because of unicode_literals?). This clashes with unutbu's answer… I find the documentation ambiguous on this (docs.python.org/2/reference/datamodel.html#object.__repr__, docs.python.org/2/reference/lexical_analysis.html#index-14). I would be interested in any reference information on this, just to be sure that no unforeseen problems can arise from having __repr__ return a Unicode string. - Stuffy
@EOL So, __repr__ in Python 2 can apparently return a Unicode string (...) Why do you think so? - Inexertion
Good catch, my bad. I will delete my earlier comment, as it is not relevant. - Stuffy

I use a function like the following:

import sys

def stdout_encode(u, default='UTF8'):
    # Use the terminal's encoding when known; it is None when output is piped
    if sys.stdout.encoding:
        return u.encode(sys.stdout.encoding)
    return u.encode(default)

Then my __repr__ functions look like this:

def __repr__(self):
    return stdout_encode(u'<MyClass {0} {1}>'.format(self.abcd, self.efgh))
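
Note that on Python 3 a __repr__ that returns bytes raises TypeError, so a function like this would need a version check before the code is ported. A minimal sketch of such a variant, assuming the same stdout-or-default policy as above; the MyClass constructor shown here is hypothetical and only exists to make the example self-contained:

```python
import sys


def stdout_encode(u, default='UTF8'):
    """Sketch of a version-aware variant: only encode where bytes are required."""
    if sys.version_info[0] >= 3:
        return u  # Python 3: __repr__ must return text, so pass it through
    return u.encode(sys.stdout.encoding or default)


class MyClass(object):
    def __init__(self, abcd, efgh):
        self.abcd = abcd
        self.efgh = efgh

    def __repr__(self):
        return stdout_encode(u'<MyClass {0} {1}>'.format(self.abcd, self.efgh))


result = repr(MyClass(1, 2))
```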
Inexperienced answered 17/5, 2012 at 15:59 Comment(0)
