Dangers of sys.setdefaultencoding('utf-8')

5

35

There is a trend of discouraging the use of sys.setdefaultencoding('utf-8') in Python 2. Can anybody list real examples of problems with it? Arguments like "it is harmful" or "it hides bugs" don't sound very convincing.

UPDATE: Please note that this question is only about UTF-8; it is not about changing the default encoding in the general case.

Please give some examples with code if you can.

Rife answered 22/2, 2015 at 10:55 Comment(9)
how would you be using it? If you are talking about modifying sitecustomize.py then when the code is run on other computers you may well have bugsRecliner
If you have a decode or encode error, it is probably for an obvious reason, e.g. s = u'é'; str(s). You should work with one type, either str or unicode, and handle the encoding explicitly.Recliner
@PadraicCunningham, #28643281, no global settings - application-only.Rife
might be relevant mail.python.org/pipermail/python-dev/2009-August/091406.html You can get strange effects caused by the fact that some string objects will now compare equal while not necessarily having the same hash value. Unicode objects and strings have the same hash value provided that they are both ASCII. With the ASCII default encoding, a non-ASCII string cannot be compared to a Unicode object, so the problem does not occur.Recliner
@PadraicCunningham, a UTF-8 string is not a Unicode object yet, and regardless of the encoding such string objects won't compare equal if they have different contents, unless there is a bug in the Python hash function.Rife
Because you are misunderstanding how Python works with encodings if you think you need it. Here’s a presentation of how to use it correctly: farmdev.com/talks/unicode – As an aside, if the argument “it hides bugs” doesn’t sound convincing to you, that may be the real problem. (And yes, Unicode in Python 2 sucks. But sys.setdefaultencoding isn’t the solution.) And lastly, if you want to see a bug it causes, look no further: https://mcmap.net/q/37496/-will-a-unicode-string-just-containing-ascii-characters-always-be-equal-to-the-ascii-stringAlessandro
@KonradRudolph, that's why I am asking for a real example that I can understand.Rife
@techtonik here's an example of a question where a user got screwed because the Author of PyDev thinks it's a good idea to set sys.setdefaultencoding('utf-8'). Here's a blog post of someone else that got screwed by this with some more details and further links.Potion
A nice posting today on the topic: anonbadger.wordpress.com/2015/06/16/…Neural
A
25

The original poster asked for code which demonstrates that the switch is harmful, beyond the claim that it "hides" bugs unrelated to the switch.

Updates

  • [2020-11-01]: pip install setdefaultencoding
    Eradicates the need to reload(sys) (from Thomas Grainger).

  • [2019]: Personal experience with python3:

    • No unicode en/decoding problems. Reasons:
    • Got used to writing .encode('utf-8') and .decode('utf-8') what feels like 100 times a day.
    • Looking into libraries: same. 'utf-8' is either hardcoded or the silent default in pretty much all the I/O done.
    • Heavily improved byte string support finally made it possible to convert I/O-centric applications like Mercurial.
    • Having to write .encode and .decode all the time made people aware of the difference between strings for humans and strings for machines.

In my opinion, Python 2's byte strings, combined with (UTF-8 default) decoding only before outputting to humans or to Unicode-only formats, would have been the technically superior approach, compared to decoding and encoding everything at ingress and egress many, many times without actual need. Whether something like len() is more practical when it returns the character count for humans or the byte count that machines use to store and forward depends on the application.

=> I think it's safe to say that UTF-8 everywhere saved the Unicode Sandwich Design.
Without that, many libraries and applications that only pass strings through without interpreting them could not work.

Summary of conclusions

(from 2017)

Based on both experience and evidence I've collected, here are the conclusions I've arrived at.

  1. Setting the default encoding to UTF-8 is nowadays safe, except for specialised applications that handle files from non-Unicode-ready systems.

  2. The "official" rejection of the switch is based on reasons that are no longer relevant for the vast majority of end users (as opposed to library providers), so we should stop discouraging users from setting it.

  3. Working in a model that handles Unicode properly by default is far better suited to applications for inter-system communication than manually working with Unicode APIs.

Effectively, modifying the default encoding very frequently avoids a number of user headaches in the vast majority of use cases. Yes, there are situations in which programs dealing with multiple encodings will silently misbehave, but since this switch can be enabled piecemeal, this is not a problem in end-user code.

More importantly, enabling this flag is a real advantage in users' code, both by removing the overhead of manually handling Unicode conversions, which clutters the code and makes it less readable, and by avoiding potential bugs when the programmer fails to do this properly in all cases.


Since these claims are pretty much the exact opposite of Python's official line of communication, I think an explanation for these conclusions is warranted.

Examples of successfully using a modified defaultencoding in the wild

  1. Dave Malcolm of Fedora believed it is always right. After investigating the risks, he proposed changing def.enc. to UTF-8 distribution-wide for all Fedora users.

    The only hard fact presented for why Python would break is the hashing behavior I listed, which is never picked up as a reason to worry by any other opponent within the core community, or even by the same person when working on user tickets.

    Summary for Fedora: Admittedly, the change itself was described as "wildly unpopular" with the core developers, and it was accused of being inconsistent with previous versions.

  2. There are 3000 projects at openhub alone doing it. They have a slow search frontend, but scanning over it, I estimate 98% are using UTF-8. I found nothing about nasty surprises.

  3. There are 18000(!) github master branches with it changed.

    While the change is "unpopular" in the core community, it's pretty popular in the user base. This could be disregarded, since users are known to use hacky solutions, but I don't think that is a relevant argument because of my next point.

  4. There are only 150 bug reports in total on GitHub due to this. At a rate of effectively 100%, the change seems to be positive, not negative.

    To summarize the existing issues people have run into, I've scanned through all of the aforementioned tickets.

    • Changing def.enc. to UTF-8 is typically introduced but not removed in the issue-closing process, most often as a solution. Some bigger projects excuse it as a temporary fix, considering the "bad press" it has, but far more bug reporters are just glad about the fix.

    • A few (1-5?) projects modified their code to do the type conversions manually so that they did not need to change the default anymore.

    • In two instances I see someone claiming that setting def.enc. to UTF-8 leads to a complete lack of output, without explaining the test setup. I could not verify the claim; I tested one and found the opposite to be true.

    • One claims his "system" might depend on not changing it but we do not learn why.

    • One (and only one) had a real reason to avoid it: ipython either uses a 3rd party module or the test runner modified their process in an uncontrolled way (it is never disputed that a def.enc. change is advocated by its proponents only at interpreter setup time, i.e. when 'owning' the process).

  5. I found zero indication that the different hashes of 'é' and u'é' cause problems in real-world code.

  6. Python does not "break"

    After changing the setting to UTF-8, no feature of Python covered by unit tests is working any differently than without the switch. The switch itself, though, is not tested at all.

  7. It is advised on bugs.python.org to frustrated users

    Examples here, here or here (often connected with the official line of warning)

    The first one demonstrates how established the switch is in Asia (compare also with the github argument).

  8. Ian Bicking published his support for always enabling this behavior.

    I can make my systems and communications consistently UTF-8, things will just get better. I really don't see a downside. But why does Python make it SO DAMN HARD [...] I feel like someone decided they were smarter than me, but I'm not sure I believe them.

  9. Martijn Faassen, while refuting Ian, admitted that ASCII might have been wrong in the first place.

    I believe if, say, Python 2.5, shipped with a default encoding of UTF-8, it wouldn't actually break anything. But if I did it for my Python, I'd have problems soon as I gave my code to someone else.

  10. In Python3, they don't "practice what they preach"

    While opposing any def.enc. change so harshly because of environment-dependent code or implicitness, a discussion here revolves around Python3's problems with its 'unicode sandwich' paradigm and the corresponding required implicit assumptions.

    Further they created possibilities to write valid Python3 code like:

     >>> from 褐褑褒褓褔褕褖褗褘 import *        
     >>> def 空手(合氣道): あいき(ど(合氣道))
     >>> 空手(う힑힜('👏 ') + 흾)
     💔
    
  11. DiveIntoPython recommends it.

  12. In this thread, Guido himself advises a professional end user to use a process-specific environment with the switch set, to "create a custom Python environment for each project."

    The fundamental reason the designers of Python's 2.x standard library don't want you to be able to set the default encoding in your app, is that the standard library is written with the assumption that the default encoding is fixed, and no guarantees about the correct workings of the standard library can be made when you change it. There are no tests for this situation. Nobody knows what will fail when. And you (or worse, your users) will come back to us with complaints if the standard library suddenly starts doing things you didn't expect.

  13. Jython offers to change it on the fly, even in modules.

  14. PyPy did not support reload(sys) - but brought it back on user request within a single day without questions asked. Compare with the "you are doing it wrong" attitude of CPython, claiming without proof it is the "root of evil".


To end this list, I confirm that one could construct a module which crashes because of a changed interpreter config, doing something like this:

def is_clean_ascii(s):
    """[Stupid] type-agnostic check that s contains only ASCII characters."""
    try:
        # Implicit encode/decode via the default encoding (Python 2 only).
        unicode(str(s))
        # We also end up here for non-ASCII input if the def.enc. was changed.
        return True
    except Exception:
        return False

if is_clean_ascii(mystr):
    pass  # <code relying on mystr to be ASCII>

I don't think this is a valid argument, because the person who wrote this dual-type-accepting module was obviously aware of ASCII vs. non-ASCII strings and would be aware of encoding and decoding.
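For comparison, a check that does not depend on the interpreter's default encoding can inspect the byte values or code points directly. The following is only a minimal sketch, written for Python 3 for brevity (where text and bytes are separate types); the name is_ascii_only is hypothetical and not taken from any module discussed here.

```python
def is_ascii_only(s):
    """Return True if s (text or bytes) contains only ASCII values.

    No implicit encoding or decoding happens here, so the result does not
    depend on the interpreter's default encoding.
    """
    if isinstance(s, (bytes, bytearray)):
        # Iterating over bytes yields integers in Python 3.
        return all(b < 128 for b in s)
    # For text, compare each code point against the ASCII range.
    return all(ord(ch) < 128 for ch in s)


print(is_ascii_only("abc"))
print(is_ascii_only("Jos\u00e9"))
print(is_ascii_only("Jos\u00e9".encode("utf-8")))
```

Because the function never converts between the two types, its answer stays the same whatever the process-wide setting is.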

I think this evidence is more than enough indication that changing this setting does not lead to any problems in real world codebases the vast majority of the time.

Alexandros answered 22/2, 2015 at 10:55 Comment(16)
Shouldn't this be a blog entry that you link to in a comment on Martijn's answer?Spraggins
thanks for the feedback, I provide now a summary of my investigations on top.Alexandros
This answer is really far too long, and unnecessarily so. Most of your supporting arguments, the ones that take up the bulk of your post, appear to be nothing more than an argumentum ad populum at best, and a proof by verbosity at worst. Furthermore, the entire section about standardization and encoding is irrelevant and belongs in a blog post, not in an answer on Stack Overflow. Your answer would be much better if you simply distilled the technical reasons for your opinion, nothing more.Culch
Some specific comments: Setting a different default is like using goto. Sure, you can make it work, but you'll have a harder time for it as you develop the application. You get to be inconsistent in your handling of Unicode and that is going to bite you. Most people that use it do not understand Unicode and think this is the easy way out.Allred
Arguments that a lot of GitHub code uses it is not proof that it is okay to use, it can also be taken as proof most developers do not understand how to use Unicode properly. You see the same issues with how inexperienced developers use super(). Generally speaking, it is a Cargo Cult, applied and misapplied without understanding how it works or if it is needed at all.Allred
You are right, a default should, quite generally, never be changed, just because problems go away magically and you don't know why. You should know what u r doing. But IF you know what it does then Python2 is just way better to work with. Better than Py3 for me - but thats a different story ;-)Alexandros
I also begin to understand that your main problem with it seems to be the (agreed) fact that your code could get inconsistent regarding string types traveling through, some unicode some byte, while without the switch it would crash. Also here I'm with you: One should decide before writing the first Py2 l.o.c., if his lib or process should be working with unicode OR with bytes - consistently. We prefer bytes - with good reasons.Alexandros
@MartijnPieters "You get to be inconsistent in your handling of Unicode and that is going to bite you. " Could you elaborate on what exactly will be the problems biting us? So setdefaultencoding seems to be rather a safe way out. If something would break big time, wouldn't we have heard of it by now, and wouldn't that mean that thing which breaks on using another default encoding needs to be fixed? Thanks for your insight. IMO the way Python 2.x continues to refuse to handle ASCII > 127 by default is rather arcane (though I'm all in favor of Python otherwise)...Vorticella
@miraculixx: Python 2.0 was the first Python version to introduce Unicode support, in October 2000. It included the decision there and then to disable setting the default encoding. That means there is now 15 years of legacy code out there that relies on being able to catch an exception when you try to concatenate non-ASCII bytes to bytes that are not decodable as ASCII, etc. You cannot possibly fix all that code.Allred
@miraculixx: and what you call 'arcane' is called backwards and forwards compatibility, a requirement when your language is used by billions of computers in the world. Python 3 could make the switch, because it did not make any promises about compatibility.Allred
> That means there is now 15 years of legacy code out there that relies on being able to catch an exception (...). Actually, the 15 years of legacy code relies on the standard lib to work with unicode (i.e. 'sometext'.decode('whatever')), and not supporting changing the default encoding is IMHO akin to saying we're not sure whether unicode support actually works [in the stdlib]. Anyway, I get your point. Essentially it means switching the default encoding is not officially supported; however, as this answer points out, under some circumstances there are advantages to doing so. Thanks for your POV.Vorticella
Having this knowledge earlier we would have never needed Python 3, sick of wasting a decade of Python's community's time causing lack of innovationPinkard
@nehemiah: That pretty much sums up my original post into one line.Alexandros
Thanks very much for the analysis. I had a program using modules that used str() in a way that caused the UnicodeDecodeError exception, and no easy way to fix them. Using the def.enc. solution was the only way to tame "bugs" in the modules. I used the reload/setdefaultencoding only under very controlled circumstances (to contain possible side effects) and have had no problems. Your post helped to alleviate concerns about the side effects, so was helpful to make me more comfortable with my solution.Parameter
so you want to call sys.setdefaultencoding, but don't want to reload(sys)? introducing pip install setdefaultencoding ! >>> import setdefaultencoding >>> setdefaultencoding.setdefaultencodingTephrite
Thanks @ThomasGrainger - hope you don't mind that I mention this one in the OP.Alexandros
A
16

Because you don't always want to have your strings automatically decoded to Unicode, or for that matter your Unicode objects automatically encoded to bytes. Since you are asking for a concrete example, here is one:

Take a WSGI web application; you are building a response by adding the product of an external process to a list, in a loop, and that external process gives you UTF-8 encoded bytes:

results = []
content_length = 0

for somevar in some_iterable:
    output = some_process_that_produces_utf8(somevar)
    content_length += len(output)
    results.append(output)

headers = [
    ('Content-Length', str(content_length)),
    ('Content-Type', 'text/html; charset=utf-8'),
]
# WSGI expects a status string and a list of (name, value) header tuples.
start_response('200 OK', headers)
return results

That's great and fine and works. But then your co-worker comes along and adds a new feature; you are now providing labels too, and these are localised:

results = []
content_length = 0

for somevar in some_iterable:
    label = translations.get_label(somevar)
    output = some_process_that_produces_utf8(somevar)

    content_length += len(label) + len(output) + 1
    results.append(label + '\n')
    results.append(output)

headers = [
    ('Content-Length', str(content_length)),
    ('Content-Type', 'text/html; charset=utf-8'),
]
# WSGI expects a status string and a list of (name, value) header tuples.
start_response('200 OK', headers)
return results

You tested this in English and everything still works, great!

However, the translations.get_label() library actually returns Unicode values and when you switch locale, the labels contain non-ASCII characters.

The WSGI library writes out those results to the socket, and all the Unicode values get auto-encoded for you, since you set setdefaultencoding() to UTF-8, but the length you calculated is entirely wrong. It'll be too short as UTF-8 encodes everything outside of the ASCII range with more than one byte.
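The mismatch between the two counts can be seen in isolation. This is a small sketch in Python 3 (where text and bytes are distinct types, so both lengths can be taken explicitly); the label value is a made-up stand-in for what translations.get_label() might return.

```python
# Hypothetical localised label ("été"): three characters.
label = "\u00e9t\u00e9"
# What would actually be written to the socket.
encoded = label.encode("utf-8")

char_count = len(label)     # number of code points
byte_count = len(encoded)   # number of bytes; Content-Length must report this

print(char_count, byte_count)  # 3 5
```

A Content-Length computed from the character count would be two bytes short for this single label.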

All this is ignoring the possibility that you are actually working with data in a different codec; you could be writing out Latin-1 + Unicode, and now you have an incorrect length header and a mix of data encodings.

Had you not used sys.setdefaultencoding(), an exception would have been raised and you would have known you had a bug. But now your clients are complaining about incomplete responses; there are bytes missing at the end of the page and you don't quite know how that happened.

Note that this scenario doesn't even involve 3rd party libraries that may or may not depend on the default still being ASCII. The sys.setdefaultencoding() setting is global, applying to all code running in the interpreter. How sure are you there are no issues in those libraries involving implicit encoding or decoding?

That Python 2 encodes and decodes between str and unicode types implicitly can be helpful and safe when you are dealing with ASCII data only. But you really need to know when you are mixing Unicode and byte string data accidentally, rather than plaster over it with a global brush and hope for the best.
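One way to avoid the accidental mix is to encode text explicitly at the output boundary and count only bytes. The following is a minimal sketch under stated assumptions: it is written for Python 3, build_body is a hypothetical helper, and the labels and process output are invented sample values rather than the real translations or process from the example above.

```python
def build_body(labels, outputs, encoding="utf-8"):
    """Return (chunks, content_length) with every chunk as bytes."""
    chunks = []
    for label, output in zip(labels, outputs):
        # Labels are text: encode them explicitly at the output boundary.
        chunks.append((label + "\n").encode(encoding))
        # The process output is already encoded bytes: pass it through as is.
        chunks.append(output)
    # Every chunk is bytes now, so len() counts what goes over the wire.
    return chunks, sum(len(chunk) for chunk in chunks)


chunks, content_length = build_body(["caf\u00e9"], [b"<p>ok</p>"])
print(content_length)  # 15
```

With the conversion spelled out, the length is always taken from bytes, and passing text where bytes are expected fails loudly instead of being converted silently.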

Allred answered 10/4, 2015 at 12:42 Comment(18)
There is a mistake in you don't always want to have your strings automatically decoded to Unicode - the strings are decoded to UTF-8, not to Unicode objects.Rife
@techtonik: UTF-8 is an encoding, so they'd be encoded to UTF-8. That's the issue though, you get Unicode objects when you mix the two types; str + unicode gives you unicode, provided the str could be decoded.Allred
@techtonik: in my sample the translations.get_label() returns unicode objects. The WSGI implementation could also opt to just concatenate all the results, at which point you'd get one unicode object as output passed on to the socket, or perhaps to another WSGI wrapping label. We won't know, because we silenced all Python exceptions that normally would have been thrown.Allred
I don't get it. To me it is like you are saying that with sys.setdefaultencoding("utf-8") Python will start producing unicode objects in places where it was str previously. Is that right? (I am still reading through the example)Rife
A table about type conversion and contents of variable will definitely help to get that right.Rife
Python will try and decode str objects when concatenating with unicode objects, yes, and that will normally fail if those bytes are not decodable as ASCII. But as soon as you change the default codec, then bytes that are decodable as UTF-8 will also be converted and you do end up with Unicode objects where you thought you were producing byte values instead.Allred
So, Python will not crash with non-ASCII strings anymore with sys.setdefaultencoding("utf-8"). I fail to see how this behaviour is bad for your example. In case of my application (Roundup) this is close to the crash I am trying to fix - #28643281Rife
@techtonik: we are going round in circles. You don't see this as bad, because you don't see how implicitly converting types can be bad. In a language where implicit conversions are the exception rather than the default, this is a huge issue, and you are changing the rules of that conversion at a global level. If this was configured per module instead, you'd be free to shoot yourself in the foot without also forcing the issue for any 3rd party library you may be using. But that's not the case here, and if you are not seeing a problem with such behaviour I don't know what to tell you.Allred
I see that things can be bad, but I don't see that there is a real world example of that changed behaviour was desired behaviour. In your example, the app will just crash on international symbol, which is happened in #28643281 when we added Unicode templating layer to Roundup, and sys.setdefaultencoding("utf-8") is the only recommended way to fix that crash. What I am hearing from you is that the crash is desired behaviour. I can not agree on that, sorry.Rife
the length you calculated is entirely wrong is a good argument though. pastebin.ubuntu.com/10791721 gives 3 and 6 on console. But this looks like a bug in Python, which is unable to handle multibyte encodings.Rife
@techtonik: the desired behaviour would be to fix Roundup. If there is a bug in a 3rd party product, and the only work-around is to make a global change, then there is something wrong with that product.Allred
@techtonik: Why is that a bug in how Python handles a multibyte encoding? The length of a Unicode string should be the number of codepoints, not the number of bytes in an arbitrary codec. The length of a byte string should be the number of bytes. The Content Length header should contain the byte count, not the codepoint count. I don't see why this is a multi-byte vs. single-byte encoding issue.Allred
@techtonik: in your pastie are getting the length of byte strings, encoded to UTF-8. You get the same output without the sys.setdefaultencoding() call.Allred
Ok. So if we are not using len() for string processing, we are basically safe to use sys.setdefaultencoding("utf-8") (which seems to be the case with Roundup core, which seems to merely move utf-8 string content from the DB to the template layer).Rife
The problem with external libs will only appear if they use non-English chars themselves (badlib), or being fed utf-8 string for processing. Which leads to question #29587276 - how to trace that utf-8 strings are passed to external libs.Rife
The mentioned issue with Roundup is issues.roundup-tracker.org/issue2550811 - I'd like to know how'd you propose to fix it.Rife
@techtonik: using Jinja2 here reveals that Roundup is not practicing the Unicode sandwich approach; make all text in the application unicode at the point of entry as early as possible, and only encode to bytes at the point of exit, as late as possible. In this context, I recommend reading / seeing Ned Batchelder's Pragmatic Unicode presentation.Allred
To be more precise "but the byte length you calculated is entirely wrong". Assuming that the number of bytes in a string is equal to the number of characters is generally a bad idea, but was safe if str is ascii. Trying to write code in py2 with unicode_literals and be unicode everywhere, it seems like changing the default encoding would be great -- but I guess my real problem is I introduced a str somewhere. Thanks for the enlightening explanation.Beleaguer
A
3

First of all: many opponents of changing the default encoding argue that it's dumb because it even changes ASCII comparisons.

I think it's fair to make clear that, in line with the original question, I see nobody advocating anything other than moving from ASCII to UTF-8.

The setdefaultencoding('utf-16') example seems to be always just brought forward by those who oppose changing it ;-)


With m = {'a': 1, 'é': 2} and the file 'out.py':

# coding: utf-8
print u'é' 

Then:

+---------------+-----------------------+------------------+
| DEF.ENC       | OPERATION             | RESULT (printed) |
+---------------+-----------------------+------------------+
| ANY           | u'abc' == 'abc'       | True             |
| (i.e. ASCII   | str(u'abc')           | 'abc'            |
|  or UTF-8)    | '%s %s' % ('a', u'a') | u'a a'           |
|               | python out.py         | é                |
|               | u'a' in m             | True             |
|               | len(u'a'), len('a')   | (1, 1)           |
|               | len(u'é'), len('é')   | (1, 2) [*]       |
|               | u'é' in m             | False  (!)       |
+---------------+-----------------------+------------------+
| UTF-8         | u'abé' == 'abé'       | True   [*]       |
|               | str(u'é')             | 'é'              |
|               | '%s %s' % ('é', u'é') | u'é é'           |
|               | python out.py | more  | 'é'              |
+---------------+-----------------------+------------------+
| ASCII         | u'abé' == 'abé'       | False, Warning   |
|               | str(u'é')             | Encoding Crash   |
|               | '%s %s' % ('é', u'é') | Decoding Crash   |
|               | python out.py | more  | Encoding Crash   |
+---------------+-----------------------+------------------+

[*]: Result assumes the same é. See below on that.

Looking at those operations, changing the default encoding in your program might not look too bad, since it gives you results 'closer' to those with ASCII-only data.

Regarding the hashing ( in ) and len() behaviour, you get the same results as with ASCII (more on the results below). Those operations also show that there are significant differences between unicode and byte strings, which might cause logical errors if you ignore them.

As noted already: it is a process-wide option, so you have just one shot to choose it. This is why library developers should never do it, but should instead get their internals in order so that they do not need to rely on Python's implicit conversions. They also need to clearly document what they expect and return, and reject input they did not write the library for (like the normalize function, see below).

=> Writing programs with that setting on makes it risky for others to use the modules of your program in their code, at least without filtering input.

Note: Some opponents claim that def.enc. is even a system-wide option (via sitecustomize.py), but in times of software containerisation (Docker) at the latest, every process can be started in its own ideal environment without overhead.


Regarding the hashing and len() behaviour:

It tells you that even with a modified def.enc. you still can't be ignorant about the types of strings you process in your program. u'' and '' are different sequences of bytes in the memory - not always but in general.

So when testing make sure your program behaves correctly also with non Ascii data.

Some say the fact that hashes can become unequal when data values change - although due to implicit conversions the '==' operations remain equal - is an argument against changing def.enc.

I personally don't share that since the hashing behaviour just remains the same as w/o changing it. Have yet to see a convincing example of undesired behaviour due to that setting in a process I 'own'.

All in all, regarding setdefaultencoding("utf-8"): the answer to whether it's dumb or not should be more balanced.

It depends. While it does avoid crashes, e.g. at str() operations in a log statement, the price is a higher chance of unexpected results later, since wrong types make it further into code whose correct functioning depends on a certain type.

In no case should it be an alternative to learning the difference between byte strings and unicode strings for your own code.


Lastly, setting the default encoding away from ASCII does not make your life any easier for common text operations like len(), slicing and comparisons, should you assume that (byte)stringifying everything with UTF-8 resolves problems there.

Unfortunately it doesn't - in general.

The '==' and len() results are a far more complex problem than one might think, even with the same type on both sides.

Without def.enc. changed, "==" always fails for non-ASCII data, as shown in the table. With it, it works - sometimes:

Unicode standardised well over a hundred thousand symbols of the world and gave each a number (within a code space of about 1.1 million code points), but there is unfortunately NOT a 1:1 mapping between the glyphs displayed to a user on output devices and the symbols they are generated from.

To motivate you to research this: take two files, j1 and j2, written with the same program using the same encoding and containing user input:

>>> import sys
>>> u1, u2 = open('j1').read(), open('j2').read()
>>> print sys.version.split()[0], u1, u2, u1 == u2

Result: 2.7.9 José José False (!)

Printing the values as a tuple in Py2 shows their repr, and you see the reason: unfortunately there are TWO ways to encode the same character, the accented 'e':

>>> print (sys.version.split()[0], u1, u2, u1 == u2)
('2.7.9', 'Jos\xc3\xa9', 'Jose\xcc\x81', False)

What a stupid codec, you might say, but it's not the fault of the codec. It's a problem in Unicode as such.

So even in Py3:

>>> u1, u2 = open('j1').read(), open('j2').read()
>>> print(sys.version.split()[0], u1, u2, u1 == u2)

Result: 3.4.2 José José False (!)

=> Independent of Py2 and Py3, actually independent of any computing language you use: To write quality software you probably have to "normalise" all user input. The unicode standard did standardise normalisation. In Python 2 and 3 the unicodedata.normalize function is your friend.
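As an illustration of that normalisation step, here is a short sketch in Python 3. The two literals are assumptions chosen to mirror the byte sequences shown above (the precomposed U+00E9 versus 'e' followed by the combining accent U+0301); they stand in for the contents of j1 and j2.

```python
import unicodedata

composed = "Jos\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "Jose\u0301"   # 'e' followed by a combining acute accent (U+0301)

# Both render as "José", but they are different code point sequences.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 4 5

# Normalising both sides to the same form (NFC here) makes them comparable.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                    # True
```

Which form to normalise to (NFC, NFD, or the compatibility forms) depends on the application; the point is only that both sides of a comparison use the same one.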

Alexandros answered 10/4, 2015 at 9:49 Comment(9)
You are assuming your source code is encoded to UTF-8 as well. Or that all your byte strings are UTF-8 encoded. Implicit encoding from Unicode to UTF-8, then concatenating that data with any other byte string using an arbitrary encoding would be a huge bug, and you masked it by setting the default encoding.Allred
Another issue is that code can rely on encoding or decoding errors to signal type differences. That includes 3rd party libraries. By setting a default encoding other than ASCII, you can no longer detect UTF-8 bytes -> Unicode and Unicode -> bytes implicit encodings where you meant to actually use explicit encodings.Allred
In any case, I've yet to come across a use-case where setting the default encoding was a better idea than handling encodings correctly. It's like using globals, you don't use them because in practice you significantly increase the likelihood of bugs.Allred
So if testing ensures that your code works correctly with non-ASCII data, why not go the extra step and handle encoding and decoding correctly, and not mix types arbitrarily? Why rely on the setdefaultencoding() crutch at all?Allred
On the whole, I am not actually sure where you are going with this answer; yes, Unicode comparisons have their issues, but you are not actually saying anything clear about why sys.setdefaultencoding() should be avoided.Allred
That's right - the goal of my post was to make clear that 1. the answer to this question should be more balanced. 2. def.enc = utf-8 does not relieve the developer of understanding byte and unicode string differences - for his own code 3. quality text processing is far more complex than novices might think even for the atomic operations like len() and comparisons.Alexandros
Categorically refusing 1. is in my view neglecting the problems people have out there especially with tons of legacy code - I dare to claim that much Py2 code out there was written by people driven by solving a specific problem outside of text processing - with tons of str() operations inside... Further, pretty fashionable languages like go and rust these days prove that it's possible to work in a 'utf-8 byte string sandwich' and use unicode functions only when needed, intermediately.Alexandros
Python is of course not go or rust :-) I can see that there are legacy projects but that doesn't mean that when they get to unicode handling they should just set a global configuration that can have unintended consequences. Ferreting out the subtle bugs this can introduce is going to take just as much work as gating those sections and just decoding your bytes to unicode objects at those points. That's at least the approach Plone is taking, for example.Allred
IMHO that's the best answer so far as it clearly shows the alternatives and consequences, as opposed to the dangerland! arguments. Thank you.Vorticella
M
2

Real-world example #1

It doesn't work in unit tests.

The test runner (nose, py.test, ...) initializes sys first, and only then discovers and imports your modules. By that time it's too late to change the default encoding.

By the same token, it doesn't work if someone runs your code as a module, as their initialisation comes first.

And yes, mixing str and unicode and relying on implicit conversion only pushes the problem further down the line.
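A rough sketch of the alternative, assuming UTF-8 input: convert explicitly at the point where bytes enter your code, so the result does not depend on who initialised the interpreter. The helper name to_text is hypothetical, not part of any library.

```python
import sys

# On Python 2, site.py deletes sys.setdefaultencoding during startup, and
# Python 3 never defines it, so code imported later normally cannot call it.
can_set_default = hasattr(sys, "setdefaultencoding")
print(can_set_default)

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings explicitly, pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# The conversion is spelled out, so it behaves the same under a test runner,
# when imported as a module, or when run as a script.
name = to_text(b"Jos\xc3\xa9")
print(name == u"Jos\u00e9")
```

Because the encoding is a parameter instead of global state, a test can exercise the function with any input without touching sys.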

Misology answered 25/2, 2015 at 14:42 Comment(9)
The unit test module imports the main module that calls sys.setdefaultencoding('utf-8'), so why doesn't it work?Rife
Also, can you provide a real example where sys.setdefaultencoding('utf-8') doesn't work if somebody runs it as a module?Rife
@techtonik by the time the tested module is imported, a bunch of other modules were imported and some other tests may have been run. In addition, stdio was already initialised with the system's true default encoding. It's arguable you should not change default encoding on import at all, e.g. pydoc won't work right. Furthermore you should reset system to original state after your tests are done. In summary, if you only test your code and nothing else, and you only use implicit conversion for own data and not e.g. stdio, yes it may just work for you. But only you.Misology
"stdio was already initialised with system true default encoding" - isn't it always ascii?Rife
It seems that the real problem in your case is that all your unit tests are sharing the same interpreter. If a unit test messes with global state, it should be isolated and run in a separate interpreter. But for application scope all unit tests are consistent and use the same sys.setdefaultencoding('utf-8'). Also, note that UTF-8 is critical for this question and it is backward compatible with ASCII.Rife
sys.setdefaultencoding() doesn't set input or output encoding; I think you misunderstood what the function does. It sets the codec used when implicitly encoding unicode to str or decoding str to unicode when mixing the types.Allred
Whether it works with unit tests or not is then dependent on the same factors as 3rd party libraries; if the code is relying on ASCII being the default then those tests may fail because that default was changed, globally.Allred
@techtonik re: mixing modules. Other modules are loaded first, they already imported sys. When your module runs, it's too late to change the encoding. Available hacks are sitecustomize.py and reload(sys). The former doesn't work with unit tests and is not composable. The latter is black magic, you're on your own.Misology
Indeed stdio is initialised based on PYTHONIOENCODING and locale. Thanks, @MartijnPieters.Misology
B
1

One thing we should know first:

Python 2 uses sys.getdefaultencoding() to encode and decode whenever it implicitly converts between str and unicode.

So if we change the default encoding, all kinds of incompatibility issues can appear, e.g.:

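# coding: utf-8
import sys

# Default encoding is ASCII: the UTF-8 bytes cannot be decoded implicitly,
# so the comparison is False (Python 2.7 also emits a UnicodeWarning).
print "你好" == u"你好"
# False

# reload(sys) restores setdefaultencoding(), which site.py deletes at startup.
reload(sys)
sys.setdefaultencoding("utf-8")

# The str is now implicitly decoded as UTF-8 before the comparison.
print "你好" == u"你好"
# True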

More examples:

That said, I remember a blog post suggesting to use unicode wherever possible and byte strings only when dealing with I/O. I think if you follow this convention, life will be much easier. More solutions can be found:
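As an illustration of that convention, here is a small sketch that decodes on input, works on text in the middle, and encodes on output. It assumes the data is UTF-8; the in-memory streams and the helper names read_text and write_text are invented for the example and stand in for real files or sockets.

```python
import io

def read_text(stream, encoding="utf-8"):
    """Read raw bytes from a binary stream and decode them once, at the edge."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding="utf-8"):
    """Encode text once, at the edge, and write the raw bytes."""
    stream.write(text.encode(encoding))

# Hypothetical input and output: in-memory binary streams.
raw_in = io.BytesIO(u"你好".encode("utf-8"))
raw_out = io.BytesIO()

text = read_text(raw_in)          # bytes -> unicode at the input boundary
greeting = text + u", world"      # all processing happens on unicode
write_text(raw_out, greeting)     # unicode -> bytes at the output boundary

print(len(text))                  # counts characters, not bytes
print(len(raw_out.getvalue()))    # counts encoded bytes
```

Nothing here relies on the default encoding, so the code behaves the same whether or not someone has called sys.setdefaultencoding().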

Billbillabong answered 30/6, 2016 at 7:56 Comment(11)
Is it possible to overload == operator for u-strings so that they always exit with an error when the implicit conversion like this occurs?Rife
No, you can't. In python there is no way to change the definition of builtin typeBillbillabong
From what I observe from the above, we 'must' use sys.setdefaultencoding("utf-8") all the time in order to make "你好" == u"你好" as True which is correctPinkard
@nehemiah: Exactly!! Just like 3 == 3.0 is also True. Equality is a statement about the information itself and not about which datatype it is wrapped into.Alexandros
2018 now and I still find it close to insane, that the same people who all the years refused to allow python the def.enc utf-8 switch, refused to repair broken behaviour like this, because it would be "dangerous".... >>> print "abc" == u"abc" => True >>> print "你bc" == u"你bc" => False ...are the same which, in their unicode sandwich idea, accept a silent decode('utf-8') in pretty much ANY I/O lib of Python3.Alexandros
@nehemiah Better not. FYI I have updated my answer to provide a solution.Billbillabong
@JiacaiLiu: utf8everywhere.org - the unicode sandwich idea, i.e. unnecessarily decode all text values at I/O (and leave it to the I/O libs to do decode('utf-8') silently, everywhere) is plain broken, compared to using unicode as an api when you need semantic meaning of values for humans, which is rarely the case in computing. Further: In times of microservices everywhere, I/O is everywhere and systems within processing pipelines care about presence of text values, not their semantic meaning for humans. Decoding makes no sense and is error prone, in 99%.Alexandros
@JiacaiLiu Which one did you mean by solution? I notice the only solution to interface with Unicode in Python 2 is by sys.setdefaultencoding("utf-8")Pinkard
@nehemiah pythonhosted.org/kitchen/…Billbillabong
@RedPill I agree with you, maybe we can use some libraries to help us deal with this. pythonhosted.org/kitchen/…Billbillabong
@JiacaiLiu kitchen is a well crafted library. Still, many of the "frustrations" addressed in your link are simply not present with the defaultencoding to utf-8 switch. The world has agreed on UTF-8 as omnipresent text data encoding meanwhile - and that is the reason why Python3 works at all: Check any I/O lib (redis, httpie, ...) and you'll see the .decode('utf-8') everywhere in order to pass values into their "unicode sandwich". With Py2 & dflt.encoding utf8 this all is not necessary, ideal world. One can use unicode as API where needed and proper conversion is done by the language.Alexandros
