Consequences of setting `LC_ALL=C.UTF-8` and `LANG=C.UTF-8`

To fix the bug with packaging a Python application as a snap, I am ready to add this code:

# I don't know what I am doing
export LC_ALL=C.UTF-8
export LANG=C.UTF-8

There is a lot of text that seems to explain what LC_ALL=C does (but not LC_ALL=C.UTF-8 or LANG=C.UTF-8), and a long text that explains the bug and the Python behavior. But neither fits in my small head. Usually I enjoy wrapping my head around the gory technical details, but lately time pressure has made me rather ignorant.

I just want to know what the phrase "This system supports the C.UTF-8 locale" means, and what will happen if I switch to it (which I guess is done by setting those environment variables).

Albano answered 9/4, 2019 at 1:40 Comment(2)
In Python 3, the encoding of sys.std* is set at runtime through a heuristic involving environment variables like LC_ALL. If I understand your case correctly, you can check whether this works by inspecting the value of locale.getpreferredencoding(). It should be something like "UTF-8". - Disrepair
Note: you should check whether your system supports the C.UTF-8 locale. It is becoming obsolete now, since C is UTF-8 on many systems. On some systems the locale name is spelled "UTF8" and on others "UTF-8" (Python supports both spellings, but the locale utilities do not). locale -a shows which locales you have installed. UTF-8 locales will break many utilities when they process non-UTF-8 text (that is, invalid sequences). - Lynnett

The "C" locale turns off all internationalization: status and error messages are in English, there is no distinction between characters and bytes, and sorting is by raw byte values. The meaning of bytes outside the ASCII range is not defined.

This works mostly OK for a program that works entirely with bytes: it can read those bytes, process them, and output them again without caring what exactly byte values in the range 0x80-0xFF mean.
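To make "sorting by raw byte values" concrete, here is a small sketch that imitates that ordering in Python by sorting on the encoded bytes. It does not call into the C library's collation at all; the word list and the helper name byte_order_key are made up for illustration.

```python
# Imitate byte-value ordering (what the "C" locale uses for sorting)
# by comparing the encoded bytes of each word.
words = ["apple", "Banana", "cherry", "Zebra"]


def byte_order_key(word: str) -> bytes:
    """Hypothetical helper: sort key that compares raw byte values."""
    return word.encode("utf-8")


byte_sorted = sorted(words, key=byte_order_key)
case_folded = sorted(words, key=str.casefold)

print(byte_sorted)  # ['Banana', 'Zebra', 'apple', 'cherry']
print(case_folded)  # ['apple', 'Banana', 'cherry', 'Zebra']
```

All uppercase ASCII letters have smaller byte values than lowercase ones, which is why "Zebra" lands before "apple" in the byte-ordered result. A language-specific locale would usually give something closer to the second list.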

However, it causes big problems for Python 3's "convert everything to Unicode" approach. If you don't know what byte values in the range 0x80-0xFF mean, then you can't correctly convert them to Unicode. Python 3 decides to raise an error in this case rather than make a potentially incorrect assumption.
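The conversion problem can be reproduced without touching the locale at all. The following is only a sketch: it decodes the same bytes once as ASCII (roughly what an encoding-less "C" locale implies) and once as UTF-8. The helper try_decode is a made-up name for illustration.

```python
# The UTF-8 bytes for "café": the last character takes two bytes (0xC3 0xA9).
raw = b"caf\xc3\xa9"


def try_decode(data: bytes, encoding: str):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_ascii = try_decode(raw, "ascii")  # bytes above 0x7F have no ASCII meaning
as_utf8 = try_decode(raw, "utf-8")   # the same bytes are valid UTF-8

print(len(raw))          # 5 (bytes)
print(as_ascii is None)  # True (decoding failed)
print(len(as_utf8))      # 4 (characters)
```

The byte count and the character count differ as soon as the encoding is known, which is exactly the distinction the plain "C" locale does not make.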

Using language-specific locales in widely distributed scripts is problematic too, though. First, you can't be sure the locale for any particular language will be present on every system where your script will run. Second, language-specific locales may have other settings that the script author finds undesirable.

C.UTF-8 keeps most of the characteristics of the C locale, but specifies a UTF-8 encoding.
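As for what "this system supports the C.UTF-8 locale" means in practice, one way to probe it from Python is to try selecting the locale and see whether the C library accepts it. This is a rough sketch rather than a definitive check: supports_locale is a made-up helper name, and the answer depends on the C library and on which locales are installed.

```python
import locale
import sys


def supports_locale(name: str) -> bool:
    """Hypothetical helper: True if the C library accepts this locale name."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        # Put the previous setting back so the probe has no side effects.
        locale.setlocale(locale.LC_CTYPE, saved)


print("C.UTF-8 available:", supports_locale("C.UTF-8"))
print("preferred encoding:", locale.getpreferredencoding(False))
print("stdout encoding:", sys.stdout.encoding)
```

Running this once without the two export lines and once with them should show whether the preferred encoding and the stdout encoding change to UTF-8 on a given system.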

Overthrow answered 24/12, 2021 at 6:30 Comment(0)
