Python: Which encoding is used for processing sys.argv?
V

7

24

In what encoding are the elements of sys.argv in Python? Are they encoded with the sys.getdefaultencoding() encoding?

sys.getdefaultencoding(): Return the name of the current default string encoding used by the Unicode implementation.

PS: As pointed out in some of the answers, sys.stdin.encoding would indeed be a better guess. I would love to see a definitive answer to this question, though, with pointers to solid sources!

PPS: As Wim pointed out, Python 3 solves this issue by putting str objects in sys.argv (if I understand correctly). The question remains open for Python 2.x, though. Under Unix, the LC_CTYPE environment variable seems to be the correct thing to check, no? What should be done with Windows (so that sys.argv elements are correctly interpreted whatever the console)?

Voiceful answered 25/10, 2010 at 7:23 Comment(0)
F
4

"What should be done with Windows (so that sys.argv elements are correctly interpreted whatever the console)?"

For Python 2.x, see this comment on issue2128.

(Note that no encoding is correct for the original sys.argv, because some characters may have been mangled in ways that there is not enough information to undo; for example, if the ANSI codepage cannot represent Greek alpha then it will be mangled to 'a'.)
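
The information loss described here can be illustrated without Windows. This is only a sketch for Python 3, assuming cp1252 as the ANSI code page. Python's cp1252 codec substitutes '?' instead of applying the Windows best-fit mapping to 'a', but the consequence is the same: the original character cannot be recovered afterwards.

```python
# GREEK SMALL LETTER ALPHA has no representation in cp1252.
original = "\u03b1"

# errors="replace" stands in for the lossy conversion done by the OS;
# Python's codec emits "?" where Windows best-fit mapping would emit "a".
mangled = original.encode("cp1252", errors="replace")

# Decoding with the matching code page does not bring alpha back.
recovered = mangled.decode("cp1252")
print(mangled, recovered == original)  # b'?' False
```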

Freeforall answered 10/1, 2011 at 1:37 Comment(1)
Marked as accepted: this new comment on issue 2128 is new information! Thank you! - Voiceful
M
8

I'm guessing that you are asking this because you ran into issue 2128. Note that this has been fixed in Python 3.0.

Mraz answered 3/11, 2010 at 9:44 Comment(2)
Thank you, I'll check the link. I am actually asking the question preventively, before writing a program that takes user messages from the command line. - Voiceful
What about Python 2.x? And Windows? - Voiceful
H
6

A few observations:

(1) It's certainly not sys.getdefaultencoding.

(2) sys.stdin.encoding appears to be a much better bet.

(3) On Windows, the actual value of sys.stdin.encoding will vary, depending on what software is providing the stdio. IDLE will use the system "ANSI" code page, e.g. cp1252 in most of Western Europe and America and former colonies thereof. However in the Command Prompt window, which emulates MS-DOS more or less, the corresponding old DOS code page (e.g. cp850) will be used by default. This can be changed by using the CHCP (change code page) command.

(4) The documentation for the subprocess module doesn't provide any suggestions on what encoding to use for args and stdout.

(5) One trusts that assert sys.stdin.encoding == sys.stdout.encoding never fails.
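
To compare the candidates from these observations on a given machine, a small report like the following may help. It is a sketch, not a definitive diagnostic; `report_encodings` is a hypothetical helper name, and `getattr` is used because a stream may be missing or may have no encoding (for example when output is redirected).

```python
import locale
import sys


def report_encodings():
    """Collect the encodings discussed above into a dict for comparison."""
    return {
        "default": sys.getdefaultencoding(),
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }


for name, value in sorted(report_encodings().items()):
    print("%-10s %s" % (name, value))
```

Running it once in a console and once with output redirected to a file shows whether the stdin and stdout encodings agree in both cases.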

Hassler answered 25/10, 2010 at 9:38 Comment(3)
The observations seem to be correct, I have also observed the same. Do you have any idea of what exactly the sys.getdefaultencoding returns? - Calorimeter
"It returns the name of the current default string encoding used by the Unicode implementation." I think it means that Python uses the defaultencoding() in its console. You can override the defaultencoding() by prepending u' by the way. Great answer +1 - Incapable
I agree about (2)--I thought of it later. (5) is actually not true: under Unix, python test.py > test.txt can for instance have UTF-8 for the stdin encoding and None for the stdout encoding. - Voiceful
I
5

I don't know if this helps or not but this is what I get in DOS mode:

C:\Python27>python Lib\codingtest.py нер
['Lib\\codingtest.py', '\xed\xe5\xf0']

C:\Python27>python Lib\codingtest.py hello
['Lib\\codingtest.py', 'hello']

In IDLE:

>>> print "hello"
hello
>>> "hello"
'hello'
>>> "привет"
'\xef\xf0\xe8\xe2\xe5\xf2'
>>> print "привет"
привет
>>> sys.getdefaultencoding()
'ascii'
>>> 

What can we deduce from this? I don't know yet... I'll comment in a little bit.

A little bit later: sys.argv is encoded with sys.stdin.encoding and not sys.getdefaultencoding()
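
The bytes in the Command Prompt transcript above can be checked against a few candidate code pages. This is a sketch for Python 3; the list of candidates is my own assumption about which code pages are plausible on a Russian-language Windows system, not something stated in the transcript.

```python
# Bytes reported for the argument in the Command Prompt transcript.
raw = b"\xed\xe5\xf0"

# The three Cyrillic letters that were typed (en, ie, er), written as escapes.
typed = "\u043d\u0435\u0440"

# Assumed candidates: Windows ANSI Cyrillic, DOS Cyrillic, DOS Latin, ANSI Latin.
candidates = ("cp1251", "cp866", "cp850", "cp1252")

# Keep only the code pages under which the bytes decode to the typed text.
matches = [codec for codec in candidates if raw.decode(codec) == typed]
print(matches)  # ['cp1251']
```

Among these candidates only cp1251 reproduces the typed text, so on that machine the argument bytes match the Windows Cyrillic code page rather than the DOS one.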

Incapable answered 25/10, 2010 at 7:46 Comment(1)
\xef is the cp1251 (Windows Cyrillic) encoding of CYRILLIC SMALL LETTER PE ('п'), thus I'm beginning to believe that sys.argv is encoded with sys.stdin.encoding and not sys.getdefaultencoding() - Incapable
B
4

On Unix systems, it should be in the user's locale, which is (strangely) not tied to sys.getdefaultencoding. See http://docs.python.org/library/locale.html.

In Windows, it'll be in the system ANSI codepage.

(By the way, those elementary school teachers who told you not to end a sentence with a preposition were lying to you.)
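
For the locale-based approach described in this answer, decoding could look roughly like this. It is a sketch under assumptions: `decode_argument` is a hypothetical helper, `locale.getpreferredencoding(False)` is used as the stand-in for "the user's locale", and undecodable bytes are replaced rather than reported. On Python 3 the arguments are already text and are returned unchanged.

```python
import locale
import sys


def decode_argument(arg, encoding=None):
    """Return arg as text; byte strings are decoded with the locale encoding."""
    if not isinstance(arg, bytes):
        return arg  # already text (Python 3, or a unicode object on Python 2)
    # Assumption: fall back to the locale's preferred encoding, then ASCII.
    encoding = encoding or locale.getpreferredencoding(False) or "ascii"
    return arg.decode(encoding, "replace")


# Decode everything after the script name.
arguments = [decode_argument(arg) for arg in sys.argv[1:]]
```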

Babette answered 25/10, 2010 at 7:34 Comment(5)
Dangling prepositions is something up with which I shall not put. The supposed stricture against the dangling preposition apparently evolved from an observation on style. To wit, the first and last words of a sentence are those which have the most natural impact. Thus it was considered to be stylistically weak for a mere preposition to be placed in such a strategically important location. - Trentontrepan
@Jim: Style is all well and good, but some people seem to have this silly notion that it's ungrammatical, leading to such goofiness as the title of this question. - Babette
The title of this question seems clear enough though I might have suggested the use of which rather than "what." A more precise phrasing might be: "Which encoding is used for processing sys.argv?" The whole issue of text encoding has gotten rather complicated by all these attempts to accommodate international character sets while preserving some of the simple ASCII string handling. The terminology surrounding the whole affair has become similarly convoluted. - Trentontrepan
@Jim: The point--which was nothing but an amused aside, of course--was that writing that sentence naturally is perfectly fine: "What encoding is sys.argv in?". "In what encoding" isn't unclear, it's just peculiar and unnatural. - Babette
For reference: I guess that this answer refers to locale.getdefaultlocale()[1] (docs.python.org/2/library/locale.html#locale.getdefaultlocale). - Voiceful
R
1

As per https://docs.python.org/3/library/sys.html#sys.argv

On Unix, the bytes received from the OS are decoded into the str objects in sys.argv with sys.getfilesystemencoding(), using the error handler reported by sys.getfilesystemencodeerrors().

See also https://www.python.org/dev/peps/pep-0383/, which explains how byte sequences that are not valid in that encoding (for example non-UTF-8 bytes when the encoding is "utf-8") are preserved by using surrogateescape as the error handler.

Of interest might also be os.fsdecode and os.fsencode.
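
The round trip that PEP 383 makes possible can be sketched as follows (Python 3). The sample bytes and the explicit "utf-8" encoding are assumptions chosen for illustration; a real program would rely on os.fsdecode and os.fsencode, which use the interpreter's own filesystem encoding and error handler.

```python
# A byte string that is not valid UTF-8, as a Unix shell could pass it.
raw = b"caf\xe9"

# Undecodable bytes become lone surrogates instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))  # 'caf\udce9'

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True
```

When the original bytes of the arguments are needed, the sys.argv documentation suggests `[os.fsencode(arg) for arg in sys.argv]`.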

Ripsaw answered 31/1, 2021 at 5:53 Comment(0)
H
0

sys.getfilesystemencoding() works for me, at least on Windows. On Windows it is actually 'mbcs', and 'utf-8' on *nix.

Hymanhymen answered 9/12, 2016 at 16:18 Comment(1)
Problem is, Windows has TWO correct codepages. GetACP() for GUI programs and GetOEMCP() for text programs. For many languages those have the same values, but not for all... - Stalemate
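
Both code pages from this comment can be queried through ctypes. This is a Windows-only sketch; `windows_code_pages` is a hypothetical helper, and it returns None on other platforms instead of guessing.

```python
import sys


def windows_code_pages():
    """Return (ANSI code page, OEM code page) on Windows, else None."""
    if sys.platform != "win32":
        return None
    import ctypes  # windll is only available on Windows

    kernel32 = ctypes.windll.kernel32
    return kernel32.GetACP(), kernel32.GetOEMCP()


print(windows_code_pages())
```

If the two numbers differ, the bytes of a command-line argument and the bytes read from the console may need different codecs.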
