Windows cmd encoding change causes Python crash
F

10

63

First I change Windows CMD encoding to utf-8 and run Python interpreter:

chcp 65001
python

Then I try to print a Unicode string inside it, and when I do this Python crashes in a peculiar way (I just get a cmd prompt in the same window).

>>> import sys
>>> print u'ëèæîð'.encode(sys.stdin.encoding)

Any ideas why it happens and how to make it work?

UPD: sys.stdin.encoding returns 'cp65001'

UPD2: It just came to me that the issue might be connected with the fact that utf-8 is a multi-byte encoding (kcwu made a good point on that). I tried running the whole example with 'windows-1250' and got 'ëeaî?'. Windows-1250 is a single-byte character set, so it worked for those characters it understands. However, I still have no idea how to make 'utf-8' work here.
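
To make the multi-byte point concrete, here is a minimal sketch, assuming Python 3 (unlike the Python 2 session above). It compares the byte length of the same string in UTF-8 and in a single-byte code page. The `errors="replace"` argument is an addition for illustration: Python's cp1250 codec turns unmappable characters into `?` and does not perform the best-fit substitution the console apparently did, so its output is not expected to match 'ëeaî?' exactly.

```python
text = "ëèæîð"

# UTF-8 needs two bytes for each of these characters.
utf8_bytes = text.encode("utf-8")

# cp1250 is single-byte; characters it lacks are replaced with b"?".
cp1250_bytes = text.encode("cp1250", errors="replace")

print(len(text), len(utf8_bytes), len(cp1250_bytes))
```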

UPD3: Oh, I found out it is a known Python bug. I guess what happens is that Python copies the cmd encoding as 'cp65001' to sys.stdin.encoding and tries to apply it to all the input. Since it fails to understand 'cp65001', it crashes on any input that contains non-ASCII characters.

Foushee answered 18/5, 2009 at 17:52 Comment(3)
can you print sys.stdin.encoding? what does it return?Venus
It's easy for python to know how to deal with the 'cp65001' codec: one has to add a line to Lib/encodings/aliases.py , mapping 'cp65001' to 'utf_8'. I created a patch for that, and also updated the bug you mention, Alex. There are still issues, though.Parris
related: Python, Unicode, and the Windows consoleRegalado
M
86

Update: On Python 3.6 or later, printing Unicode strings to the console on Windows just works.

In Python 3.8 or later the underlying bug described in this question has been fixed by making cp65001 an alias for utf-8, as pointed out in Boris Verkhovskiy's answer.

So essentially, upgrade to recent Python and you're done. At this point I recommend using 2to3 to update your code to Python 3.x if needed, and just dropping support for Python 2.x. Note that there has been no security support for any version of Python before 3.7 (including Python 2.7) since December 2021.

If you really still need to support earlier versions of Python (including Python 2.7), you can use https://github.com/Drekin/win-unicode-console , which was originally based on the code in this answer that uses WriteConsoleW.


Previous answer

Here's how to alias cp65001 to UTF-8 without changing encodings\aliases.py:

import codecs
codecs.register(lambda name: codecs.lookup('utf-8') if name == 'cp65001' else None)

(IMHO, don't pay any attention to the silliness about cp65001 not being identical to UTF-8 at http://bugs.python.org/issue6058#msg97731 . It's intended to be the same, even if Microsoft's codec has some minor bugs.)
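
A hedged way to see what the alias registration does, assuming Python 3: the sketch below registers a search function for a made-up encoding name, `demo_cp65001` (hypothetical, chosen because recent Python versions already know `cp65001`), and checks that encoding through it gives the same bytes as UTF-8.

```python
import codecs


def alias_search(name):
    # 'demo_cp65001' is a hypothetical name used only for this illustration.
    if name == "demo_cp65001":
        return codecs.lookup("utf-8")
    return None  # let other search functions handle every other name


codecs.register(alias_search)

encoded = "ëèæîð".encode("demo_cp65001")
print(encoded == "ëèæîð".encode("utf-8"))
```

A named function is used here instead of the lambda only so the behaviour can be commented; the mechanism is the same `codecs.register` call.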

Here is some code (written for Tahoe-LAFS, tahoe-lafs.org) that makes console output work regardless of the chcp code page, and also reads Unicode command-line arguments. Credit to Michael Kaplan for the idea behind this solution. If stdout or stderr are redirected, it will output UTF-8. If you want a Byte Order Mark, you'll need to write it explicitly.

[Edit: This version uses WriteConsoleW instead of the _O_U8TEXT flag in the MSVC runtime library, which is buggy. WriteConsoleW is also buggy relative to the MS documentation, but less so.]

import sys
if sys.platform == "win32":
    import codecs
    from ctypes import WINFUNCTYPE, windll, POINTER, byref, c_int
    from ctypes.wintypes import BOOL, HANDLE, DWORD, LPWSTR, LPCWSTR, LPVOID

    original_stderr = sys.stderr

    # If any exception occurs in this code, we'll probably try to print it on stderr,
    # which makes for frustrating debugging if stderr is directed to our wrapper.
    # So be paranoid about catching errors and reporting them to original_stderr,
    # so that we can at least see them.
    def _complain(message):
        print >>original_stderr, message if isinstance(message, str) else repr(message)

    # Work around <http://bugs.python.org/issue6058>.
    codecs.register(lambda name: codecs.lookup('utf-8') if name == 'cp65001' else None)

    # Make Unicode console output work independently of the current code page.
    # This also fixes <http://bugs.python.org/issue1602>.
    # Credit to Michael Kaplan <http://www.siao2.com/2010/04/07/9989346.aspx>
    # and TZOmegaTZIOY
    # <https://mcmap.net/q/25698/-windows-cmd-encoding-change-causes-python-crash/1432462#1432462>.
    try:
        # <http://msdn.microsoft.com/en-us/library/ms683231(VS.85).aspx>
        # HANDLE WINAPI GetStdHandle(DWORD nStdHandle);
        # returns INVALID_HANDLE_VALUE, NULL, or a valid handle
        #
        # <http://msdn.microsoft.com/en-us/library/aa364960(VS.85).aspx>
        # DWORD WINAPI GetFileType(DWORD hFile);
        #
        # <http://msdn.microsoft.com/en-us/library/ms683167(VS.85).aspx>
        # BOOL WINAPI GetConsoleMode(HANDLE hConsole, LPDWORD lpMode);

        GetStdHandle = WINFUNCTYPE(HANDLE, DWORD)(("GetStdHandle", windll.kernel32))
        STD_OUTPUT_HANDLE = DWORD(-11)
        STD_ERROR_HANDLE = DWORD(-12)
        GetFileType = WINFUNCTYPE(DWORD, DWORD)(("GetFileType", windll.kernel32))
        FILE_TYPE_CHAR = 0x0002
        FILE_TYPE_REMOTE = 0x8000
        GetConsoleMode = WINFUNCTYPE(BOOL, HANDLE, POINTER(DWORD))(("GetConsoleMode", windll.kernel32))
        INVALID_HANDLE_VALUE = DWORD(-1).value

        def not_a_console(handle):
            if handle == INVALID_HANDLE_VALUE or handle is None:
                return True
            return ((GetFileType(handle) & ~FILE_TYPE_REMOTE) != FILE_TYPE_CHAR
                    or GetConsoleMode(handle, byref(DWORD())) == 0)

        old_stdout_fileno = None
        old_stderr_fileno = None
        if hasattr(sys.stdout, 'fileno'):
            old_stdout_fileno = sys.stdout.fileno()
        if hasattr(sys.stderr, 'fileno'):
            old_stderr_fileno = sys.stderr.fileno()

        STDOUT_FILENO = 1
        STDERR_FILENO = 2
        real_stdout = (old_stdout_fileno == STDOUT_FILENO)
        real_stderr = (old_stderr_fileno == STDERR_FILENO)

        if real_stdout:
            hStdout = GetStdHandle(STD_OUTPUT_HANDLE)
            if not_a_console(hStdout):
                real_stdout = False

        if real_stderr:
            hStderr = GetStdHandle(STD_ERROR_HANDLE)
            if not_a_console(hStderr):
                real_stderr = False

        if real_stdout or real_stderr:
            # BOOL WINAPI WriteConsoleW(HANDLE hOutput, LPWSTR lpBuffer, DWORD nChars,
            #                           LPDWORD lpCharsWritten, LPVOID lpReserved);

            WriteConsoleW = WINFUNCTYPE(BOOL, HANDLE, LPWSTR, DWORD, POINTER(DWORD), LPVOID)(("WriteConsoleW", windll.kernel32))

            class UnicodeOutput:
                def __init__(self, hConsole, stream, fileno, name):
                    self._hConsole = hConsole
                    self._stream = stream
                    self._fileno = fileno
                    self.closed = False
                    self.softspace = False
                    self.mode = 'w'
                    self.encoding = 'utf-8'
                    self.name = name
                    self.flush()

                def isatty(self):
                    return False

                def close(self):
                    # don't really close the handle, that would only cause problems
                    self.closed = True

                def fileno(self):
                    return self._fileno

                def flush(self):
                    if self._hConsole is None:
                        try:
                            self._stream.flush()
                        except Exception as e:
                            _complain("%s.flush: %r from %r" % (self.name, e, self._stream))
                            raise

                def write(self, text):
                    try:
                        if self._hConsole is None:
                            if isinstance(text, unicode):
                                text = text.encode('utf-8')
                            self._stream.write(text)
                        else:
                            if not isinstance(text, unicode):
                                text = str(text).decode('utf-8')
                            remaining = len(text)
                            while remaining:
                                n = DWORD(0)
                                # There is a shorter-than-documented limitation on the
                                # length of the string passed to WriteConsoleW (see
                                # <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1232>).
                                retval = WriteConsoleW(self._hConsole, text, min(remaining, 10000), byref(n), None)
                                if retval == 0 or n.value == 0:
                                    raise IOError("WriteConsoleW returned %r, n.value = %r" % (retval, n.value))
                                remaining -= n.value
                                if not remaining:
                                    break
                                text = text[n.value:]
                    except Exception as e:
                        _complain("%s.write: %r" % (self.name, e))
                        raise

                def writelines(self, lines):
                    try:
                        for line in lines:
                            self.write(line)
                    except Exception as e:
                        _complain("%s.writelines: %r" % (self.name, e))
                        raise

            if real_stdout:
                sys.stdout = UnicodeOutput(hStdout, None, STDOUT_FILENO, '<Unicode console stdout>')
            else:
                sys.stdout = UnicodeOutput(None, sys.stdout, old_stdout_fileno, '<Unicode redirected stdout>')

            if real_stderr:
                sys.stderr = UnicodeOutput(hStderr, None, STDERR_FILENO, '<Unicode console stderr>')
            else:
                sys.stderr = UnicodeOutput(None, sys.stderr, old_stderr_fileno, '<Unicode redirected stderr>')
    except Exception as e:
        _complain("exception %r while fixing up sys.stdout and sys.stderr" % (e,))


    # While we're at it, let's unmangle the command-line arguments:

    # This works around <http://bugs.python.org/issue2128>.
    GetCommandLineW = WINFUNCTYPE(LPWSTR)(("GetCommandLineW", windll.kernel32))
    CommandLineToArgvW = WINFUNCTYPE(POINTER(LPWSTR), LPCWSTR, POINTER(c_int))(("CommandLineToArgvW", windll.shell32))

    argc = c_int(0)
    argv_unicode = CommandLineToArgvW(GetCommandLineW(), byref(argc))

    argv = [argv_unicode[i].encode('utf-8') for i in xrange(0, argc.value)]

    if not hasattr(sys, 'frozen'):
        # If this is an executable produced by py2exe or bbfreeze, then it will
        # have been invoked directly. Otherwise, argv_unicode[0] is the Python
        # interpreter, so skip that.
        argv = argv[1:]

        # Also skip option arguments to the Python interpreter.
        while len(argv) > 0:
            arg = argv[0]
            if not arg.startswith(u"-") or arg == u"-":
                break
            argv = argv[1:]
            if arg == u'-m':
                # sys.argv[0] should really be the absolute path of the module source,
                # but never mind
                break
            if arg == u'-c':
                argv[0] = u'-c'
                break

    # if you like:
    sys.argv = argv

Finally, it is possible to grant ΤΖΩΤΖΙΟΥ's wish to use DejaVu Sans Mono, which I agree is an excellent font, for the console.

You can find information on the font requirements and how to add new fonts for the Windows console in the Microsoft KB article 'Necessary criteria for fonts to be available in a command window'.

But basically, on Vista (probably also Win7):

  • under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont, set "0" to "DejaVu Sans Mono";
  • for each of the subkeys under HKEY_CURRENT_USER\Console, set "FaceName" to "DejaVu Sans Mono".

On XP, check the thread 'Changing Command Prompt fonts?' in LockerGnome forums.

Mindexpanding answered 15/7, 2010 at 19:35 Comment(9)
+1 because your answer is worthy, plus a virtual +1 for the font issue suggestion, even though it's too late (I and Windows have had a break-up with lots of fights; I don't think we'll ever be together again but for brief encounters at friends' computers :) Thanks.Parris
@David-Sarah: Thanks for the very useful code! Do you happen to know if there's a corresponding way to fix console input (so that e.g. copy-pasted unicode characters Just Work, irrespective of codepage etc.) This would presumably involve ReadConsoleW?Delude
That is possible, and indeed it would use ReadConsoleW. I was originally going to write that code but I haven't been using Windows for some time now. If your interest is in Python 3, the relevant bug is bugs.python.org/issue1602 , although it doesn't have a solution for input yet. (A patch for that bug would depend on Python 3 internals and wouldn't be easily adaptable to Python 2.x.)Mindexpanding
I've got IOError: [Errno 0] ErrorOverabundance
I am trying to solve the problems around bugs.python.org/issue1602 for Python 3 in my project github.com/Drekin/win-unicode-console. The package is on PyPI: pypi.python.org/pypi/win_unicode_console. It builds on code from the issue, which in turn originates from your code here.Polyneuritis
This works well but ANSI color escape codes are ignored. Some comments elsewhere suggest that WriteConsoleW should still handle the ANSI escapes. Is there any known way to fix this?Accumulate
@KevinThibedeau: use colorama package.Polyneuritis
@user87690. I am using Colorama. It does not work when cp65001 is active.Accumulate
@KevinThibedeau: I don't know why cp65001 should matter. On the other hand, there is no need to use cp65001.Polyneuritis
W
48

Set the PYTHONIOENCODING environment variable:

> chcp 65001
> set PYTHONIOENCODING=utf-8
> python example.py
Encoding is utf-8

Source of example.py is simple:

import sys
print "Encoding is", sys.stdin.encoding
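
The same check can be sketched from Python 3 itself. This is an illustration rather than part of the original answer: it starts a child interpreter with PYTHONIOENCODING set and reads back the encoding the child reports for its standard output.

```python
import os
import subprocess
import sys

# Copy the current environment and add the override for the child only.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env,
    capture_output=True,
    text=True,
)
reported = result.stdout.strip()
print("Child stdout encoding is", reported)
```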
Wheelock answered 11/10, 2012 at 7:26 Comment(8)
I tried this in Python 2.7.5, and while sys.stdin.encoding and sys.stdout.encoding both said utf-8 it didn't generate the proper output. It showed each byte of output as individual characters instead of combining them into codepoints.Lushy
python -c "import sys; print('Encoding='+sys.stdin.encoding)" instead of making a file.Ardath
This one somehow worked for me on Windows 7.0 x64. In my case encoding was 720.Parts
I run a MobaSSH server on Windows 10 and work on it remotely over SSH. Python didn't work at all with the OpenSSH v6.7 server that comes installed with the Windows 10 VM. With the MobaSSH server it works better, but Python gave this error:Unequaled
Fatal Python error: Py_Initialize: can't initialize sys standard streams LookupError: unknown encoding: cp28591 Current thread 0x00000874 (most recent call first):Unequaled
Although the OP didn't mention this specific error, and I didn't find anyone else reporting it, I did encounter it. After following the above steps the problem was resolved.Unequaled
Thought I would leave it here, just in case somebody comes across the same problem: serverfault.com/questions/901041/…Unequaled
How can I set this permanently? I have the same problem each time I reload the conda prompt.Censure
K
7

For me, setting this environment variable before running the Python program worked:

set PYTHONIOENCODING=utf-8
Kino answered 23/4, 2018 at 7:23 Comment(0)
N
2

Do you want Python to encode to UTF-8?

>>> print u'ëèæîð'.encode('utf-8')
ëèæîð

Python will not recognize cp65001 as UTF-8.

Nekton answered 18/5, 2009 at 18:21 Comment(0)
P
2

I had this annoying issue, too, and I hated not being able to run my Unicode-aware scripts the same way on MS Windows as on Linux. So, I managed to come up with a workaround.

Take this script (say, uniconsole.py in your site-packages or whatever):

import sys, os

if sys.platform == "win32":
    class UniStream(object):
        __slots__= ("fileno", "softspace",)

        def __init__(self, fileobject):
            self.fileno = fileobject.fileno()
            self.softspace = False

        def write(self, text):
            os.write(self.fileno, text.encode("utf_8") if isinstance(text, unicode) else text)

    sys.stdout = UniStream(sys.stdout)
    sys.stderr = UniStream(sys.stderr)

This seems to work around the python bug (or win32 unicode console bug, whatever). Then I added in all related scripts:

try:
    import uniconsole
except ImportError:
    sys.exc_clear()  # could be just pass, of course
else:
    del uniconsole  # reduce pollution, not needed anymore

Finally, I just run my scripts as needed in a console where chcp 65001 is run and the font is Lucida Console. (How I wish that DejaVu Sans Mono could be used instead… but hacking the registry and selecting it as a console font reverts to a bitmap font.)

This is a quick-and-dirty stdout and stderr replacement, and also does not handle any raw_input related bugs (obviously, since it doesn't touch sys.stdin at all). And, by the way, I've added the cp65001 alias for utf_8 in the encodings\aliases.py file of the standard lib.
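
For readers on Python 3, the same idea can be sketched as follows. This is an assumption-laden illustration, not the original script: the class name `Utf8FdStream` is made up, a no-op `flush()` is included because some callers expect one, and a pipe stands in for the console so the result can be inspected.

```python
import os


class Utf8FdStream:
    """Minimal file-like wrapper that writes text to a descriptor as UTF-8."""

    def __init__(self, fd):
        self._fd = fd

    def write(self, text):
        data = text.encode("utf-8") if isinstance(text, str) else bytes(text)
        os.write(self._fd, data)
        return len(text)

    def flush(self):
        # os.write is unbuffered, so there is nothing to flush.
        pass

    def fileno(self):
        return self._fd


# Use a pipe in place of the console to observe the bytes that are written.
read_fd, write_fd = os.pipe()
stream = Utf8FdStream(write_fd)
stream.write("ëèæîð\n")
os.close(write_fd)
received = os.read(read_fd, 1024)
os.close(read_fd)
print(received)
```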

Parris answered 16/9, 2009 at 11:42 Comment(3)
This is working perfectly! Also, add at least an empty def flush(self): pass to the class for it to be compatible with stderr/stdout (possibly more methods are missing, but Twisted only complained about .flush() missing).Excusatory
After having used your snippet, looks like David-Sarah Hopwood's snippet works more universally.Excusatory
_DebuggerOutput has no attribute filenoVicechancellor
P
1

For the "unknown encoding: cp65001" issue, you can create a new environment variable named PYTHONIOENCODING with the value UTF-8. (This works for me.)


Password answered 8/12, 2017 at 7:3 Comment(0)
T
1

The problem has been solved and addressed in this thread:

Change the system encoding

The solution is to deselect the "Use Unicode UTF-8 for worldwide language support" option in Windows. This requires a restart, after which your Python should be back to normal.

Steps for Win:

  1. Go to Control Panel
  2. Select Clock and Region
  3. Click Region > Administrative
  4. Under Language for non-Unicode programs, click “Change system locale”
  5. In the “Region Settings” window that pops up, untick “Beta: Use Unicode UTF-8...”
  6. Restart the machine as per the Win prompt


Trogon answered 22/1, 2019 at 15:42 Comment(6)
A link to a solution is welcome, but please ensure your answer is useful without it: add context around the link so your fellow users will have some idea what it is and why it’s there, then quote the most relevant part of the page you're linking to in case the target page is unavailable. Answers that are little more than a link may be deleted.Iso
Instead of posting an answer which merely links to another answer, please instead flag the question as a duplicate.Iso
Thank you @Zoe . I have added more information into my response, as unfortunately my current reputation does not allow me to flag the question as a duplicate. But will certainly take a note for the future.Trogon
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From ReviewSkin
Thanks @Robert. I have added the details, in case link becomes invalid. Tried adding a picture, but again my reputation does not permit this feature yet.Trogon
Finally! A solution that works! I'll never get those wasted hours back but at least I can move on to more interesting problems.Custommade
T
1

Starting with Python 3.8, the encoding cp65001 is an alias for utf-8:

https://docs.python.org/library/codecs.html#standard-encodings
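
A quick way to confirm this on your own interpreter, assuming Python 3.8 or later (on older versions the lookup may raise LookupError or return a different codec):

```python
import codecs
import sys

# On Python 3.8+, 'cp65001' resolves to the built-in UTF-8 codec.
info = codecs.lookup("cp65001")
print(sys.version_info[:2], info.name)
```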

Tarshatarshish answered 7/2, 2020 at 14:14 Comment(0)
M
0

This is because the "code page" of cmd is different from the "mbcs" (ANSI code page) of the system. Although you changed the "code page", Python (actually, Windows) still thinks your "mbcs" hasn't changed.
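
To see the two values side by side, a rough diagnostic along these lines may help. It is a sketch assuming Python 3; the function name `describe_encodings` is made up. On Windows it queries the console output code page and the ANSI code page through the Win32 functions `GetConsoleOutputCP` and `GetACP`; on other platforms those two entries stay `None`.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python and (on Windows) the console report."""
    report = {
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "console_output_cp": None,
        "ansi_cp": None,
    }
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.windll.kernel32
        report["console_output_cp"] = kernel32.GetConsoleOutputCP()
        report["ansi_cp"] = kernel32.GetACP()
    return report


report = describe_encodings()
for key, value in report.items():
    print(key, value)
```

After `chcp 65001`, the console output code page would be expected to read 65001 while the ANSI code page keeps its previous value, which is the mismatch described above.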

Merylmes answered 18/5, 2009 at 18:12 Comment(0)
O
0

A few comments: you probably misspelled encodig and .code. Here is my run of your example.

C:\>chcp 65001
Active code page: 65001

C:\>\python25\python
...
>>> import sys
>>> sys.stdin.encoding
'cp65001'
>>> s=u'\u0065\u0066'
>>> s
u'ef'
>>> s.encode(sys.stdin.encoding)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: cp65001
>>>

The conclusion - cp65001 is not a known encoding for python. Try 'UTF-16' or something similar.
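
If the goal is simply not to fail when the reported encoding is unknown, one defensive pattern is to catch the LookupError and fall back to UTF-8. The sketch below assumes Python 3; the helper name `encode_with_fallback` and the demo encoding name are hypothetical.

```python
def encode_with_fallback(text, encoding, fallback="utf-8"):
    """Encode text, falling back when the interpreter does not know the codec."""
    try:
        # 'encoding' may be None (for example when a stream has no encoding).
        return text.encode(encoding or fallback)
    except LookupError:
        return text.encode(fallback)


# 'no-such-codec-for-demo' is a made-up name that no codec is registered for.
encoded = encode_with_fallback("ef", "no-such-codec-for-demo")
print(encoded)
```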

Obryant answered 18/5, 2009 at 18:29 Comment(1)
Yes, I definitely misspelled it, but I tried it the right way and got the same crash (this actually proves that the interpreter didn't get to evaluate the misspelled 'encode()' and 'encoding()' attributes and crashed while processing 'ëèæîð'). I fixed the typo.Foushee
