The output character encoding may depend on specific commands, e.g.:
#!/usr/bin/env python3
import subprocess
import sys

encoding = 'utf-32'
# configure the child's stdout encoding via PYTHONIOENCODING inside the PowerShell session
cmd = r'''$env:PYTHONIOENCODING = "%s"; py -3 -c "print('\u270c')"''' % encoding
data = subprocess.check_output(["powershell", "-C", cmd])
print(sys.stdout.encoding)           # encoding of the parent script's stdout
print(data)                          # raw bytes received from the pipe
print(ascii(data.decode(encoding)))  # decode using the configured encoding
Output
cp437
b"\xff\xfe\x00\x00\x0c'\x00\x00\r\x00\x00\x00\n\x00\x00\x00"
'\u270c\r\n'
The ✌ (U+270C) character is received successfully.
The character encoding of the child script is set using the PYTHONIOENCODING environment variable inside the PowerShell session. I've chosen utf-32 for the output encoding so that it is different from both the Windows ANSI and OEM code pages, for the sake of the demonstration.
Notice that the stdout encoding of the parent Python script is the OEM code page (cp437 in this case) -- the script is run from the Windows console. If you redirect the output of the parent Python script to a file or a pipe then the ANSI code page (e.g., cp1252) is used by default in Python 3.
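To see which code page applies in each case, here's a minimal sketch (locale.getpreferredencoding(False) is what Python 3 consults for redirected output, as mentioned in the comments below):
#!/usr/bin/env python3
# A minimal sketch: compare the encoding of the current stdout with the
# locale-preferred (ANSI) encoding used for redirected output in Python 3.
import locale
import sys

print(sys.stdout.encoding)                 # OEM code page (e.g., cp437) in a console
print(locale.getpreferredencoding(False))  # ANSI code page (e.g., cp1252)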
To decode powershell output that might contain characters undecodable in the current OEM code page, you could set [Console]::OutputEncoding temporarily (inspired by @eryksun's comments):
#!/usr/bin/env python3
import io
import sys
from subprocess import Popen, PIPE

char = ord('✌')
filename = 'U+{char:04x}.txt'.format(**vars())
with Popen(["powershell", "-C", '''
  $old = [Console]::OutputEncoding
  [Console]::OutputEncoding = [Text.Encoding]::UTF8
  echo $([char]0x{char:04x}) | fl
  echo $([char]0x{char:04x}) | tee {filename}
  [Console]::OutputEncoding = $old'''.format(**vars())],
           stdout=PIPE) as process:
    print(sys.stdout.encoding)
    for line in io.TextIOWrapper(process.stdout, encoding='utf-8-sig'):
        print(ascii(line))
print(ascii(open(filename, encoding='utf-16').read()))
Output
cp437
'\u270c\n'
'\u270c\n'
'\u270c\n'
Both fl and tee use [Console]::OutputEncoding for stdout (the default behavior is as if | Write-Output were appended to the pipelines). tee uses utf-16 to save the text to a file. The output shows that ✌ (U+270C) is decoded successfully.
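For contrast, without changing [Console]::OutputEncoding the same fl pipeline yields the fallback character instead of the UTF-8 bytes. A sketch of the failure mode (the expected byte assumes the current OEM code page is cp437):
#!/usr/bin/env python3
# A sketch of the failure mode: with the default [Console]::OutputEncoding
# (the OEM code page), ✌ can't be encoded and the fallback character is
# emitted instead of the UTF-8 sequence.
from subprocess import check_output

data = check_output(["powershell", "-C", "echo $([char]0x270c) | fl"])
print(data)  # expected: b'?\r\n' under cp437 (assumption)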
$OutputEncoding is used to decode bytes in the middle of a pipeline:
#!/usr/bin/env python3
import subprocess
cmd = r'''
$OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
py -3 -c "import os; os.write(1, '\U0001f60a'.encode('utf-8')+b'\n')" |
py -3 -c "import os; print(os.read(0, 512))"
'''
subprocess.check_call(["powershell", "-C", cmd])
Output
b'\xf0\x9f\x98\x8a\r\n'
that is correct: b'\xf0\x9f\x98\x8a'.decode('utf-8') == u'\U0001f60a'. With the default $OutputEncoding (ascii) we would get b'????\r\n' instead.
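Here's the same pipeline without configuring $OutputEncoding, as a sketch of the failure mode (the expected output follows from the default ascii encoding described above):
#!/usr/bin/env python3
# A sketch: with the default (ascii) $OutputEncoding, the non-ascii UTF-8
# bytes are replaced with b'?' when PowerShell re-encodes them in the middle
# of the pipeline.
import subprocess

cmd = r'''
py -3 -c "import os; os.write(1, '\U0001f60a'.encode('utf-8')+b'\n')" |
  py -3 -c "import os; print(os.read(0, 512))"
'''
subprocess.check_call(["powershell", "-C", cmd])  # expected: b'????\r\n'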
Note:
- b'\n' is replaced with b'\r\n' despite using binary APIs such as os.read/os.write (msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY) has no effect here)
- b'\r\n' is appended if there is no newline in the output:
#!/usr/bin/env python3
from subprocess import check_output
cmd = '''py -3 -c "print('no newline in the input', end='')"'''
cat = '''py -3 -c "import os; os.write(1, os.read(0, 512))"''' # pass as is
piped = check_output(['powershell', '-C', '{cmd} | {cat}'.format(**vars())])
no_pipe = check_output(['powershell', '-C', '{cmd}'.format(**vars())])
print('piped: {piped}\nno pipe: {no_pipe}'.format(**vars()))
Output
piped: b'no newline in the input\r\n'
no pipe: b'no newline in the input'
The newline is appended to the piped output.
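If the appended newline matters, one workaround is to strip a single trailing b'\r\n' (a heuristic sketch; it cannot tell an original trailing newline apart from the appended one):
def strip_appended_newline(data):
    # remove one trailing b'\r\n' that powershell may have appended
    return data[:-2] if data.endswith(b'\r\n') else data

print(strip_appended_newline(b'no newline in the input\r\n'))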
If we ignore lone surrogates then setting UTF8Encoding allows passing all Unicode characters, including non-BMP characters, via the pipes. Text mode could be used in Python if $env:PYTHONIOENCODING = "utf-8:ignore" is configured.
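For example, a sketch of the text-mode variant (assuming Python 3.6+, where check_output() accepts an encoding parameter; 'utf-8-sig' tolerates a possible BOM):
#!/usr/bin/env python3
# A sketch, assuming Python 3.6+: the child Python prints in text mode thanks
# to PYTHONIOENCODING, and the parent reads the pipe as UTF-8 text directly.
import subprocess

cmd = r'''
$env:PYTHONIOENCODING = "utf-8:ignore"
$OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
py -3 -c "print('\U0001f60a')"
'''
out = subprocess.check_output(["powershell", "-C", cmd], encoding="utf-8-sig")
print(ascii(out))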
In interactive powershell, running Get-NetAdapter | select Name | fl displayed the name correctly, even its non-cp437 character.
If stdout is not redirected then the Unicode API is used to print characters to the console -- any [BMP] Unicode character can be displayed if the console (TrueType) font supports it.
When I called powershell from python, non-ascii characters were converted to the closest ascii characters (e.g., ā to a, ž to z) and .decode('ascii') worked nicely.
It might be due to the System.Text.InternalDecoderBestFitFallback set for [Console]::OutputEncoding -- if a Unicode character can't be encoded in the given encoding then it is passed to the fallback (either a best fit character or '?' is used instead of the original character).
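The best fit mapping can be observed directly with WideCharToMultiByte; a Windows-only sketch (best_fit_encode is a hypothetical helper written for this demonstration):
#!/usr/bin/env python3
# A Windows-only sketch: WideCharToMultiByte with default flags applies the
# "best fit" mapping, so U+0101 (ā) comes back as b'a' rather than b'?'.
import ctypes

def best_fit_encode(text, codepage=1252):  # hypothetical helper
    kernel32 = ctypes.windll.kernel32
    # first call computes the required buffer size
    n = kernel32.WideCharToMultiByte(codepage, 0, text, len(text),
                                     None, 0, None, None)
    buf = ctypes.create_string_buffer(n)
    kernel32.WideCharToMultiByte(codepage, 0, text, len(text),
                                 buf, n, None, None)
    return buf.raw

print(best_fit_encode('\u0101'))  # expected: b'a'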
Could this behavior (and, correspondingly, the solution) be Windows version dependent? I am on Windows 10, but users could be on older Windows, down to Windows 7.
If we ignore bugs in cp65001 and the list of new encodings supported in later versions then the behavior should be the same.
Comments:
[…] powershell output as Unicode text, then you should put it into the title (I don't know what "default Windows display language encoding" is supposed to be). Check whether powershell accepts an explicit parameter to specify its stdout encoding ($OutputEncoding). Unrelated: use a string on Windows to pass a command, i.e., use 'a | b | c' instead of ['a', '|', 'b', '|', 'c']. – Menorca
Is it 'mbcs' (Windows encoding)? Is it the encoding from chcp output (Windows "ANSI" encoding)? Is it a leak of Unicode API abstractions (UCS-2 or UTF-16le w/o BOM)? The question "how to get the powershell stdout for a given command that might contain arbitrary Unicode characters?" is different from «what is "default Windows display language encoding"?». – Menorca
[…] WriteConsoleW(). It is a Unicode API and therefore it works whatever chcp returns. The redirected (to a pipe) stdout does not use this API (the Popen(cmd, stdout=PIPE) case). Python 3 uses the locale.getpreferredencoding(False) encoding in this case (something like cp1252 -- the ANSI code page, 'mbcs' equivalent) while some command-line applications may use the OEM code page (e.g., cp437 from chcp) here. – Menorca
locale.getpreferredencoding(False) returns cp1252 for me. And still .decode('ascii') works fine on my machine with non-cp1252 characters in adapter names, as in the UPDATE part above. – Pastoralist
(1) universal_newlines=True enables text mode (yes, it is not an intuitive spelling). (2) Both cp437 and cp1252 are compatible with the ascii encoding for ascii characters (a working .decode('ascii', 'strict') says that all bytes in stdout are in the ascii range; it can't differentiate between cp437 and cp1252). – Menorca
[…] use $OutputEncoding = New-Object -typename System.Text.UTF8Encoding (in powershell) and .decode('utf-8') (in Python) instead. – Menorca
What do you get for print(check_output(['powershell', 'echo É']))? (I'm not sure how to write 'echo É' in PowerShell.) If you see b'\x90' in the output then the encoding is cp437. If you see b'\xc9' then the encoding is cp1252. Btw., you could use for line in io.TextIOWrapper(process.stdout, encoding='utf-8'): if you don't want to call .decode('utf-8'). – Menorca
The default $OutputEncoding is ascii and therefore the above command probably produces b'E' (if something strips the non-ascii parts), i.e., if you want to get non-ascii characters then you should set $OutputEncoding correspondingly (utf-8 is a good candidate). – Menorca
ctypes.windll.kernel32.SetConsoleOutputCP(1252); p = subprocess.Popen('powershell echo $([char]0xc9)', stdout=subprocess.PIPE); p.stdout.read(). Weirdly, if I pass creationflags=DETACHED_PROCESS, so that powershell.exe doesn't attach to a console, the silly thing doesn't even have a sensible default of the ANSI codepage. It outputs nothing at all. – Ulu