Displaying Unicode in PowerShell

What I'm trying to achieve should be rather straightforward although PowerShell is trying to make it hard.

I want to display the full path of files, some with Arabic, Chinese, Japanese and Russian characters in their names.

I always get some undecipherable output, such as the one shown below:

[Screenshot: console output showing ? characters in place of the non-ASCII file-name characters]

The output seen in the console is consumed as-is by another script; it contains ? instead of the actual characters.

The command executed is

(Get-ChildItem -Recurse -Path "D:\test" -Include *unicode* | Get-ChildItem -Recurse).FullName

Is there an easy way to launch PowerShell (via the command line or in a fashion that can be written into a script) such that the output is seen correctly?

P.S. I've gone through many similar questions on Stack Overflow, but none of them have much input other than calling it a Windows Console Subsystem issue.

Bogusz answered 25/3, 2018 at 13:21 Comment(3)
What font are you using in your PowerShell console? Are you certain that it has the problem languages in it? – Lloyd
https://mcmap.net/q/15590/-monospace-unicode-font-closed may be relevant. – Lloyd
How to add additional fonts to the Windows console – Litharge

Note:

  • On Windows, with respect to rendering Unicode characters, it is primarily the choice of font / console (terminal) application that matters.

    • Nowadays, using Windows Terminal, which has been distributed and updated via the Microsoft Store since Windows 10, is a good replacement for the legacy console host (console windows provided by conhost.exe), providing superior Unicode character support. In Windows 11 22H2, Windows Terminal even became the default console (terminal).
  • With respect to programmatically processing Unicode characters when communicating with external programs, $OutputEncoding, [Console]::InputEncoding and [Console]::OutputEncoding matter too - see below.


The PowerShell (Core) 7+ perspective on communicating with external programs, irrespective of character-rendering issues (for Windows PowerShell, and for rendering issues, see the next section):

  • On Unix-like platforms, PowerShell Core uses UTF-8 by default.

  • On Windows, it is the legacy system locale, via its OEM code page, that determines the default encoding in all consoles, including both Windows PowerShell and PowerShell Core console windows, though recent versions of Windows 10 now allow setting the system locale to code page 65001 (UTF-8); note that the feature is still in beta as of this writing, and using it has far-reaching consequences - see this answer.

    • If you do use that feature, PowerShell Core console windows will then automatically be UTF-8-aware, though in Windows PowerShell you'll still have to set $OutputEncoding to UTF-8 too (which in Core already defaults to UTF-8), as shown below.

    • Otherwise - notably on older Windows versions - you can use the same approach as detailed below for Windows PowerShell.
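
To check which encodings your session is currently using, you can inspect the relevant settings directly; a minimal sketch that works in both editions:

$PSVersionTable.PSVersion                # PowerShell version/edition
$OutputEncoding.EncodingName             # encoding used when piping TO external programs
[Console]::InputEncoding.EncodingName    # encoding assumed for console (stdin) input
[Console]::OutputEncoding.EncodingName   # encoding assumed for external-program output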


Making your Windows PowerShell console window Unicode (UTF-8) aware:

  • Pick a TrueType (TT) font that supports the specific scripts (writing systems, alphabets) whose characters you want to display properly in the console:

    • Important: While all TrueType fonts support Unicode in principle, they usually only support a subset of all Unicode characters, namely those corresponding to specific scripts (writing systems), such as the Latin script, the Cyrillic (Russian) script, ...
      In your particular case - if you must support Arabic as well as Chinese, Japanese and Russian characters - your only choice is SimSun-ExtB, which is available on Windows 10 only.
      See Wikipedia for a list of what Windows fonts target what scripts (alphabets).

    • To change the font, click on the icon in the top-left corner of the window and select Properties, then change to the Fonts tab and select the TrueType font of interest.

  • Additionally, for proper communication with external programs:

    • The console window's code page must be switched to 65001, the UTF-8 code page. This is usually done with chcp 65001, which, however, cannot be used directly from within a PowerShell session[1]; the PowerShell command below has the same effect.

    • Windows PowerShell must be instructed to use UTF-8 to communicate with external utilities too: when sending pipeline input to external programs, the encoding in its $OutputEncoding preference variable is used (when decoding output from external programs, it is the encoding stored in [Console]::OutputEncoding that is applied).

The following magic incantation in Windows PowerShell does this (as stated, this implicitly performs chcp 65001):

$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding =
                    New-Object System.Text.UTF8Encoding

To persist these settings, i.e., to make your future interactive PowerShell sessions UTF-8-aware by default, add the command above to your $PROFILE file.
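
For instance, a minimal sketch of persisting the setting (this appends to the current host's profile file, creating it first if necessary):

# Create the profile file if it doesn't exist yet.
if (-not (Test-Path $PROFILE)) { New-Item -ItemType File -Path $PROFILE -Force | Out-Null }
# Append the UTF-8 incantation shown above.
Add-Content -Path $PROFILE -Value '$OutputEncoding = [Console]::InputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding'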

Note: Recent versions of Windows 10 now allow setting the system locale to code page 65001 (UTF-8) (the feature is still in beta as of Windows 10 version 1903), which makes all console windows default to UTF-8, including Windows PowerShell's.
If you do use that feature, setting [Console]::InputEncoding / [Console]::OutputEncoding is then no longer strictly necessary, but you'll still have to set $OutputEncoding (which is not necessary in PowerShell Core, where $OutputEncoding already defaults to UTF-8).

Important:

  • These settings assume that any external utilities you communicate with expect UTF-8-encoded input and produce UTF-8 output.

    • CLIs written in Node.js fulfill that criterion, for instance.
    • Python scripts - if written with UTF-8 support in mind - can handle UTF-8 too (see this answer).
  • By contrast, these settings can break (older) utilities that only expect a single-byte encoding as implied by the system's legacy OEM code page.

    • Up to Windows 8.1, this even included standard Windows utilities such as find.exe and findstr.exe, which have been fixed in Windows 10.
    • See the bottom of this post for how to bypass this problem by switching to UTF-8 temporarily, on demand for invoking a given utility.
  • These settings apply to external programs only and are unrelated to the encodings that PowerShell's cmdlets use on output:

    • See this answer for the default character encodings used by PowerShell cmdlets. In short: if you want cmdlets in Windows PowerShell to default to UTF-8 (which PowerShell [Core] v6+ does anyway), add $PSDefaultParameterValues['*:Encoding'] = 'utf8' to your $PROFILE, but note that this will affect all calls to cmdlets with an -Encoding parameter in your sessions, unless that parameter is used explicitly. Also note that in Windows PowerShell you'll invariably get UTF-8 files with a BOM; conversely, in PowerShell [Core] v6+, which defaults to BOM-less UTF-8 (both in the absence of -Encoding and with -Encoding utf8), you'd have to use 'utf8BOM' to create files with a BOM.
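
As an illustration of that profile setting (a sketch; the file name is arbitrary):

# Make cmdlets with an -Encoding parameter default to UTF-8 (Windows PowerShell).
$PSDefaultParameterValues['*:Encoding'] = 'utf8'
'Motörhead' | Out-File example.txt     # now written as UTF-8 (with BOM in Windows PowerShell)
Get-Content example.txt                # -Encoding utf8 is applied implicitly, so ö round-trips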

Optional background information

Tip of the hat to eryksun for all his input.

  • While a TrueType font is active, the console-window buffer correctly preserves (non-ASCII) Unicode characters even if they don't render correctly; that is, even though they may appear generically as ?, so as to indicate lack of support by the current font, you can copy & paste such characters elsewhere without loss of information, as eryksun notes.

  • PowerShell is capable of outputting Unicode characters to the console even without having switched to code page 65001 first.
    However, that by itself does not guarantee that other programs can handle such output correctly - see below.

  • When it comes to communicating with external programs via stdout (piping), PowerShell uses the character encoding specified in the $OutputEncoding preference variable, which defaults to ASCII(!) in Windows PowerShell, meaning that any non-ASCII characters are transliterated to literal ? characters, resulting in information loss. (By contrast, commendably, PowerShell Core (v6+) now uses (BOM-less) UTF-8 as the default encoding, consistently.)

    • By contrast, however, passing non-ASCII arguments (rather than stdout (piped) output) to external programs seems to require no special configuration (it is unclear to me why that works); e.g., the following Node.js command correctly returns €: 1 even with the default configuration:
      node -pe "process.argv[1] + ': ' + process.argv[1].length" €
  • [Console]::OutputEncoding:

    • controls what character encoding is assumed when the console translates program output into console display characters.
    • also tells PowerShell what encoding to assume when capturing output from an external program.
      The upshot is that if you need to capture output from an UTF-8-producing program, you need to set [Console]::OutputEncoding to UTF-8 as well; setting $OutputEncoding only covers the input (to the external program) aspect.
  • [Console]::InputEncoding sets the encoding for keyboard input into a console[2] and also determines how PowerShell's CLI interprets data it receives via stdin (standard input).

  • If switching the console to UTF-8 for the entire session is not an option, you can do so temporarily, for a given call:

      # Save the current settings and temporarily switch to UTF-8.
      $oldOutputEncoding = $OutputEncoding; $oldConsoleEncoding = [Console]::OutputEncoding
      $OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
    
      # Call the UTF-8 program, using Node.js as an example.
      # This should echo '€' (`U+20AC`) as-is and report the length as *1*.
      $captured = '€' | node -pe "require('fs').readFileSync(0).toString().trim()"
      $captured; $captured.Length
    
      # Restore the previous settings.
      $OutputEncoding = $oldOutputEncoding; [Console]::OutputEncoding = $oldConsoleEncoding
    
  • Problems on older versions of Windows (pre-W10):

    • On older versions of Windows, an active chcp value of 65001 could break the console output of some external programs, and even batch files in general. This may ultimately have stemmed from a bug in the WriteFile() Windows API function (as also used by the standard C library), which mistakenly reported the number of characters rather than bytes with code page 65001 in effect, as discussed in this blog post.
  • The resulting symptoms, according to a comment by bobince on this answer from 2008, are: "My understanding is that calls that return a number-of-bytes (such as fread/fwrite/etc) actually return a number-of-characters. This causes a wide variety of symptoms, such as incomplete input-reading, hangs in fflush, the broken batch files and so on."
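
To see the Windows PowerShell default's lossiness for yourself, here is a small demonstration (assuming a default session, without the UTF-8 setup above, and Node.js installed, reusing the node idiom from above):

$OutputEncoding.EncodingName   # -> US-ASCII in Windows PowerShell
'€' | node -pe "require('fs').readFileSync(0).toString().trim()"   # -> '?': the € was lost in transit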


Superior alternatives to the native Windows console (terminal), conhost.exe

eryksun suggests two alternatives to the native Windows console windows (conhost.exe), which provide better and faster Unicode character rendering, due to using the modern, GPU-accelerated DirectWrite/DirectX API instead of the "old GDI implementation [that] cannot handle complex scripts, non-BMP characters, or automatic fallback fonts."

  • Microsoft's own, open-source Windows Terminal, which has been distributed and updated via the Microsoft Store since Windows 10 - see here for an introduction.

  • Long-established third-party alternative ConEmu, which has the advantage of working on older Windows versions too.


[1] Note that running chcp 65001 from inside a PowerShell session is not effective, because .NET caches the console's output encoding on startup and is unaware of later changes made with chcp (only changes made directly via [Console]::OutputEncoding are picked up).

[2] I am unclear on how that manifests in practice; do tell us, if you know.

Distich answered 25/3, 2018 at 22:49 Comment(15)
Example: Vim is multilang, but it doesn't show language-specific chars properly. I've put 'set encoding=utf-8' in _vimrc to fix that inside the Vim editor, but e.g. vim --help still doesn't load those chars properly. By adding '$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding' to my PowerShell profile I only made those chars disappear completely from the output. Any way to make it work? I've changed the encoding of my PS profile to UTF-8 because it was UTF-16. – Delegacy
@Sharak: That's odd, and I have no explanation (I don't use Vim). Note that UTF-16 should be fine for your profile, just like UTF-8 - both need a BOM, however (except in PowerShell Core, where UTF-8 doesn't need a BOM). – Distich
I use the Win 10 built-in PowerShell 5.1, so it's not Core, and everything seems to work fine after saving 'c:\Users\<username>\Documents\WindowsPowerShell\Microsoft.PowerShell_profile.ps1' with encoding UTF-8 without BOM. Why exactly does it need to be with a BOM? – Delegacy
@Sharak: The BOM only matters if you have non-ASCII characters in your profile. If you do, and there's no BOM in your UTF-8 file, Windows PowerShell reads that file as an "ANSI" file, i.e., misinterprets it. – Distich
I did some testing, and here's the weird thing: setting UTF-8 also changes WindowsCodePage to 65001, but to view language-specific characters I need specifically 1250. Only when I use 'chcp 1250' do I get all characters in the 'vim -help' output, but then I lose the ability to see proper characters in the prompt. Also, I can't even type any of those characters like ó, ł, ż or € into the prompt. What's weird, though, is that using CMD and chcp 1250 I get everything right: language-specific chars in output, tabbing through files like ółżńść€.txt, and I can even type all those chars into the prompt. Why can't PS behave the same? – Delegacy
@Sharak: Yes, as stated in the answer, the code page is set to 65001, the UTF-8 code page. To support code page 1250 only, use $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1250). By contrast, UTF-8 is a global "alphabet", and therein lies its advantage. If your Vim version doesn't speak that global alphabet, using a legacy code page such as 1250 that is limited to 256 characters is your only option. Over time, such legacy programs will go away. – Distich
But why does CMD cover everything here by simply using 'chcp 1250', while in PowerShell you have to choose proper encoding in either the output or the input? – Delegacy
@Sharak: chcp was built for cmd.exe, which was the only shell for many years before PowerShell came along. Running chcp from inside PowerShell doesn't work reliably, and in any event you still need to set $OutputEncoding. You could argue that Windows PowerShell should provide a command analogous to chcp, but (for reasons unknown to me) one was never introduced. Note that PowerShell Core speaks UTF-8 natively, so with modern (non-legacy) programs it works globally, and such a command is no longer needed going forward (though having one to support legacy programs wouldn't hurt). – Distich
This didn't work for me with this flat-sign character: '♭' # (U+266D) – Ocreate
@js2010: As stated in the answer, not all fonts support all characters. To get your particular example to work, choose one of the following: MS Gothic, NSimSun, SimSun-ExtB. – Distich
You're right, although the backslash looks like a deer in MS Gothic. Strange how Consolas and Lucida Console work fine in the ISE. – Ocreate
@Ocreate Curious, indeed - no idea why. – Distich
$OutputEncoding is set to something strange starting with 'sbc' in the ISE. When I try to do it in the console, it says it's missing some assembly. – Ocreate
I was trying to do New-Object System.Text.SBCSCodePageEncoding, but it says the type doesn't exist. shrug That seems to be what the ISE and console are using, except for $OutputEncoding in the console. – Ocreate
Thanks, very useful. I resolved it by adding this at the start of my PS script: $OutputEncoding = [System.Text.Encoding]::UTF8 – Coexist

This elaborates on Alexander Martin's answer. For testing purposes, I have created some folders and files with valid names from different Unicode subranges, as follows:

[Screenshot: test folders and files with valid Unicode names]

For instance, with the Courier New console font, replacement symbols are displayed instead of the CJK characters in a PowerShell console:

[Screenshot: PowerShell console using the Courier New font]

On the other hand, with the SimSun console font, (poorly visible) replacement symbols are displayed instead of the Arabic and Hebrew characters, while the CJK characters seem to be displayed correctly:

[Screenshot: PowerShell console using the SimSun font]

Please note that all replacement symbols are merely displayed, whereas the real characters are preserved, as you can see in the following copy & paste from the above PowerShell console:

(Get-ChildItem 'D:\bat\UnASCII Names\' -Dir).Name

Output:

Arabic (عَرَبِيّ‎)
CJK (中文(繁體))
Czech (Čeština)
Greek (Γρεεκ)
Hebrew (עִבְרִית)
Japanese (日本語)
MathBoldScript (𝓜𝓪𝓽𝓱𝓑𝓸𝓵𝓭𝓢𝓬𝓻𝓲𝓹𝓽)
Russian (русский язык)
Türkçe (Türkiye)
‹angles›
☺☻♥♦

For the sake of completeness, here are appropriate registry values to Enable More Fonts for the Windows Command Prompt (this works for the Windows PowerShell console as well):

(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont' |
    Select-Object -Property [0-9]* | Out-String).Split(
        [System.Environment]::NewLine,
        [System.StringSplitOptions]::RemoveEmptyEntries) |
     Sort-Object

Sample output:

0       : Consolas
00      : Source Code Pro
000     : DejaVu Sans Mono
0000    : Courier New
00000   : Simplified Arabic Fixed
000000  : Unifont
0000000 : Lucida Console
932     : *MS ゴシック
936     : *新宋体
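
Conversely, to register an additional console font yourself, add a string value under that key whose name consists of zeros only, one digit longer than the longest existing one (a hedged sketch: 'DejaVu Sans Mono' stands in for whatever TrueType font you have installed; run elevated):

# Requires an elevated (administrator) session; the value name '00000000' assumes
# the longest existing all-zeros name is '0000000', as in the sample output above.
New-ItemProperty -Path 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont' `
    -Name '00000000' -Value 'DejaVu Sans Mono' -PropertyType String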
Citizen answered 28/3, 2018 at 12:25 Comment(2)
How do I make NSimSun the default, so I don't need to change the font every time PS starts up? – Birdcage
@Birdcage: Click the PowerShell icon in the top-left corner of the window and set a font under Defaults instead of Properties, or set it under Properties if you right-click the PowerShell icon (shortcut). – Citizen

If you install Microsoft's "Windows Terminal" from the Microsoft Store (or the Preview version), it comes pre-configured for full Unicode localization.

Windows Terminal Preview with snowman ⛄, Arabic (عَرَبِيّ‎), CJK (中文(繁體)), Czech (Čeština), Greek (Γρεεκ), Hebrew (עִבְרִית), Japanese (日本語), MathBoldScript (𝓜𝓪𝓽𝓱𝓑𝓸𝓵𝓭𝓢𝓬𝓻𝓲𝓹𝓽), Russian (русский язык), Türkçe (Türkiye), ‹angles›, ☺☻♥♦

You still can't enter commands with special characters... unless you use WSL! 😍

Using WSL, we are able to run echo "snowman ⛄"
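
For instance, from a PowerShell tab in Windows Terminal you can invoke the default WSL distribution directly (a sketch that assumes WSL is installed):

wsl.exe echo "snowman ⛄"   # the command runs in Linux, which is natively UTF-8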

Tabard answered 11/3, 2021 at 19:3 Comment(1)
I ended up using this terminal. There are some other alternatives, but this one, with tab support and proper Unicode, is just what I need. – Maxey

The PowerShell ISE is an option for displaying foreign characters: korean.txt is a UTF-8-encoded file:

cd C:\Users\js
Get-Content korean.txt

Output:

The Korean language (South Korean: 한국어/韓國語 Hangugeo; North
Korean: 조선말/朝鮮말 Chosŏnmal) is an East Asian language
spoken by about 77 million people.[3]
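
One caveat: in Windows PowerShell, a BOM-less UTF-8 file would be misinterpreted as "ANSI" by default, so it is safer to name the encoding explicitly:

Get-Content korean.txt -Encoding UTF8   # forces UTF-8 interpretation, with or without a BOM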
Ocreate answered 23/7, 2019 at 3:12 Comment(1)
To recap why use of the ISE is a bad idea (see the bottom section of this answer for details): (a) it's obsolescent and doesn't support PowerShell Core; (b) it's a development environment, not meant for end users running scripts in production; (c) it doesn't support interactive console applications. – Distich

I was facing a similar challenge working with Amazon Translate. I installed the terminal from the Windows Store, and it works for me now!

Callender answered 10/3, 2021 at 2:29 Comment(0)

In normal PowerShell, all characters are displayed in the configured font. That’s why, e.g., Chinese or Cyrillic characters are broken with "Lucida Console" and many other fonts.

For Chinese characters, PowerShell ISE changes the font automatically to "DengXian".

You can find out which alternative font is used for your special character by copying them to Word or a similar program which is capable of displaying different fonts.

Signalize answered 22/6, 2021 at 14:41 Comment(0)

Make sure you have a font containing all the problematic characters installed and set as your Win32 Console font. If I remember right, click the PowerShell icon in the top-left corner of the window and pick Properties. The resulting popup dialog should have an option to set the font used. It might have to be a bitmap (.FON or .FNT) font.

Suppository answered 25/3, 2018 at 17:28 Comment(0)
