PowerShell string variable with UTF-8 encoding

I checked many related questions about this, but I couldn't find something that solves my problem. Basically, I want to store a UTF-8 encoded string in a variable and then use that string as a file name.

For example, I'm trying to download a YouTube video. If we print the video title, the non-English characters show up (ytd here is youtube-dl):

./ytd https://www.youtube.com/watch?v=GWYndKw_zbw -e

Output: [LEEPLAY] 시티팝 입문 City Pop MIX (Playlist)

But if I store this in a variable and print it, the Korean characters are ignored:

$vtitle = ./ytd https://www.youtube.com/watch?v=GWYndKw_zbw -e

$vtitle

Output: [LEEPLAY] City Pop MIX (Playlist)

Haihaida answered 17/10, 2019 at 17:43 Comment(5)
Have you tried it in the ISE? It might be working, but the console can't display it. (or OS X) - Somniloquy
@js2010: There is no display problem here - as stated, the characters display fine in the console when printed directly. The real problem is a character-encoding mismatch when PowerShell reads the output into .NET strings. That problem would arise in the ISE as well. Speaking of: Have I ever mentioned that the ISE is obsolescent and should be avoided going forward (bottom section)? - Walkling
@Walkling My experience copying and pasting the string, storing it in a variable, and creating a file all worked better in the ISE. The ISE is hardly obsolete, since it comes with the current version of Windows 10 and PowerShell 5.1. Btw, I tried downloading ytd and Cisco AMP quarantined it as a virus. It could be a false positive, but I would be cautious. - Somniloquy
@js2010: The ISE never was a substitute for the regular console window, and while it provided a great editing experience, its differing behavior always caused headaches. I said obsolescent (not obsolete), because (a) it will receive no future development effort except for critical fixes, and (b) it doesn't support PowerShell Core - and we're on the brink of PowerShell Core v7, meant to supersede Windows PowerShell. - Walkling
@Somniloquy ytd here refers to youtube-dl, which is an open-source program and is not a virus. It shouldn't be confused with the other program called "YTD Downloader", which is ad-filled garbage. - Kirwin

For a comprehensive overview of how PowerShell interacts with external programs, which includes sending data to them, see this answer.

When PowerShell interprets output from external programs (such as ytd in your case), it assumes that the output uses the character encoding reflected in [Console]::OutputEncoding.

Note:

  • Interpreting refers to cases where PowerShell captures (e.g., $output = ...), relays (e.g., ... | Select-String ...), or redirects (e.g., ... > output.txt) the external program's output.

  • By contrast, output printed directly to the display may be unaffected, because PowerShell isn't involved then, and certain CLIs detect that their stdout isn't redirected and write straight to the console with full Unicode support (which explains why the characters looked as expected in your console when ytd's output printed directly to it).

If the encoding reported by [Console]::OutputEncoding is not the same encoding used by the external program at hand, PowerShell misinterprets the output.
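The mismatch is ordinary byte decoding and can be reproduced outside PowerShell. The following Python sketch is only an illustration of the principle, not what PowerShell runs internally. It assumes the external program writes UTF-8 and that the reader assumes OEM code page 437; decode_as is a hypothetical helper introduced for the example.

```python
# Illustration only: the same bytes decoded with two different encodings.
title = "[LEEPLAY] 시티팝 입문 City Pop MIX (Playlist)"

# Assumption: the external program writes its stdout as UTF-8.
raw = title.encode("utf-8")


def decode_as(data: bytes, encoding: str) -> str:
    """Hypothetical helper: decode captured bytes as a reader assuming `encoding` would."""
    return data.decode(encoding, errors="replace")


wrong = decode_as(raw, "cp437")  # reader assumes OEM code page 437
right = decode_as(raw, "utf-8")  # reader assumes UTF-8, matching the writer

print(right == title)  # compares the matching decode with the original title
print(wrong == title)  # compares the mismatched decode with the original title
```

The ASCII part of the title survives either way, because code page 437 and UTF-8 agree on that range; only the multi-byte Korean characters are damaged by the mismatched decode.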

To fix that, you must (temporarily) set [Console]::OutputEncoding to match the encoding used by the external program.

For instance, let's assume an executable foo.exe that outputs UTF-8-encoded text:

# Save the current encoding and switch to UTF-8.
$prev = [Console]::OutputEncoding
[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()

# PowerShell now interprets foo's output correctly as UTF-8-encoded,
# and $output will correctly contain CJK characters.
$output = foo https://example.org -e

# Restore the previous encoding.
[Console]::OutputEncoding = $prev

Important:

  • [Console]::OutputEncoding by default reflects the encoding associated with the legacy system locale's OEM code page, as reported by chcp (e.g. 437 on US-English systems).

    • Recent versions of Windows 10 now allow setting the system locale to code page 65001 (UTF-8) (the feature is still in beta as of Windows 10 version 1909), which is great, considering that most modern command-line utilities "speak" UTF-8 - but note that making this system-wide change has far-reaching consequences - see this answer.

With the specific program at hand, youtube-dl, js2010 has discovered that capturing in a variable works without extra effort if you pass --encoding utf-16.

The reason this works is that the resulting UTF-16LE-encoded output is preceded by a BOM (Byte-Order Mark).

(Note that --encoding utf-8 does not work, because youtube-dl then does not emit a BOM.)
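youtube-dl is written in Python, so the difference can be seen with Python's standard codecs alone. This sketch assumes the --encoding value is handed to Python's codec machinery unchanged, which has not been verified against youtube-dl's source; it only shows which codec names prepend a BOM when encoding.

```python
import codecs

sample = "시티팝"

utf16 = sample.encode("utf-16")        # native byte order, BOM prepended
utf8 = sample.encode("utf-8")          # plain UTF-8, no BOM
utf8_sig = sample.encode("utf-8-sig")  # UTF-8 with a BOM prepended

print(utf16.startswith(codecs.BOM_UTF16))   # does the utf-16 output start with a BOM?
print(utf8.startswith(codecs.BOM_UTF8))     # does the utf-8 output start with a BOM?
print(utf8_sig.startswith(codecs.BOM_UTF8)) # does the utf-8-sig output start with a BOM?
```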

Windows PowerShell is capable of detecting and properly decoding UTF-16LE-encoded and UTF-8-encoded text irrespective of the effective [Console]::OutputEncoding IF AND ONLY IF the output is preceded by a BOM.

Caveats:

  • This does not work in PowerShell Core (v6+, on any of the supported platforms).

  • Even in Windows PowerShell you'll rarely be able to take advantage of this obscure behavior, because using a BOM in stdout output is atypical (it is typically only used in files).
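The BOM-dependent behavior can be pictured as a reader that checks the first bytes before choosing a decoder. The Python sketch below shows that general idea only; it is not Windows PowerShell's actual implementation, and decode_output and its fallback parameter are hypothetical names introduced for the example.

```python
import codecs


def decode_output(data: bytes, fallback: str) -> str:
    """Hypothetical helper: honor a leading BOM, else use the assumed console encoding."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No BOM: the reader has nothing to go on but its configured encoding.
    return data.decode(fallback, errors="replace")


title = "赤裸ライアー"
with_bom = codecs.BOM_UTF16_LE + title.encode("utf-16-le")
without_bom = title.encode("utf-8")

# The fallback stands in for a console encoding that does not match the writer.
print(decode_output(with_bom, "cp437") == title)
print(decode_output(without_bom, "cp437") == title)
```

With a BOM present, the fallback encoding never comes into play; without one, the result depends entirely on whether the fallback happens to match what the program wrote.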

Walkling answered 17/10, 2019 at 18:27 Comment(1)
I had to add --encoding utf-8 as an argument to the program to get this to work, so I'm guessing youtube-dl doesn't output in UTF-8 by default. - Kirwin

This works for me in the ISE. youtube-dl is from ytdl-org.github.io. Actually, the ISE isn't needed for this, but the filename will only display correctly in something like Explorer.

# get title
# utf-16 output includes a BOM; utf-8-sig is an alternative (youtube-dl is Python-based)
$a = .\youtube-dl -e https://www.youtube.com/watch?v=Qpy7N4oFQUQ --encoding utf-16
$a
Gacharic Spin - 赤裸ライアー教則映像(short ver.)TOMO-ZO編

You might have similar luck in VS Code (or on macOS/Linux).

Somniloquy answered 18/10, 2019 at 13:30 Comment(1)
+1 for a simpler solution, but it's important to note that while it works with the particular program at hand, youtube-dl, it will rarely work in other cases: (a) it fundamentally works in Windows PowerShell only, not in PowerShell Core; and (b) it is atypical for programs to emit a BOM when writing to stdout, which the technique depends on. It doesn't work with --encoding utf-8, because youtube-dl then doesn't emit a BOM. - Walkling
