Changing PowerShell's default output encoding to UTF-8
By default, when you redirect the output of a command to a file or pipe it into something else in PowerShell, the encoding is UTF-16, which isn't useful. I'm looking to change it to UTF-8.

It can be done on a case-by-case basis by replacing the >foo.txt syntax with | out-file foo.txt -encoding utf8, but this is awkward to have to repeat every time.

The persistent way to set things in PowerShell is to put them in \Users\me\Documents\WindowsPowerShell\profile.ps1; I've verified that this file is indeed executed on startup.

It has been said that the output encoding can be set with $PSDefaultParameterValues = @{'Out-File:Encoding' = 'utf8'} but I've tried this and it had no effect.

The blog post at https://blogs.msdn.microsoft.com/powershell/2006/12/11/outputencoding-to-the-rescue/, which discusses $OutputEncoding, looks at first glance as though it should be relevant, but it then talks about output being encoded in ASCII, which is not what's actually happening.

How do you set PowerShell to use UTF-8?

Colloquy answered 18/10, 2016 at 2:54 Comment(1)
Looks like Windows 11 defaults to UTF-8 for [Console]::OutputEncoding. - Lanford

Note:

  • The next section applies primarily to Windows PowerShell.

  • In both cases, the information applies to making PowerShell use UTF-8 for reading and writing files.

    • By contrast, for information on how to send and receive UTF-8-encoded strings to and from external programs, see this answer.
  • A system-wide switch to UTF-8 is possible nowadays (since recent versions of Windows 10): see this answer, but note the following caveats:

    • The feature has far-reaching consequences, because both the OEM and the ANSI code page are then set to 65001, i.e. UTF-8; also, the feature is still considered a beta feature as of this writing (Windows 11 22H2). A sketch for checking your system's current code pages follows this list.
    • In Windows PowerShell, this takes effect only for those cmdlets that default to the ANSI code page, notably Set-Content, but not Out-File / >, and it also applies to reading files, notably including Get-Content and how PowerShell itself reads source code.
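To check whether that system-wide switch is active on a given machine, you can query the registry; a minimal sketch (it assumes the standard NLS registry layout) - both values read 65001 when the UTF-8 feature is on:

# Query the active ANSI (ACP) and OEM (OEMCP) code pages from the registry.
$nls = Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage'
"ANSI code page: $($nls.ACP)"   # 65001 means UTF-8
"OEM code page:  $($nls.OEMCP)" # 65001 means UTF-8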

The Windows PowerShell perspective:

  • In PSv5.1 or higher, where > and >> are effectively aliases of Out-File, you can set the default encoding for > / >> / Out-File via the $PSDefaultParameterValues preference variable:

    • $PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
    • Note:
      • In Windows PowerShell (the legacy edition whose latest and final version is v5.1), this invariably creates UTF-8 files with a (pseudo-)BOM.

        • Many Unix-based utilities do not recognize this BOM (see bottom); see this post for workarounds that create BOM-less UTF-8 files.
      • In PowerShell (Core) v6+, BOM-less UTF-8 is the default (see next section), but if you do want a BOM there, you can use 'utf8BOM'.

  • In PSv5.0 or below, you cannot change the encoding for > / >>, but, on PSv3 or higher, the above technique does work for explicit calls to Out-File.
    (The $PSDefaultParameterValues preference variable was introduced in PSv3.0).

  • In PSv3.0 or higher, if you want to set the default encoding for all cmdlets that support an -Encoding parameter (which in PSv5.1+ includes > and >>), use:

    • $PSDefaultParameterValues['*:Encoding'] = 'utf8'

If you place this command in your $PROFILE, cmdlets such as Out-File and Set-Content will use UTF-8 encoding by default, but note that this makes it a session-global setting that will affect all commands / scripts that do not explicitly specify an encoding via their -Encoding parameter.
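For instance, a profile might contain the following - a sketch only; pick whichever of the two scopes you want:

# In $PROFILE - applies to the entire session.
# Narrow: only Out-File (and therefore > / >> in PSv5.1+):
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
# Broad: every cmdlet that has an -Encoding parameter:
$PSDefaultParameterValues['*:Encoding'] = 'utf8'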

Similarly, include such commands in any scripts or modules that you want to behave the same way, so that they behave consistently even when run by another user or on a different machine; however, to avoid a session-global change, use the following form to create a local copy of $PSDefaultParameterValues (a short usage sketch follows the command):

  • $PSDefaultParameterValues = @{ '*:Encoding' = 'utf8' }
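A minimal usage sketch of that script-local form:

# Assigning a new hashtable creates a *local* variable that shadows the
# caller's $PSDefaultParameterValues, so the setting is scoped to this script:
$PSDefaultParameterValues = @{ '*:Encoding' = 'utf8' }
'hello' > utf8file.txt   # in PSv5.1+, > now writes UTF-8 (with BOM)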

For a summary of the wildly inconsistent default character encoding behavior across many of the Windows PowerShell standard cmdlets, see the bottom section.


The automatic $OutputEncoding variable is unrelated, and only applies to how PowerShell communicates with external programs (what encoding PowerShell uses when sending strings to them) - it has nothing to do with the encoding that the output redirection operators and PowerShell cmdlets use to save to files.
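A quick sketch of that distinction (note that neither assignment below affects what > / Out-File write to files; the ::new() syntax requires PSv5+):

# What PowerShell sends TO external programs via the pipeline:
$OutputEncoding = [System.Text.UTF8Encoding]::new()
# How PowerShell decodes output coming FROM external programs:
[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()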


Optional reading: The cross-platform perspective: PowerShell Core:

PowerShell is now cross-platform, via its PowerShell Core edition, whose encoding - sensibly - defaults to BOM-less UTF-8, in line with Unix-like platforms.

  • This means that source-code files without a BOM are assumed to be UTF-8, and that using > / Out-File / Set-Content defaults to BOM-less UTF-8; explicitly passing -Encoding utf8 also creates BOM-less UTF-8, but you can opt to create files with the pseudo-BOM via the utf8bom value.

  • If you create PowerShell scripts with an editor on a Unix-like platform - and nowadays even on Windows, with cross-platform editors such as Visual Studio Code and Sublime Text - the resulting *.ps1 file will typically not have a UTF-8 pseudo-BOM:

    • This works fine on PowerShell Core.
    • It may break on Windows PowerShell, if the file contains non-ASCII characters; if you do need to use non-ASCII characters in your scripts, save them as UTF-8 with BOM.
      Without the BOM, Windows PowerShell (mis)interprets your script as being encoded in the legacy "ANSI" codepage (determined by the system locale for pre-Unicode applications; e.g., Windows-1252 on US-English systems).
  • Conversely, files that do have the UTF-8 pseudo-BOM can be problematic on Unix-like platforms, as they cause Unix utilities such as cat, sed, and awk - and even some editors such as gedit - to pass the pseudo-BOM through, i.e., to treat it as data.

    • This may not always be a problem, but it definitely can be - for example, when you try to read a file into a string in bash with, say, text=$(cat file) or text=$(<file), the resulting variable will contain the pseudo-BOM as its first 3 bytes (see the detection sketch below).
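A minimal sketch for detecting the pseudo-BOM from PowerShell (the file path is a placeholder); the UTF-8 BOM is the byte sequence 0xEF 0xBB 0xBF:

# Read the file's raw bytes and test whether the first three are EF BB BF.
$bytes = [System.IO.File]::ReadAllBytes('C:\path\to\file.txt')
$hasBom = $bytes.Count -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF
$hasBom   # $true if the file starts with a UTF-8 pseudo-BOM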

Inconsistent default encoding behavior in Windows PowerShell:

Regrettably, the default character encoding used in Windows PowerShell is wildly inconsistent; the cross-platform PowerShell Core edition, as discussed in the previous section, has commendably put an end to this.

Note:

  • The following doesn't aspire to cover all standard cmdlets.

  • Googling cmdlet names to find their help topics now shows you the PowerShell Core version of the topics by default; use the version drop-down list above the list of topics on the left to switch to a Windows PowerShell version.

  • Historically, the documentation frequently incorrectly claimed that ASCII is the default encoding in Windows PowerShell; fortunately, this has since been corrected.


Cmdlets that write:

Out-File and > / >> create "Unicode" - UTF-16LE - files by default, in which every character - including ASCII-range ones - is represented by 2 bytes; this notably differs from Set-Content / Add-Content (see next point). New-ModuleManifest and Export-CliXml also create UTF-16LE files.

Set-Content (and Add-Content if the file doesn't yet exist / is empty) uses ANSI encoding (the encoding specified by the active system locale's ANSI legacy code page, which PowerShell calls Default).
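A quick way to see these two defaults side by side is to write the same non-ASCII string with both cmdlets and count the resulting bytes - a sketch whose byte counts assume a Windows-1252 system locale:

$s = 'hü'
$s | Out-File    "$env:TEMP\of.txt"  # UTF-16LE: 2-byte BOM + 2 bytes per char
$s | Set-Content "$env:TEMP\sc.txt"  # ANSI: 1 byte per char, no BOM
[System.IO.File]::ReadAllBytes("$env:TEMP\of.txt").Count  # 10 (BOM + 'hü' + CRLF)
[System.IO.File]::ReadAllBytes("$env:TEMP\sc.txt").Count  # 4  ('hü' + CRLF)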

Export-Csv indeed creates ASCII files, as documented, but see the notes re -Append below.

Export-PSSession creates UTF-8 files with BOM by default.

New-Item -Type File -Value currently creates BOM-less(!) UTF-8.

The Send-MailMessage help topic also claims that ASCII encoding is the default - I have not personally verified that claim.

Start-Transcript invariably creates UTF-8 files with BOM, but see the notes re -Append below.

Re commands that append to an existing file:

>> / Out-File -Append make no attempt to match the encoding of a file's existing content. That is, they blindly apply their default encoding, unless instructed otherwise with -Encoding, which is not an option with >> (except indirectly in PSv5.1+, via $PSDefaultParameterValues, as shown above). In short: you must know the encoding of an existing file's content and append using that same encoding.
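In practice that means being explicit when appending; for example (a sketch, assuming you know the existing file is UTF-8):

# Append using the file's known existing encoding:
'new line' | Out-File -Append -Encoding utf8 log.txt
# >> itself cannot take -Encoding; in PSv5.1+ you can steer it indirectly:
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
'another line' >> log.txt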

Add-Content is the laudable exception: in the absence of an explicit -Encoding argument, it detects the existing encoding and automatically applies it to the new content (thanks, js2010). Note that in Windows PowerShell this means that ANSI encoding is applied if the existing content has no BOM, whereas it is UTF-8 in PowerShell Core.

This inconsistency between Out-File -Append / >> and Add-Content, which also affects PowerShell Core, is discussed in GitHub issue #9423.

Export-Csv -Append partially matches the existing encoding: it blindly appends UTF-8 if the existing file's encoding is any of ASCII/UTF-8/ANSI, but correctly matches UTF-16LE and UTF-16BE.
To put it differently: in the absence of a BOM, Export-Csv -Append assumes UTF-8, whereas Add-Content assumes ANSI.

Start-Transcript -Append partially matches the existing encoding: It correctly matches encodings with BOM, but defaults to potentially lossy ASCII encoding in the absence of one.


Cmdlets that read (that is, the encoding used in the absence of a BOM):

Get-Content and Import-PowerShellDataFile default to ANSI (Default), which is consistent with Set-Content.
ANSI is also what the PowerShell engine itself defaults to when it reads source code from files.

By contrast, Import-Csv, Import-CliXml and Select-String assume UTF-8 in the absence of a BOM, and so does the switch statement with its -File parameter.
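The practical upshot for Windows PowerShell: be explicit when reading BOM-less UTF-8 files; a minimal sketch:

# Without -Encoding, Windows PowerShell decodes a BOM-less file as ANSI:
Get-Content -Encoding UTF8 .\file.txt
# Import-Csv and Select-String, by contrast, already assume UTF-8 sans BOM.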

Lear answered 18/10, 2016 at 3:12 Comment(25)
Can you explain how > / >> became effective aliases for Out-File in 5.1? - Chiasmus
@TheIncorrigible1: It may have been PetSerAl who pointed it out to me, but I don't remember where and how. Windows PowerShell is closed-source, but since the same quasi-alias relationship applies to PowerShell Core too, you should be able to find it in the latter's source code. - Lear
Is there any way to force it not to prepend the BOM on Win10? - Despondency
@Mvorisek: In Windows PowerShell, you can't - you have to roll your own output function - see https://mcmap.net/q/15782/-using-powershell-to-write-a-file-in-utf-8-without-the-bom. In PowerShell Core (also on Windows), BOM-less is the default. - Lear
It's PS 6 that is utf8nobom by default. PS 5.1 is "ansi" for most commands besides out-file, which is "unicode". - Lanford
@js2010: Unfortunately, it's much more complicated than that - please see the bottom section I've just added to the answer. While you can also use the version number to imply a PowerShell edition (6 or higher implying Core), it's clearer to refer to them as Windows PowerShell (Windows-only, .NET Framework-based) and PowerShell Core (cross-platform, .NET Core-based). - Lear
I don't disagree, @EliaWeiss, but it's Windows PowerShell specifically, and they eventually did get it right in PowerShell Core. - Lear
Why does Windows need a BOM to recognize UTF-8 properly but Linux doesn't? The whole point of UTF-8, IMHO, is that you can have non-ASCII characters and not worry about it. - Lati
@Marc: Windows decided to remain backward-compatible, which means that files without a BOM are by default interpreted as using the system's active ANSI code page, i.e., typically using a single-byte, 8-bit character encoding such as Windows-1252 for English-language systems and many Western European languages. Since the 8-bit range of such encodings is incompatible with UTF-8, you need the BOM to disambiguate. For instance, a Windows-1252-encoded file that contains the character ü would be invalid when interpreted as UTF-8. - Lear
Strange, because I can save a file in Windows as UTF-8 w/o BOM, including Japanese characters, and it opens fine in Notepad, VS Code, etc. - Lati
@Marc: VS Code and other modern cross-platform editors commendably default to UTF-8, which, however, means they'll misinterpret ANSI-encoded files. Notepad uses heuristics to guess the encoding. The point is that it is only a guess, because any UTF-8-encoded file is also a technically valid ANSI-encoded file (but not vice versa). It would be great if everything on Windows defaulted to UTF-8 in the absence of a BOM the way Unix-like platforms do, but that's not the case, notably not in Windows PowerShell, though fortunately it is now the case in PowerShell Core. - Lear
@Marc: P.S.: When you create a file in Notepad from scratch, it still defaults to ANSI encoding. - Lear
Since ANSI is a subset of UTF-8, i.e. a valid ANSI file is a valid UTF-8 file, they won't "misinterpret" it exactly but treat it as something it isn't. But thanks for your clarification. I'm writing an application for "all" platforms and I'd like to choose a format which works everywhere. BOM-less UTF-8 seems the way to go. Thanks again! - Lati
@Marc: Yes, if you need to be cross-platform, BOM-less UTF-8 is generally the best choice. I see no semantic difference between misinterpreting something and treating it as something it isn't. ANSI is not a subset of UTF-8; only ASCII is. Therefore, an ANSI file (with 8-bit-range characters) interpreted as UTF-8 will typically result in those characters being considered invalid and converted to the REPLACEMENT CHARACTER (U+FFFD). - Lear
P.S., @Marc: Conversely, a UTF-8 file interpreted as ANSI will result in every character outside the ASCII range being converted to 2-4 unrelated characters. - Lear
Out-File -Append (or >>) can mix 2 encodings in the same file. I wouldn't use it - aside from it uniquely defaulting to Unicode in PS 5. - Lanford
Some more details: apparently Out-File and > and >> are meant to emulate Unix. github.com/PowerShell/PowerShell/issues/… - Lanford
Thanks, @Lanford - the answer already contains a link to your issue; my comment there wasn't meant to imply that emulating Unix was the design intent - I can't speak to that - I was only pointing out that in the Unix world too, >> blindly applies the default encoding; I do agree that Add-Content's behavior is more helpful. - Lear
To see your current value, if any, just type $PSDefaultParameterValues. - Mackenziemackerel
@Lear After looking at this answer for a related issue, I found that the chcp.com command reports 437 although everything else reports 65001 (and UTF-8). So is chcp.com a dead animal now? If not, when should/can it be used? - Knapp
@not2qubit: What chcp reports depends solely on [Console]::InputEncoding. You cannot use chcp.com from inside PowerShell, due to .NET's caching of encodings, but you can use it in cmd.exe, where it is also effective if you launch PowerShell later from there. - Lear
Is there a way for scripts in PS 5.1 to default to UTF-8 without a BOM instead of ANSI? - Lanford
@js2010, please see the third bullet point I've just added to the top section. - Lear
I just mean the default encoding the engine itself assumes when running a PowerShell script. Frequently users have utf8nobom scripts. - Lanford
@js2010, the only way to enable that, as far as I know, is to make the system-wide change described, which comes with far-reaching consequences. - Lear

In short, use:

write-output "your text" | out-file -append -encoding utf8 "filename"

You may want to put parts of the script into braces so that you can redirect the output of several commands:

{
  command 1
  command 2
} | out-file -append -encoding utf8 "filename"
Isoagglutinin answered 24/5, 2020 at 15:17 Comment(2)
To quote from the question: "It can be done on a case-by-case basis by replacing the >foo.txt syntax with | out-file foo.txt -encoding utf8 but this is awkward to have to repeat every time." In other words: you're suggesting precisely what the OP is trying to avoid. - Lear
I think -append should be removed. - Underclothing

A MySQL dump made using PowerShell output redirection on Windows creates a file with UTF-16 encoding. To work around this issue, have mysqldump write the file itself via its --result-file option:

mysqldump.exe [options] --result-file=dump.sql

Reference link: mysqldump_result-file

Pollack answered 3/10, 2022 at 9:20 Comment(0)
