Note:
The following contains general information that in a normally functioning PowerShell environment would explain the OP's symptom. That the solution doesn't work in the OP's case is owed to machine-specific causes that are unknown at this point.
This answer is about sending BOM-less UTF-8 to an external program; if you're looking to make your PowerShell console windows use UTF-8 in all respects, see this answer.
To ensure that your Java program receives its input UTF-8-encoded without a BOM, you must set $OutputEncoding
to a System.Text.UTF8Encoding
instance that does not emit a BOM:
# Assigns UTF-8 encoding *without a BOM*.
# PowerShell uses this encoding to encode data piped to external programs.
# $OutputEncoding defaults to ASCII(!) in Windows PowerShell, and more sensibly
# to BOM-*less* UTF-8 in PowerShell [Core] v6+
$OutputEncoding = [Text.UTF8Encoding]::new($false)
Caveats:
Do NOT use the seemingly equivalent New-Object Text.Utf8Encoding $false
, because, due to the bug described in GitHub issue #5763, it won't work if you assign to $OutputEncoding
in a non-global scope, such as in a script. In PowerShell v4 and below, use
(New-Object Text.Utf8Encoding $false).psobject.BaseObject
as a workaround.
Windows 10 version 1903 and up allow you to set BOM-less UTF-8 as the system-wide default encoding (although note that the feature is still classified as beta as of version 20H2) - see this answer. In PowerShell [Core] up to v7.0 (fixed in PowerShell 7.1), with this feature turned on, the above technique is not effective, due to a presumptive .NET Core bug that causes a UTF-8 BOM always to be emitted, irrespective of what encoding you set $OutputEncoding
to (the bug is possibly connected to GitHub issue #28929); in those versions, the only solution is to turn the feature off, as shown in imgx64's answer.
If, by contrast, you use [Text.Encoding]::Utf8
, you'll get a System.Text.Encoding.UTF8
instance with BOM - which is what I suspect happened in your case.
Note that this problem is unrelated to the source encoding of any file read by Get-Content
, because what is sent through the PowerShell pipeline is never a stream of raw bytes, but .NET objects, which in the case of Get-Content
means that .NET strings are sent (System.String
, internally a sequence of UTF-16 code units).
Because you're piping to an external program (a Java application, in your case), PowerShell character-encodes the (stringified-on-demand) objects sent to it based on preference variable $OutputEncoding
, and the resulting encoding is what the external program receives.
Perhaps surprisingly, even though BOMs are typically only used in files, PowerShell respects the BOM setting of the encoding assigned to $OutputEncoding
also in the pipeline, prepending it to the first line sent (only).
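To confirm from the receiving side whether a BOM actually arrived, a small Java diagnostic can inspect the first three bytes read from stdin. This is only a sketch: the class name BomCheck and its method are hypothetical and not part of the question's program. It also shows why the BOM matters to a Java receiver: Java's UTF-8 decoding does not strip a leading BOM but keeps it as the character U+FEFF.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; class and method names are illustrative.
final class BomCheck {
    // True if the data starts with the UTF-8 BOM bytes EF BB BF.
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        byte[] input = System.in.readAllBytes(); // requires Java 9+
        System.out.println("UTF-8 BOM received: " + startsWithUtf8Bom(input));

        // Java's UTF-8 decoding keeps a leading BOM as the character U+FEFF.
        String text = new String(input, StandardCharsets.UTF_8);
        System.out.println("First char is U+FEFF: "
                + (!text.isEmpty() && text.charAt(0) == '\uFEFF'));
    }
}
```

On Java 11+, this can be run directly as a source file, e.g. 'hö' | java BomCheck.java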
See the bottom section of this answer for more information about how PowerShell handles pipeline input for and output from external programs, including how it is [Console]::OutputEncoding
that matters when PowerShell interprets data received from external programs.
To illustrate the difference using your sample program (note how using a PowerShell string literal as input is sufficient; no need to read from a file):
# Note the EF BB BF sequence representing the UTF-8 BOM.
# Enclosure in & { ... } ensures that a local, temporary copy of $OutputEncoding
# is used.
PS> & { $OutputEncoding = [Text.Encoding]::Utf8; 'hö' | java Hex }
EF BB BF 68 C3 B6 0D 0A
# Note the absence of EF BB BF, due to using a BOM-less
# UTF-8 encoding.
PS> & { $OutputEncoding = [Text.Utf8Encoding]::new($false); 'hö' | java Hex }
68 C3 B6 0D 0A
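The question's Hex program isn't reproduced here. For readers who want to repeat the commands above, a minimal stand-in might look like the following, assuming the program does nothing more than print every byte it reads from stdin as two hex digits. The class name StdinHex is hypothetical; rename it to Hex to reuse the commands verbatim.

```java
import java.io.IOException;

// Hypothetical stand-in for the question's Hex program (not the OP's actual code).
final class StdinHex {
    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Read the raw bytes exactly as PowerShell encoded them (requires Java 9+).
        System.out.println(toHex(System.in.readAllBytes()));
    }
}
```

Because it reads raw bytes rather than decoded characters, a program like this shows exactly what $OutputEncoding produced, including any BOM.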
In Windows PowerShell, where $OutputEncoding
defaults to ASCII(!), you'd see the following with the default in place:
# The default of ASCII(!) results in *lossy* encoding in Windows PowerShell.
PS> 'hö' | java Hex
68 3F 0D 0A
Note that 3F
represents the literal ?
character, which is what the non-ASCII ö
character was transliterated to, given that it has no representation in ASCII; in other words: information was lost.
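The same lossy substitution can be reproduced outside PowerShell. The following Java sketch is an analogy only, not what PowerShell runs internally; the class and method names are hypothetical. It encodes the string hö once as ASCII and once as UTF-8 and prints the resulting bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative analogy in Java; PowerShell itself uses .NET encodings.
final class EncodingDemo {
    // Encodes the text with the given charset and returns the bytes as hex.
    static String encodeHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "h\u00F6"; // "hö", escaped to be independent of source-file encoding
        // ASCII cannot represent ö, so it is replaced with ? (0x3F).
        System.out.println(encodeHex(text, StandardCharsets.US_ASCII)); // 68 3F
        // UTF-8 encodes ö as the two bytes C3 B6.
        System.out.println(encodeHex(text, StandardCharsets.UTF_8));    // 68 C3 B6
    }
}
```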
PowerShell [Core] v6+ now sensibly defaults to BOM-less UTF-8, so the default behavior there is as expected.
While BOM-less UTF-8 is PowerShell [Core]'s consistent default, also for cmdlets that read from and write to files, on Windows [Console]::OutputEncoding
still reflects the active OEM code page by default as of v7.0, so to correctly capture output from UTF-8-emitting external programs, it must be set to [Text.UTF8Encoding]::new($false)
as well - see GitHub issue #7233.
[string](Get-Content input.txt) | my-program args – Icken
(Get-Content input.txt) | which passes a single string (with newlines) to the pipeline and Get-Content input.txt | which passes multiple strings to the pipeline (where each string represents a line). Note that if you pass this to a variable (String[]) it might be separated by a space or a newline depending on how you display it. Also note that for the latter syntax your my-program needs to be able to process each individual item in the pipeline. Given the details in your question I doubt whether your program is actually doing that. – Especially
my-program needs to be able to process each item in the pipeline". In my case the program is a Java program reading stdin, and the parentheses do not change the text read. – Translative
Get-Content doesn't pass any BOM information. If you are not in a normal environment, you should supply details like OS, PowerShell version, etc. To confirm that you are really retrieving any BOM information with Get-Content from your file, please show us the first few lines of your file: Get-Content .\Bom.txt | Select -First 3 | ForEach-Object { "$([Byte[]]$_.ToCharArray())" }. Please add these details to the question, see also: minimal reproducible example. – Especially
41 42 43 0D 0A. It doesn't matter what the encoding of the file is. What OS, PowerShell version, and Java version are you using? PS 6 & 7 do the same. – Phebephedra
$OutputEncoding is used to determine the encoding. You could try $OutputEncoding = [System.Text.UTF8Encoding]::new($false) before performing your commands. – Wavy
Set-Content 'D:\textfile.txt' "ABC" -Encoding Ascii; Get-Content 'D:\textfile.txt' -Encoding Byte | ForEach-Object { '{0:X2}' -f $_ } returns 41 42 43 0D 0A. No BOM whatsoever. As said in my answer, check the OutputEncoding you have set in PowerShell and change that to use UTF8 without BOM if needed. – Jaimeejaimes
chcp 65001 at some point? In that case, I recommend turning that back to chcp 5129 for English - New Zealand. See here – Jaimeejaimes
get-content textfile | format-hex. It doesn't for me in osx, even if the file has a bom. I'm in ps 7 rc1 though. – Phebephedra
Get-Content never sends a BOM through the pipeline, and Format-Hex is a PowerShell command, not an external program such as java. A BOM may appear for unrelated reasons, irrespective of where the data came from: It can appear as a side effect of setting $OutputEncoding to an encoding with a BOM, which causes PowerShell to encode the string sent to external programs with that BOM; AdminOfThings' comment shows the solution that should work in a normal PS environment (there's something unusual going on on one of user's machines). – Jabez