Why does PowerShell redirection >> change the formatting of the text content?

I want to use the append (>>) or overwrite (>) redirection operators to write to a .txt file, but when I do, I get strangely formatted text like "a\x00p\x00p...".

I use Set-Content and Add-Content successfully. Why do they function as expected, but not the >> and > redirection operators?

Below I show the output using PowerShell's cat (an alias for Get-Content) as well as a simple Python print.

rocket_brain> new-item test.txt
rocket_brain> "appended using add-content" | add-content test.txt
rocket_brain> cat test.txt

 appended using add-content

but then if I use redirect append >>

rocket_brain> "appended using redirect" >> test.txt
rocket_brain> cat test.txt

 appended using add-content
 a p p e n d e d   u s i n g   r e d i r e c t

Simple Python script: read_test.py

with open("test.txt", "r") as file:   # open test.txt in readmode
    data = file.readlines()           # append each line to the list data
    print(data)                       # output list with each input line as an item

Using read_test.py, I see a difference in formatting:

rocket_brain> python read_test.py
 ['appended using add-content\n', 'a\x00p\x00p\x00e\x00n\x00d\x00e\x00d\x00 \x00u\x00s\x00i\x00n\x00g\x00 \x00r\x00e\x00d\x00i\x00r\x00e\x00c\x00t\x00\r\x00\n', '\x00']

NOTE: If I use only the redirection operators >> (or >) without first using Add-Content, the cat output looks normal (not spaced out), but the Python script then shows the \x00 format for every line (including lines written by any Add-Content command after starting the file with the > operators). Opening the file in Notepad (or VS Code, etc.), the text always looks as expected. Using >> or > in cmd (instead of PowerShell) also stores the text in the expected ASCII format.
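
Hex-dumping the file with the Format-Hex cmdlet (available in PowerShell 5.0+) also shows the difference in the raw bytes:

rocket_brain> Format-Hex -Path test.txt

The line written by Add-Content uses one byte per character, while the line appended with >> has a 00 (NUL) byte after every character.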

Related links: cmd redirection operators, PS redirection operators

Polyphagia answered 9/7, 2019 at 1:34 Comment(4)
i THINK the problem is that the redirection operators in PoSh do not properly [at all?] take into account the target file encoding ... so you are likely seeing two-byte chars in what should have been a one-byte-per-char file. ///// if i create a file via redirection, the file has normal text. if i create the file via Set-Content & use redirection to add to it ... i see what you are seeing [null] chars.Watkins
ah, I hadn't even realized they were two-byte outputs with the null before each letter... That helps a lot. Do you know/think there is a way to set the operators to specific encodings, or would it be better just to ignore the > operators until future support/I am more comfortable with PS?Polyphagia
Note: spaces appearing between codepoints in the U+0000 to U+00FF range are a symptom of reading text encoded as UTF-16LE with a character encoding that uses one byte per codepoint. \x00p, i.e. \x00\x70, is proof.Memnon
@Polyphagia - looks like mklement0 has your answer ... nicely done, too. [grin]Watkins

Note: The problem is ultimately that in Windows PowerShell different cmdlets / operators use different default encodings. This problem has been resolved in PowerShell (Core) 7+, where BOM-less UTF-8 is consistently used.


  • >> blindly applies Out-File's default encoding when appending to an existing file (in effect, > behaves like Out-File and >> like Out-File -Append), which in Windows PowerShell is the encoding named Unicode, i.e., UTF-16LE, where most characters are encoded as 2-byte sequences, even those in the ASCII range; the latter have a 0x0 (NUL) as the high byte.

    • Therefore, unless the target file's existing contents use the same encoding, you'll end up with a mix of different encodings, which is what happened in your case.[1]
  • While Add-Content, by contrast, does try to detect a file's existing encoding (thanks again, js2010), you used it on an empty file, in which case Add-Content uses the same default as Set-Content, which in Windows PowerShell is the encoding named Default, i.e., your system's active legacy ANSI code page.

  • Therefore, to match the single-byte ANSI encoding initially created by your Add-Content call when appending further content, use Out-File -Append -Encoding Default instead of >>, or simply keep using Add-Content (see the sketch after this list).

    • Alternatively, pick a different encoding with Set-Content / Add-Content -Encoding ... and match it in the Out-File -Append call; UTF-8 is generally the best choice, though note that when you create a UTF-8 file in Windows PowerShell, it will start with a BOM, a (pseudo) byte-order mark identifying the file as UTF-8, in the form of 3 bytes at the start of the file, which Unix-like platforms typically do not expect.
      See this answer for workarounds that create BOM-less UTF-8 files.

    • In Windows PowerShell v5.1 (the latest and last version), you may also change the default encoding globally, including for > and >> (which isn't possible in earlier versions) - use with caution, given that every call to a cmdlet that supports an -Encoding parameter will then implicitly use the configured encoding. To change to UTF-8, for instance, use:
      $PSDefaultParameterValues['*:Encoding']='UTF8'.
      To limit the change to Out-File / > / >> only:
      $PSDefaultParameterValues['Out-File:Encoding']='UTF8'.

      • As noted, in Windows PowerShell this will create UTF-8 files with a BOM, and creating BOM-less UTF-8 files requires workarounds.

      • The above technique also works in PowerShell 7+, but given that BOM-less UTF-8 is the consistent default to begin with, this is probably not needed. (In the unlikely event that you want to create UTF-8 files with a BOM, use 'utf8BOM' in the above assignment).
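
Putting the above together, a minimal sketch (the file names are just examples; assumes Windows PowerShell 5.1):

'first line' | Add-Content demo.txt       # creates demo.txt with ANSI content, as in the question

# Append with an explicitly matching (ANSI) encoding instead of >>:
'second line' | Out-File demo.txt -Append -Encoding Default

# Alternatively, opt into UTF-8 for Out-File / > / >> (files get a UTF-8 BOM):
$PSDefaultParameterValues['Out-File:Encoding'] = 'UTF8'

# For BOM-less UTF-8 in Windows PowerShell, call .NET directly;
# [IO.File]::WriteAllLines() writes UTF-8 without a BOM by default:
[IO.File]::WriteAllLines("$PWD\demo-nobom.txt", [string[]] 'some text')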


Aside from different default encodings (in Windows PowerShell), it is important to note that Set-Content / Add-Content on the one hand and > / >> / Out-File [-Append] on the other behave fundamentally differently with non-string input:

In short: the former apply simple .ToString()-formatting to the input objects, whereas the latter perform the same output formatting you would see in the console - see this answer for details.
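
For example, here is a minimal sketch of that difference (the file names are just examples):

$obj = Get-Item .           # a DirectoryInfo object, not a string

$obj | Set-Content sc.txt   # writes $obj.ToString(), essentially just the path
$obj | Out-File of.txt      # writes the console-style directory listing,
                            # including the Mode / LastWriteTime / Length / Name header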


[1] Due to the initial content set by Add-Content, Windows PowerShell interprets the file as ANSI-encoded (the default in the absence of a BOM), where each byte is its own character. The UTF-16 content appended later is therefore also interpreted as if it were ANSI, so the 0x0 bytes are treated like characters in their own right, which print to the console like spaces.

Pinafore answered 9/7, 2019 at 2:17 Comment(6)
How would Add-Content and >> know the character encoding of an existing text file? They could guess but that would be worse than having a default. As you suggest, you could tell them. Problem solved.Memnon
Would you recommend installing PowerShell Core for a casual shell programmer?Polyphagia
@rocket_brain. Yes - PowerShell Core is the future, so unless you have a need for functionality that is still only available in Windows PowerShell, I recommend switching. The upcoming PowerShell [Core] version 7 will try to bring most of the still-missing functionality to PowerShell Core, which will make Windows PowerShell largely obsolete (it is only receiving bug fixes at this point, no new features).Pinafore
@TomBlodget: As js2010 points out, Add-Content (but not >>) actually does try to match the existing encoding...Pinafore
I had to apply ASCII for my Notepad file. PowerShell's Out-File uses utf8NoBOM by default and VSCode uses it too.Crackbrain
@ToddPartridge, only PowerShell (Core) 7+ uses utf8NoBOM by default, across all cmdlets (commendably so, in this day and age), so if your file's content comprises only ASCII-range characters, such a file is by definition also an ASCII file. By contrast, in Windows PowerShell Out-File (as well as > and >>) creates "Unicode" (UTF-16LE) files by default, which can cause problems.Pinafore

>> or Out-File -Append will append Unicode (UTF-16LE) text by default, even if the file isn't Unicode in the first place. Add-Content will check the encoding of the file first and match it. Add-Content and Set-Content default to ANSI encoding as well. I would never use >, >>, or Out-File.

Seeing something with spaces in between is a giveaway that it's Unicode: UTF-16 usually has NUL bytes between the letters. If you dump the hex, for example in Emacs with M-x hexl-mode, you can see it. BOMs are 2 or 3 bytes at the beginning of a file.

a p p e n d e d   u s i n g   r e d i r e c t

This is a correctly constructed Unicode text file, copied and pasted from Emacs hexl-mode. FF FE is the BOM. After each character is a 00. At the end are 0D and 0A, carriage return and linefeed. Stuff like this interests me. It's possible for some Windows utilities to make a Unicode text file with no BOM (icacls /save). Then, if you type the file, the letters will appear to have spaces in between.

00000000: fffe 6100 7000 7000 6500 6e00 6400 6500  ..a.p.p.e.n.d.e.
00000010: 6400 2000 7500 7300 6900 6e00 6700 2000  d. .u.s.i.n.g. .
00000020: 7200 6500 6400 6900 7200 6500 6300 7400  r.e.d.i.r.e.c.t.
00000030: 0d00 0a00                                ....
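
If you don't have Emacs handy, you can reproduce the same byte pattern in PowerShell itself; a minimal sketch using .NET's UTF-16LE encoder (which .NET calls "Unicode"):

$enc   = [Text.Encoding]::Unicode
$bytes = $enc.GetPreamble() + $enc.GetBytes("appended using redirect`r`n")
($bytes | ForEach-Object ToString x2) -join ' '   # ff fe 61 00 70 00 70 00 65 00 ...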
Pontoon answered 9/7, 2019 at 3:31 Comment(1)
I noticed when I reversed the order (>> before add-content), it unified the format to UTF-16. It only further confused me then. Now it is much clearer. Thanks!Polyphagia

>> and > redirect console output, so I assume that could also include some unexpected characters at times. >> and > are most closely related to the Out-File cmdlet.

Add-Content does not forward console output to a file; it only writes the values you provide to it (e.g., a variable or pipeline object).

about_redirection

Deterioration answered 9/7, 2019 at 1:53 Comment(1)
There is no concept of console output with respect to redirections in PowerShell. Out-File and its virtual aliases, > and >>, as well as Set-Content / Add-Content all operate on pipeline input. That is, they all write the same data to the target file, albeit potentially formatted differently - see this answer for details.Pinafore
