Import-Csv / Export-Csv with german umlauts (ä,ö,ü)
I came across a little issue when dealing with CSV exports that contain mutated vowels such as ä, ö, ü (German umlauts).

I simply export with:

Get-WinEvent -FilterHashtable @{Path=$_;ID=4627} -ErrorAction SilentlyContinue |export-csv -NoTypeInformation -Encoding Default -Force ("c:\temp\CSV_temp\"+ $_.basename + ".csv")

which works fine: the ä, ö, ü appear correctly in my CSV file.

After that I do a little sorting with:

Get-ChildItem 'C:\temp\*.csv' |
    ForEach-Object { Import-Csv $_.FullName } |
    Sort-Object { [DateTime]::ParseExact($_.TimeCreated, $pattern, $culture) } |
    Export-Csv 'C:\temp\merged.csv' -Encoding Default -NoTypeInformation -Force

I played around with all the encodings (ASCII, BigEndianUnicode, the Unicode variants) with no success.

How can I preserve the special characters ä, ö, ü (and others) when exporting and sorting?

Armand answered 23/2, 2018 at 11:46 Comment(5)
Make sure you supply the correct encoding to Import-Csv as well. – Silica
Awesome, I simply missed the encoding on the Import-Csv call inside the ForEach-Object. Thank you so much, it works perfectly. – Armand
The fundamental rule of character encodings is to read bytes that represent text with the same encoding that it was written with. Default varies from machine to machine, user to user, and even time to time, so it is very rarely one that a writer would want to use. – Mullein
@TomBlodget: In the (legacy) Windows world, the default encoding varies by culture and is set at the machine level (the Windows "ANSI" code page, ACP). While such an unmarked, culture-dependent encoding is increasingly problematic, it is still very common, and Windows PowerShell in part defaults to it, notably with the basic Get-Content and Set-Content cmdlets. However, despite even naming the ACP encoding Default when used with the -Encoding parameter, other cmdlets do not use it by default, which is the truly problematic aspect here. – Maleficent
Sure, it's troublesome when standard cmdlets don't have consistent behavior. But look at it this way: since 1996, the Windows OS has used the Unicode character set. So if your text source is the OS, such as via Get-WinEvent, and you don't use a Unicode encoding, such as UTF-8, you potentially get silent data loss. One strategy, then, is to just specify UTF-8 everywhere. – Mullein

Mathias R. Jessen provides the crucial pointer in a comment on the question:

It is the Import-Csv call, not Export-Csv, that is the cause of the problem in your case:

Like Export-Csv, Import-Csv too needs to be passed -Encoding Default in order to properly process text files encoded with the system's active "ANSI" legacy code page, which is an 8-bit, single-byte character encoding such as Windows-1252.
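Applied to the pipeline from the question, the fix is therefore to pass -Encoding Default to Import-Csv too (a sketch; $pattern and $culture are the question's own variables, not defined here):

Get-ChildItem 'C:\temp\*.csv' |
    ForEach-Object { Import-Csv $_.FullName -Encoding Default } |  # <- the crucial addition
    Sort-Object { [DateTime]::ParseExact($_.TimeCreated, $pattern, $culture) } |
    Export-Csv 'C:\temp\merged.csv' -Encoding Default -NoTypeInformation -Force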

In Windows PowerShell, even though the generic text-file processing Get-Content / Set-Content cmdlet pair defaults to Default encoding (as the name suggests), regrettably and surprisingly, Import-Csv and Export-Csv do not.

Note that, on reading, a default encoding is only assumed if the input file has no BOM (byte-order mark, a.k.a. Unicode signature: a magic byte sequence at the start of a file that unambiguously identifies the file's encoding).

Not only do Import-Csv and Export-Csv have defaults that differ from Get-Content / Set-Content, they individually have different defaults:

  • Import-Csv defaults to UTF-8.
  • Export-Csv defaults to ASCII(!), which means that any non-ASCII characters, such as ä, ö, ü, are transliterated to literal ? characters, resulting in loss of data.

By contrast, in PowerShell Core, the cross-platform edition built on .NET Core, the default encoding is (BOM-less) UTF-8, consistently, across all cmdlets, which greatly simplifies matters and makes it much easier to determine when you do need to use the -Encoding parameter.


Demonstration of the Windows PowerShell Import-Csv / Export-Csv behavior

Import-Csv - defaults to UTF-8:

# Sample CSV content.
$str = @'
Column1
aäöü
'@

# Write sample CSV file 't.csv' using UTF-8 encoding *without a BOM*
# (Note that this cannot be done with standard PowerShell cmdlets.)
$null = new-item -type file t.csv -Force
[io.file]::WriteAllLines((Convert-Path t.csv), $str)

# Use Import-Csv to read the file, which correctly preserves the UTF-8-encoded
# umlauts
Import-Csv .\t.csv

The above yields:

Column1
-------
aäöü

As you can see, the umlauts were correctly preserved.

By contrast, had the file been "ANSI"-encoded ($str | Set-Content t.csv; -Encoding Default implied), the umlauts would have gotten corrupted.
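That corruption can be demonstrated as well (a sketch, assuming a Windows PowerShell session on a system whose ANSI code page is Windows-1252):

# Write the sample CSV as "ANSI" (-Encoding Default is implied in Windows PowerShell).
$str | Set-Content t.ansi.csv

# Import-Csv assumes UTF-8 for this BOM-less file, so the single-byte "ANSI"
# umlauts are invalid UTF-8 sequences and come out garbled.
Import-Csv .\t.ansi.csv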


Export-Csv - defaults to ASCII - risk of data loss:

Building on the above example:

Import-Csv .\t.csv | Export-Csv .\t.new.csv
Get-Content .\t.new.csv

yields:

"Column1"
"a???"

As you can see, the umlauts were replaced by literal question marks (?).
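The remedy is to make the output encoding explicit (a sketch; in Windows PowerShell, -Encoding UTF8 writes UTF-8 with a BOM, so subsequent reads detect the encoding automatically):

# Re-export with an explicit encoding; UTF8 (or Default) preserves the umlauts.
Import-Csv .\t.csv | Export-Csv .\t.new.csv -Encoding UTF8
Get-Content .\t.new.csv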

Maleficent answered 25/2, 2018 at 18:17 Comment(2)
Export-Csv defaults to ASCII(!) – insane, but true! Thanks for this detailed answer. – Phocine
Yeah, @Andre, it's quite unfortunate; glad to hear the answer helped. Thankfully, PowerShell (Core) 7+ now uses UTF-8 consistently across all cmdlets. – Maleficent

© 2022 - 2024 — McMap. All rights reserved.