How to load or read an XML file using ConvertTo-Xml and Select-Xml?

How can I accomplish something like this:

PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $date=(Get-Date | ConvertTo-Xml)                                         
PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $date

xml                            Objects
---                            -------
version="1.0" encoding="utf-8" Objects

PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $date.OuterXml
<?xml version="1.0" encoding="utf-8"?><Objects><Object Type="System.DateTime">12/12/2020 2:43:46 AM</Object></Objects>
PS /home/nicholas/powershell> 

but, instead, reading in a file?


How do I load/import/read/convert an XML file using ConvertTo-Xml for parsing with Select-Xml using XPath?

PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $xml=ConvertTo-Xml ./bookstore.xml
PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $xml                              

xml                            Objects
---                            -------
version="1.0" encoding="utf-8" Objects

PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $xml.InnerXml                     
<?xml version="1.0" encoding="utf-8"?><Objects><Object Type="System.String">./bookstore.xml</Object></Objects>
PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> $xml.OuterXml                     
<?xml version="1.0" encoding="utf-8"?><Objects><Object Type="System.String">./bookstore.xml</Object></Objects>
PS /home/nicholas/powershell> 
PS /home/nicholas/powershell> cat ./bookstore.xml

<?xml version="1.0"?>
<!-- A fragment of a book store inventory database -->
<bookstore xmlns:bk="urn:samples">
  <book genre="novel" publicationdate="1997" bk:ISBN="1-861001-57-8">
    <title>Pride And Prejudice</title>
    <author>
      <first-name>Jane</first-name>
      <last-name>Austen</last-name>
    </author>
    <price>24.95</price>
  </book>
  <book genre="novel" publicationdate="1992" bk:ISBN="1-861002-30-1">
    <title>The Handmaid's Tale</title>
    <author>
      <first-name>Margaret</first-name>
      <last-name>Atwood</last-name>
    </author>
    <price>29.95</price>
  </book>
  <book genre="novel" publicationdate="1991" bk:ISBN="1-861001-57-6">
    <title>Emma</title>
    <author>
      <first-name>Jane</first-name>
      <last-name>Austen</last-name>
    </author>
    <price>19.95</price>
  </book>
  <book genre="novel" publicationdate="1982" bk:ISBN="1-861001-45-3">
    <title>Sense and Sensibility</title>
    <author>
      <first-name>Jane</first-name>
      <last-name>Austen</last-name>
    </author>
    <price>19.95</price>
  </book>
</bookstore>

PS /home/nicholas/powershell> 

Creating the XML within the REPL console itself works as expected, as in this related question:

How to parse XML in Powershell with Select-Xml and Xpath?

Photosensitive answered 12/12, 2020 at 10:35 Comment(6)
$xml = [xml]( Get-Content .\bookstore.xml -raw ); $xml | Select-Xml YourXPath - Benefic
@Benefic No, don't use Get-Content and cast the result to XML. This is the single most common error I see when people read XML in PowerShell. Use $doc = New-Object xml; $doc.Load('path.to.xml');. This deals with file encodings properly. Using Get-Content happily mangles your data. - Consummation
@Consummation Even with Get-Content -raw? - Benefic
@Benefic Yeah, even then. See my answer for the gist of it. - Consummation
@Consummation Got it. Probably just got lucky because most XML documents are UTF-8 encoded, which happens to be the default encoding used by Get-Content. - Benefic
@Benefic Nowadays. Earlier versions of PS defaulted to whatever "ANSI" default encoding your system had, in Europe/the US likely Windows-1252. Get-Content pays attention to the BOM, so it will recognize UTF-16 unaided, but UTF-8 downloaded from the Internet usually has no BOM. And Get-Content will continue to butcher "foreign" single-byte encodings. Ultimately, it really is luck when it works. And it's entirely unnecessary to rely on luck with XML when transparent encoding detection is a fundamental part of the spec. - Consummation

Properly reading an XML document in PowerShell works like this:

$doc = New-Object xml
$doc.Load( (Convert-Path bookstore.xml) )

XML can come in numerous file encodings, and using the XmlDocument.Load method makes sure the file is read properly without prior knowledge of the encoding.
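Since the question asks about Select-Xml and XPath, here is a minimal sketch of querying the loaded document. It assumes a trimmed, hypothetical stand-in for bookstore.xml written to the temp directory so the snippet runs on its own; the variable names are illustrative only.

```powershell
# Hypothetical stand-in for bookstore.xml (two books only), written to a temp file.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'bookstore-sample.xml'
@'
<?xml version="1.0"?>
<bookstore xmlns:bk="urn:samples">
  <book genre="novel" publicationdate="1997" bk:ISBN="1-861001-57-8">
    <title>Pride And Prejudice</title>
  </book>
  <book genre="novel" publicationdate="1991" bk:ISBN="1-861001-57-6">
    <title>Emma</title>
  </book>
</bookstore>
'@ | Set-Content -LiteralPath $path -Encoding UTF8

# Load the file the same way as above; Convert-Path yields a full native path.
$doc = New-Object xml
$doc.Load( (Convert-Path -LiteralPath $path) )

# Query the already-parsed document: -Xml avoids reading the file a second time.
$titles = Select-Xml -Xml $doc -XPath '//book/title' |
    ForEach-Object { $_.Node.InnerText }

# bk:ISBN is a namespaced attribute, so the prefix has to be mapped for XPath.
$ns = @{ bk = 'urn:samples' }
$isbns = Select-Xml -Xml $doc -XPath '//book/@bk:ISBN' -Namespace $ns |
    ForEach-Object { $_.Node.Value }

$titles
$isbns
```

Passing the parsed document through -Xml keeps the encoding handling in XmlDocument.Load and leaves Select-Xml to do only the XPath work.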

Not reading a file with the correct encoding will result in mangled data or errors except in very basic or very lucky cases.

The often-seen method of using Get-Content and casting the resulting string to [xml] is the wrong way of dealing with XML for this very reason. So don't do that.

You can get a correct result with Get-Content, but that requires

  1. Prior knowledge of the file encoding (e.g. Get-Content bookstore.xml -Encoding UTF8)
  2. Hard-coding the file encoding into your script (meaning it will break if the XML encoding ever changes unexpectedly)
  3. Limiting yourself to the very few file encodings that Get-Content supports (XML supports more)

It means you put yourself in a position where you have to manually think about and solve a problem that XML has been specifically designed to automatically handle for you.
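To make the difference concrete, here is a small sketch under stated assumptions: a hypothetical ISO-8859-1 file is created in the temp directory, and Get-Content is deliberately given the wrong encoding (UTF8) to simulate a bad guess. The file name and variable names are made up for illustration.

```powershell
# Hypothetical sample: an ISO-8859-1 file that declares its encoding in the XML prolog.
$path   = Join-Path ([System.IO.Path]::GetTempPath()) 'latin1-sample.xml'
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
# [char]0xE9 is 'e' with an acute accent; built this way so the script's own encoding does not matter.
$text   = '<?xml version="1.0" encoding="iso-8859-1"?><author>Ren{0}</author>' -f [char]0xE9
[System.IO.File]::WriteAllBytes($path, $latin1.GetBytes($text))

# XmlDocument.Load reads the declared encoding from the file itself.
$doc = New-Object xml
$doc.Load($path)
$viaLoad = $doc.DocumentElement.InnerText

# Get-Content decodes the bytes before the XML parser ever sees them.
# Here the encoding is wrong on purpose, so the accented character is lost.
$viaGetContent = ([xml](Get-Content -LiteralPath $path -Raw -Encoding UTF8)).DocumentElement.InnerText

# Compare whether the accented character survived each route.
$viaLoad.IndexOf([char]0xE9) -ge 0
$viaGetContent.IndexOf([char]0xE9) -ge 0
```

The Load route keeps the accented character, while the Get-Content route only would if the encoding passed to it happened to match the file.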

Doing things correctly with Get-Content means a lot of unnecessary extra work and accepting its limitations. And doing things incorrectly is pointless when doing it right is so easy.


Examples, after loading $doc as shown above:

$doc.bookstore.book

prints a list of <book> elements and their properties

genre           : novel
publicationdate : 1997
ISBN            : 1-861001-57-8
title           : Pride And Prejudice
author          : author
price           : 24.95

genre           : novel
publicationdate : 1992
ISBN            : 1-861002-30-1
title           : The Handmaid's Tale
author          : author
price           : 29.95

genre           : novel
publicationdate : 1991
ISBN            : 1-861001-57-6
title           : Emma
author          : author
price           : 19.95

genre           : novel
publicationdate : 1982
ISBN            : 1-861001-45-3
title           : Sense and Sensibility
author          : author
price           : 19.95

$doc.bookstore.book | Format-Table

prints the same thing as a table

genre publicationdate ISBN          title                 author price
----- --------------- ----          -----                 ------ -----
novel 1997            1-861001-57-8 Pride And Prejudice   author 24.95
novel 1992            1-861002-30-1 The Handmaid's Tale   author 29.95
novel 1991            1-861001-57-6 Emma                  author 19.95
novel 1982            1-861001-45-3 Sense and Sensibility author 19.95

$doc.bookstore.book | Where-Object publicationdate -lt 1992 | Format-Table

filters the data

genre publicationdate ISBN          title                 author price
----- --------------- ----          -----                 ------ -----
novel 1991            1-861001-57-6 Emma                  author 19.95
novel 1982            1-861001-45-3 Sense and Sensibility author 19.95
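One caveat about the filter above: attribute values arrive as strings, and PowerShell converts the right-hand operand to the type of the left-hand one, so publicationdate -lt 1992 is a text comparison. It gives the expected result here because all years have four digits. A hedged sketch with a made-up three-digit year (the inline sample and its LoadXml call are only there to keep the snippet self-contained) shows where a cast helps:

```powershell
# Hypothetical inline sample; the year 800 is invented to expose the difference.
$doc = New-Object xml
$doc.LoadXml(@'
<bookstore>
  <book publicationdate="1991"><title>Emma</title></book>
  <book publicationdate="800"><title>Hypothetical old book</title></book>
</bookstore>
'@)

# Text comparison: "800" is compared with "1992" character by character.
$textual = $doc.bookstore.book | Where-Object publicationdate -lt 1992

# Numeric comparison: cast the attribute value before comparing.
$numeric = $doc.bookstore.book | Where-Object { [int]$_.publicationdate -lt 1992 }

@($textual).Count
@($numeric).Count
```

The text comparison misses the three-digit year because "8" sorts after "1"; the cast version finds both books.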

$doc.bookstore.book | Where-Object publicationdate -lt 1992 | Sort-Object publicationdate | Select-Object title

sorts and prints only the <title> field

title                
-----                
Sense and Sensibility
Emma

There are many more ways of slicing and dicing the data; it all depends on what you want to do.
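As one more illustration: the listings above show "author : author" because <author> contains child elements rather than text. A minimal sketch using calculated properties can flatten the nested names and turn the price into a number. The inline two-book sample is a hypothetical stand-in so the snippet runs by itself.

```powershell
# Hypothetical inline sample with the same shape as bookstore.xml.
$doc = New-Object xml
$doc.LoadXml(@'
<bookstore>
  <book>
    <title>Pride And Prejudice</title>
    <author><first-name>Jane</first-name><last-name>Austen</last-name></author>
    <price>24.95</price>
  </book>
  <book>
    <title>The Handmaid's Tale</title>
    <author><first-name>Margaret</first-name><last-name>Atwood</last-name></author>
    <price>29.95</price>
  </book>
</bookstore>
'@)

# Calculated properties: join the nested name parts and cast the price to a number.
$rows = $doc.bookstore.book |
    Select-Object title,
        @{ Name = 'author'; Expression = { '{0} {1}' -f $_.author.'first-name', $_.author.'last-name' } },
        @{ Name = 'price';  Expression = { [decimal]$_.price } } |
    Sort-Object price -Descending

$rows | Format-Table
```

Element names containing a hyphen, such as first-name, need quotes when used as property names. Casting the price lets Sort-Object order the rows numerically instead of as text.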

Consummation answered 12/12, 2020 at 10:57 Comment(17)
but, now it's allononlineoftextandiskindahardtoread. How do I print it out nicely? - Photosensitive
@Nicholas What do you want to print out nicely? Values from the XML? The XML itself? What's the overall goal you want to achieve? - Consummation
It would be convenient to pretty-print the raw XML (as with xmllint) if that's built in to PowerShell. See also https://mcmap.net/q/393717/-how-to-use-the-cmdlet-select-xml-interactive-prompt/4531180 for the ultimate goal. (The printing of XML would just be for convenience.) - Photosensitive
I would add that one gets away with the Get-Content method most of the time only because most XML documents are UTF-8 encoded, which happens to be the default encoding used by Get-Content. Of course this is bad "programming by chance" and should be avoided. I guess most people are using it because they like one-liners. So if we could provide a one-liner for the correct method, this could encourage more people to use it. - Benefic
@Nicholas PowerShell runs on top of .NET. Anything that can be done with .NET can be done with PowerShell, give or take. Pretty-printing XML is certainly possible, but I doubt that that's what you really need. You want to work with the contained values somehow, and outputting a nice table of values is both easier and more useful than printing out an indented XML tree. I'll add an example to my answer. - Consummation
It's embarrassing how many tutorials and even highly up-voted, top-ranked SO answers promote the wrong Get-Content method. E.g. https://mcmap.net/q/380815/-how-to-iterate-through-xml-in-powershell - Benefic
See also: powershellmagazine.com/2013/08/19/… - Photosensitive
@Benefic It really depends on whether you care about writing correct code or not. What's the best one-liner, the fastest loop, the quickest corner to cut worth when the result is incorrect? People are not using Get-Content because it's a one-liner, but because they don't care (or know) about encodings, because it has always worked on their machine, and because it's all over the Internet and they've just copy-pasted it like the rest of their code. ;) But $doc = New-Object xml; $doc.Load($path) (or $doc = [xml]::new(); $doc.Load($path)) fits on one line, so there's that. - Consummation
@Benefic And yeah, it is embarrassing how regularly people get this one wrong. It's a hopeless fight, really, just as hopeless as trying to spread the word that regex cannot handle HTML and that every minute trying to do it anyway is wasted. There are just too many bad examples out there. - Consummation
A pipable one-liner would be preferable for me. Select-Xml -Path comes close but strangely enough seems to have the same encoding issue as Get-Content (just tested with a "windows-1251" encoded XML file, containing Cyrillic letters, which is no problem for the [xml]::Load() method). - Benefic
@Benefic That's amazing! I've never tried it, but Select-Xml actually messes this up (I tried, my PS Version is 5.1.18362). This is an actual bug in PowerShell, and an embarrassing one, too. - Consummation
...regarding the "pipe-ability" - in a script, I don't think it's a huge drawback to have one more line. Directly on the command line for one-offs... I'd call it a minor inconvenience. Overall, knowing the rules is necessary in order to know when you can break them. - Consummation
I've created a bug report for the Select-Xml encoding problem: github.com/PowerShell/PowerShell/issues/14404 - Benefic
@mklement0 Can XmlDocument.Load() handle PowerShell drive specifications? - Consummation
No, .NET APIs know nothing about PowerShell drives (and the PowerShell engine doesn't try to translate them for method calls). The bigger problem is the lack of synchronization of working dirs. between PowerShell and .NET - see github.com/PowerShell/PowerShell/issues/3428 - which alone necessitates passing full paths to .NET methods, and Convert-Path is the right tool for that, due to resolving to a native path - can I suggest you update your answer accordingly? - Tab
@Tab Ah, got it. That's exactly the reason why I've used Resolve-Path, interesting to learn that it's the wrong tool. Go ahead and edit! - Consummation
@zett42, doing what is undoubtedly the right and most robust thing here is indeed so cumbersome - and obscure - that people will keep taking the [xml] (Get-Content -Raw ...) shortcut, unless we provide a PowerShell-idiomatic alternative that is both robust and convenient: please see github.com/PowerShell/PowerShell/issues/14505 - Tab
