Native shell command set to extract node value from XML

I'm trying to extract the value of a node from a pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project>
    <parent>
        <groupId>org.me.labs</groupId>
        <artifactId>my-random-project</artifactId>
        <version>1.5.0</version>
    </parent>
    ...
</project>

I need to extract the artifactId and version from the XML using a shell command. I have the following requirements/observations:

  1. The shell script will be done within a build assembly file we use at work, so the smaller the script the better.
  2. Since it'll be used on multiple systems (usually RHEL5), I'm looking for something that can run natively on default images.
  3. Tags like <artifactId> and <version> can occur elsewhere in the pom, so I can't simply awk for those tags.

I have tried the following:

  1. xpath works on my Mac, but isn't available by default on RHEL machines. Similarly for xmllint --xpath, which I guess is only available on later versions of xmllint, which I don't have and can't enforce.
  2. xmllint --pattern seemed promising, but I can't seem to get an output out of xmllint --pattern '//project/parent/version' pom.xml (prints entire XML) or xmllint --stream --pattern '//project/parent/version' pom.xml (no output).

I realize this is a common question here on SO, but the points above are why I can't use those answers. TIA for your help.

Accidental answered 6/6, 2013 at 10:35 Comment(0)
A
18

I've managed to solve it for the time being with this rather unwieldy script using xmllint --shell.

echo "cat //project/parent/version" | xmllint --shell pom.xml | sed '/^\/ >/d' | sed 's/<[^>]*.//g'

If the XML nodes have namespace attributes like my pom.xml had, things get heavier, basically extracting the node by name:

echo "cat //*[local-name()='project']/*[local-name()='parent']/*[local-name()='version']" | xmllint --shell pom.xml | sed '/^\/ >/d' | sed 's/<[^>]*.//g'

Hope it helps. If anyone can simplify these expressions, I'd be grateful.

Accidental answered 6/6, 2013 at 12:34 Comment(3)
Alternatively, you can use this: echo "cat //*[local-name()='project']/*[local-name()='parent']/*[local-name()='version']/text()" | xmllint --shell pom.xml | sed '/^\/ >/d', so you only need sed to remove the xmllint shell prompt lines. Hasdrubal
If you have a recent enough xmllint, then you don't need the --shell stuff: xmllint --xpath '/*[local-name()="project"]/...' pom.xml. The local-name() part was what I was missing for my script. Moonscape
Thank you for your answer: echo "cat //*[local-name()='project']/*[local-name()='parent']/*[local-name()='version']" | xmllint --shell pom.xml | sed '/^\/ >/d' | sed 's/<[^>]*.//g' Mcinnis
P
29

--format is used only to format (indent, etc) the document. You can do that using --xpath (tested in Ubuntu, libxml v20900):

$ xmllint --xpath "//project/parent/version/text()" pom.xml
1.5.0
Proven answered 6/6, 2013 at 10:53 Comment(5)
Like I said, my version of xmllint doesn't seem to support the --xpath option. And I don't want to chance that it'll be available on my build systems. Accidental
Oh sorry, I didn't notice. Is python/libxml2 an option? Proven
Also: xpath -q -e "//project/parent/version/text()" pom.xml Proven
I'm trying to stay away from third-party libraries (libxml2) or any tools (xpath) I can't guarantee will be available on a Linux machine. I guess if no combination of the native tools can be used, I'll have to hack it somehow. Accidental
With regard to your answer, I meant xmllint --pattern. I've made the change in the post. Accidental
C
6

I came here looking for a nice way to scrape a value from a website. The following example may be useful to those (unlike the poster) who have a version of xmllint which supports --xpath.

I needed to pull the most recent stable version of the elasticsearch .deb file and install it. The maintainers have helpfully put the version number in a span with the class "version".

version=`curl -s http://www.elasticsearch.org/download/ |\
 xmllint --html --xpath '//span[@class="version"]/text()'\
 2>/dev/null - `;

What goes on:

We use the curl -s (silent) option.

curl -s http://www.elasticsearch.org/download/

We use the xmllint --html and --xpath switches. The xpath arguments (in single quotes)

'//span[@class="version"]/text()'

... looks for a <span> node with the class attribute (@class) "version", and extracts the text value (/text()).

Since xmllint is (surprise!) a linter, it will squawk about the inevitable garbage in your html stream. We direct the stderr to /dev/null in the usual way:

 2>/dev/null

Finally, note the " - " at the end of the xmllint command, which tells xmllint the stream is coming from stdin.

Coeval answered 5/12, 2013 at 16:36 Comment(1)
Karthik. V, this is not a good answer for you, but your question is well-named, so it's pretty high up in a Google search. I thought I'd add this for people like me who are looking for a quick answer and have different tools. Coeval
M
3

Using the text() XPath function gives you the element value, rather than having to remove the XML tags:

echo "cat //project/parent/version/text()" | xmllint --shell pom.xml
Monster answered 6/11, 2013 at 0:36 Comment(1)
Sorry, text() doesn't work, nor does /value/text(). What version of libxml2 are you using? I have 2.7.6. Undersheriff
H
1

You can try

xmllint --xpath "/*[name()='project']/*[name()='groupId']/text()" pom.xml

Holtz answered 17/10, 2017 at 14:52 Comment(1)
It worked fine. When I tried earlier, I got an error saying --xpath is unknown. I just copied your answer and modified it for my requirement, and it is working! Odyl
L
0

With POMs you may run into namespace problems that prevent xmllint from working as expected. This article points you to a very good alternative solution (look at the sed paragraph).

Libertine answered 23/4, 2018 at 6:19 Comment(0)
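
The sed-based alternative mentioned in the answer above is not spelled out in this thread, so here is a minimal sketch of what such an approach could look like, using only sed. It is not taken from the linked article. It assumes every element sits on its own line, as in the question's pom.xml. The helper name parent_field and the extra <version>9.9.9</version> line outside <parent> are hypothetical and only there for illustration. Since sed matches text rather than parsing XML, this will break on POMs that are laid out differently.

```shell
#!/bin/sh
# Hypothetical helper: print the text of <tag> found between <parent> and </parent>.
# Assumption: opening tag, value and closing tag are on the same line.
parent_field() {
    tag=$1
    file=$2
    # Only lines inside the <parent>...</parent> range are considered;
    # on those, keep just the text between <tag> and </tag>.
    sed -n "/<parent>/,/<\/parent>/s/.*<$tag>\(.*\)<\/$tag>.*/\1/p" "$file"
}

# Sample input based on the question, written to a temporary file.
pom=$(mktemp)
trap 'rm -f "$pom"' EXIT
cat > "$pom" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<project>
    <parent>
        <groupId>org.me.labs</groupId>
        <artifactId>my-random-project</artifactId>
        <version>1.5.0</version>
    </parent>
    <version>9.9.9</version>
</project>
EOF

parent_field artifactId "$pom"   # prints my-random-project
parent_field version "$pom"      # prints 1.5.0, not 9.9.9
```

Because the pattern ignores attributes on the root element, an xmlns declaration on <project> does not affect this sketch, which is the main reason to consider it when xmllint's namespace handling gets in the way.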