How to append a newline after every match using xmllint --xpath

Asked 30/8, 2013 at 12:28 Answered 9/7, 2021 at 1:38

I have the following HTML code:

<textarea name="command" class="setting-input   fixed-width" rows="9">1</textarea><textarea name="command" class="setting-input   fixed-width" rows="5">2</textarea>

I would like to parse it to receive such output:

1
2

Currently I am using:

xmllint --xpath '//textarea[@name="command"]/text()' --html

but it does not append a newline after each match.

Fred answered 30/8, 2013 at 12:28 Comment(5)

How are you getting the output now? and where did you test it? – Debor 30/8, 2013 at 12:52

@Babai Assuming the above HTML code is available in file f, xmllint --xpath '//textarea[@name="command"]/text()' --html f – Fred 30/8, 2013 at 12:53

Actually, I tested it in an online tool, and the text nodes come out on separate lines, so I'm trying to understand where you want to print it. – Debor 30/8, 2013 at 12:55

@Babai In that case, I guess the tool you are using behaves differently than xmllint. – Fred 30/8, 2013 at 12:57

Hello from the future! This behavior was fixed in libxml2 version 2.9.9, and --xpath now (finally) does the thing you expect when dumping XPath nodes. If you are stuck with an old libxml2, see my answer below for an alternative solution using XMLStarlet. – Rabid 5/2, 2020 at 23:56

Hello from the year 2020!

As of v2.9.9 of libxml, this behavior has been fixed in xmllint itself.

echo \
'<textarea name="command" class="setting-input fixed-width"
 rows="9">1</textarea>
<textarea name="command" class="setting-input fixed-width"
 rows="5">2</textarea>' \
  | xmllint --xpath '//textarea[@name="command"]/text()' --html -

# result:
# 1
# 2

However, if you're using anything older than that, and don't want to build libxml from source just to get the fixed xmllint, you'll need one of the other workarounds here. As of this writing, the latest CentOS 8, for example, is still using a version of libxml (2.9.7) that behaves the way the OP describes.
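To decide whether a workaround is needed, the version banner can be checked in a script. This is a minimal sketch, assuming the banner looks like the "xmllint: using libxml version 20910" line quoted in the comments; the function name has_newline_fix is made up for illustration, and the dash-suffix stripping is a precaution rather than a documented format.

```shell
#!/bin/bash
# Succeeds if the libxml version in the given banner line is >= 20909 (2.9.9).
# has_newline_fix is a hypothetical helper, not part of xmllint.
has_newline_fix() {
    local banner=$1 version
    version=${banner##* }    # last space-separated field, e.g. 20910
    version=${version%%-*}   # drop anything after a dash (assumed possible suffix)
    [[ $version =~ ^[0-9]+$ ]] && (( version >= 20909 ))
}

# Real usage (needs xmllint; --version writes to stderr):
#   has_newline_fix "$(xmllint --version 2>&1 | head -n 1)" && echo 'newline fix present'

has_newline_fix 'xmllint: using libxml version 20910' && echo 'fixed'
has_newline_fix 'xmllint: using libxml version 20907' || echo 'needs workaround'
```

A banner that does not end in a number is treated as "needs workaround", which seems the safer default.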

As I gather from this SO answer, it's theoretically possible to feed a command into the --shell option of older (<2.9.9) versions of xmllint, and this will produce each node on a separate line. However, you end up having to post-process it with sed or grep to remove the visual detritus of shell mode's (human-oriented) output. It's not ideal.
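A rough sketch of that post-processing step, using the same sed filter as the wrapper script posted in another answer on this page. The sample shell-mode output below is an assumption about what old xmllint versions print, included only so the filter can be tried without xmllint installed; strip_shell_noise is a made-up name.

```shell
#!/bin/bash
# Drop the first and last lines (prompts), the " -------" separators, and blank lines.
strip_shell_noise() {
    sed '1d;$d;s/^ ------- *$//;/^$/d'
}

# Real usage with an old xmllint (file f assumed to hold the HTML):
#   xmllint --html --shell f <<< 'cat //textarea[@name="command"]/text()' 2>/dev/null \
#     | strip_shell_noise

# Assumed sample of shell-mode output, for illustration only:
sample=$'/ >  -------\n1\n -------\n2\n/ > '
printf '%s\n' "$sample" | strip_shell_noise
```

Note that the filter also removes blank lines that belong to the node text itself, which is one reason this approach is fragile.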

XMLStarlet, if available, offers another solution, but you do need to use xmlstarlet fo to format your HTML fragment into valid XML before using xmlstarlet sel to extract nodes:

echo \
'<textarea name="command" class="setting-input fixed-width"
 rows="9">1</textarea>
<textarea name="command" class="setting-input fixed-width"
 rows="5">2</textarea>' \
  | xmlstarlet fo -H -R \
  | xmlstarlet sel -T -t -v '//textarea[@name="command"]' -n

If the Attempt to load network entity message from the second xmlstarlet invocation annoys you, just add 2>/dev/null at the very end to suppress it (at the risk of suppressing other messages printed to standard error).

The XMLStarlet options explained (see also the user's guide):

fo -H -R — format the output, expecting HTML input, and recovering as much bad input as possible
- this will add an <html> root node, making the fragment in the OP's example valid XML
sel -T -t -v //xpath -n — select nodes based on XPath //xpath
- output plain text (-T) instead of XML
- using the given template (-t) that returns the value (-v) of the node rather than the node itself (allowing you to forgo using text() in the XPath expression)
- finally, add a newline (-n)

Edit(s): Removed half-implemented xmllint --shell solution because it was just bad. Added an XMLStarlet example that actually works with the OP's data.

Rabid answered 17/1, 2018 at 23:15 Comment(5)

The link provided points at the accepted answer on this page. – Actiniform 9/7, 2019 at 11:48

Ha. Good catch. Something else must've been in my clipboard. Fixed now. :) – Rabid 12/7, 2019 at 17:53

If you're parsing someone else's HTML, it's also worth noting that "well-formed" (X)HTML of the variety that xmllint and xmlstarlet can parse without errors seems to be ... a rarity these days. You could try xmllint --html as suggested here, which is slightly more forgiving of the input format. Sometimes even that doesn't work, and I'll take the input HTML one pass through tidy first. Or give up and use a regex, at which point I've got two problems. – Rabid 12/7, 2019 at 17:58

Is it broken again in v2.9.10 ? I have xmllint: using libxml version 20910 and I'm not getting newlines. – Protactinium 5/7, 2021 at 18:29

@Protactinium With the OP's test data, I get the newlines as expected using both 2.9.10 in Debian Bullseye and 2.9.13 from MacPorts. The code modifications to add the newlines are still in there, so it's difficult to surmise what might be happening on your end without more information. – Rabid 9/4, 2023 at 14:18

Try this patch, which provides 2 options:

--xpath: same as old --xpath, with nodes separated by \n.
--xpath0: same as old --xpath, with nodes separated by \0.

Test input (a.html):

<textarea name="command" class="setting-input   fixed-width" rows="9">1</textarea><textarea name="command" class="setting-input   fixed-width" rows="5">2</textarea>

Test command 1:

# xmllint --xpath '//textarea[@name="command"]/text()' --html a.html

Test output 1:

 1
 2

Test command 2:

# xmllint --xpath0 '//textarea[@name="command"]/text()' --html a.html | xargs -0 -n1

Test output 2:

 1
 2

Tacmahack answered 30/7, 2018 at 14:41 Comment(5)

it would be great to have this feature merged – Fred 30/7, 2018 at 18:57

@AdamSiemion Not sure if they have rw access to their gnome git repo. If they host their source on github I'd be happy to send a pull request. Plus, need someone from their team to do some sanity check. – Tacmahack 30/7, 2018 at 19:13

@Tacmahack Your merge request is just languishing there, still open, but your --xpath fixes to add a newline were basically implemented for v2.9.9. So thanks! – Rabid 5/2, 2020 at 23:53

@TheDudeAbides Thanks for the reminder. That change hardcoded \n in strings, which makes separating with \0 almost impossible. So this patch can no longer be merged and I won't rebase. I'll just leave it there in case someone doesn't need the latest features but \0. – Tacmahack 6/2, 2020 at 5:13

@Tacmahack Bummer. Your efforts are appreciated, nonetheless. – Rabid 17/2, 2020 at 20:13

I used the following ugly trick; please feel free to provide a better solution.

Changed the HTML code by replacing </textarea> with \n</textarea> using the following command:

sed 's/<\/textarea/\'$'\n''<\/textarea/g' f

Fred answered 30/8, 2013 at 13:5 Comment(2)

You can use other characters as separators for sed, e.g. %, so you don't need to escape the slash. – Stadiometer 26/2, 2016 at 14:23
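The separator suggestion above can be sketched as follows. This is only an illustration on a shortened version of the OP's input (the class and rows attributes are dropped for brevity); it uses a backslash followed by a literal newline in the replacement, which should work in both GNU and BSD sed.

```shell
#!/bin/bash
html='<textarea name="command">1</textarea><textarea name="command">2</textarea>'

# % is the s-command delimiter, so the slash in </textarea needs no escaping.
# The backslash followed by a literal newline puts a newline into the replacement.
printf '%s\n' "$html" | sed 's%</textarea%\
</textarea%g'
```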

If it's ugly, don't post it at all. 'sed' is not an XML parser – Batchelder 27/6, 2020 at 9:50

Below is a wrapper script written specifically to produce newline-delimited output (for old releases of xmllint).

Create a file xmllint2.sh with the contents below. Then execute chmod u+x xmllint2.sh, and finally run it like this (assuming the HTML is in file f):

./xmllint2.sh --html --xpath '//textarea[@name="command"]/text()' f 2>/dev/null

(the last part of the command hides the warnings that occur with HTML input)

#!/bin/bash

# wrapper script to
# - have newline-delimited output on XPath queries
# - implement --xpath on very old releases

# exit status 0 is taken to mean that this xmllint knows the --xpath option
/usr/bin/xmllint --xpath &>/dev/null
implements_xpath=$?

newlines_delimited_xmllint_version=20909
current_version=$(xmllint --version |& awk 'NR==1{print $NF;exit}')
current_version=${current_version%%-*}   # drop any suffix after the number

args=( "$@" )
if [[ $* == *--xpath* ]]; then
    # iterate over positional parameters
    for ((i=0; i<${#args[@]}; i++)); do
        if [[ ${args[i]} == --xpath ]]; then
            xpath="${args[i+1]}"
            unset 'args[i+1]'
            unset 'args[i]'
            break
        fi
    done
    # assumption: the input file is the last remaining argument
    file="${args[@]: -1}"
    # numeric comparison; a string test here would always be true
    if (( implements_xpath == 0 && current_version >= newlines_delimited_xmllint_version )) \
        || [[ $file == - || $file == /dev/stdin || $xpath == / || $xpath == string\(* ]]
    then
        exec /usr/bin/xmllint "$@"
    else
        exec /usr/bin/xmllint "${args[@]}" --shell <<< "cat $xpath" | sed '1d;$d;s/^ ------- *$//;/^$/d'
    fi
else
    exec /usr/bin/xmllint "$@"
fi

Check latest revision: https://github.com/sputnick-dev/xmllint

As of June 29, 2020, Debian Buster has version 2.9.4, which is 4 years old.
Debian testing/experimental has 2.9.10, which includes the fix.

Another way to install 2.9.10 with Debian last stable: https://serverfault.com/a/1022826/120473 (without taking the risk of crashing the apt system)

Trona answered 28/6, 2020 at 23:13 Comment(2)

I tried 20910 and it doesn't appear to have the newline fix but I edited your script to remove the version check and was able to carry on with what I'm doing. – Protactinium 5/7, 2021 at 18:35

I may be missing something, @Gilles-Quenot, but I can't see how $file is assigned. Will those checks always be missed? – Charge 17/8, 2022 at 16:44

Newlines can legitimately appear in xml data. A more robust approach would be to delimit xpath results by a character that is guaranteed to not occur in XML data. The Null character, U+0000 in the Universal Coded Character Set, is one such character.

Note that the code point U+0000, assigned to the null control character, is the only character encoded in Unicode and ISO/IEC 10646 that is always invalid in any XML 1.0 and 1.1 document.
– https://en.wikipedia.org/wiki/Valid_characters_in_XML
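A small sketch of why the delimiter matters, using only bash builtins and tr. The two sample strings stand in for XPath results that contain embedded newlines; emit_results is a made-up helper, and mapfile -d needs bash 4.4 or later.

```shell
#!/bin/bash
# Two sample results, each containing an embedded newline, emitted NUL-terminated.
emit_results() {
    printf '%s\0' $'1 \n newline' $'2 \n newline'
}

# Reading newline-delimited data splits each result in two.
mapfile -t by_line < <(emit_results | tr '\0' '\n')

# Reading NUL-delimited data keeps each result whole.
mapfile -t -d '' by_nul < <(emit_results)

echo "newline-delimited: ${#by_line[@]} items"
echo "NUL-delimited: ${#by_nul[@]} items"
```

With newline as the delimiter, the consumer has no way to tell a record boundary from a newline inside a node's text; with NUL the boundaries are unambiguous.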

@Cyker's merge request for xmllint included the addition of an --xpath0 option that would delimit xpath results by NUL. A new feature request for this functionality was opened as well.

Hopefully, xmllint will gain this feature soon.

xmlstarlet

In the meantime, another xpath command line tool, xmlstarlet, can be coaxed into achieving this goal now. xmlstarlet does not currently support output of NULs directly, but we can make it output U+FFFF, which, like NUL, is guaranteed to not occur in XML data (source). We then just need to translate U+FFFF to U+0000 and we'll have NUL delimited xpath results.

In the following examples, I'll use the following partial html file. It's the same example from the OP's question, except I added newlines for testing purposes.

cat >data.html <<'EOF'
<textarea name="command" class="setting-input fixed-width" rows="9">1 
 newline</textarea>
<textarea name="command" class="setting-input fixed-width" rows="5">2 
 newline</textarea>
EOF

Here is how to use xmlstarlet and sed to delimit the xpath results with NULs:

xmlstarlet fo -H -R data.html \
| xmlstarlet sel -t -m '//textarea[@name="command"]' -v '.' -o $'\uffff' \
| sed s/$'\uFFFF'/\\x00/g

(perl could be used instead of sed, if you prefer: perl -CS -0xFFFF -l0 -pe '')

Note: I ran the HTML through xmlstarlet fo -H -R as shown in @TheDudeAbides answer.

Now that the xpath results are delimited by NULs, we can process the results with the help of xargs -0. Example:

xmlstarlet fo -H -R data.html \
| xmlstarlet sel -t -m '//textarea[@name="command"]' -v '.' -o $'\uffff' \
| sed s/$'\uFFFF'/\\x00/g \
| xargs -0 -n 1 printf '%q\n'

Result:

'1 '$'\n'' newline'
'2 '$'\n'' newline'

or load it into a bash array:

mapfile -t -d '' a < <(
 xmlstarlet fo -H -R data.html \
 | xmlstarlet sel -t -m '//textarea[@name="command"]' -v '.' -o $'\uffff' \
 | sed s/$'\uFFFF'/\\x00/g
)

declare -p a

Result:

declare -a a=([0]=$'1 \n newline' [1]=$'2 \n newline')

saxon

Same technique using saxon instead of xmlstarlet:

xmllint --html data.html --dropdtd --xmlout \
| java -cp "$CP" net.sf.saxon.Query -s:- -qs:'//textarea[@name="command"]' !method=text !item-separator=$'\uFFFF' \
| sed s/$'\uFFFF'/\\x00/g \
| xargs -0 -n 1 printf '%q\n'

Docile answered 9/7, 2021 at 1:38 Comment(0)
