Newlines can legitimately appear in XML data. A more robust approach is to delimit XPath results with a character that is guaranteed not to occur in XML data. The null character, U+0000 in the Universal Coded Character Set, is one such character.
Note that the code point U+0000, assigned to the null control
character, is the only character encoded in Unicode and ISO/IEC 10646
that is always invalid in any XML 1.0 and 1.1 document.
– https://en.wikipedia.org/wiki/Valid_characters_in_XML
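To make the problem concrete, here is a small bash sketch that uses made-up values rather than real XPath output. It splits the same two multi-line values first on newlines and then on NULs. Note that mapfile -d assumes bash 4.4 or newer.

```shell
#!/usr/bin/env bash
# Two hypothetical XPath results, each containing an embedded newline.
# (Illustrative data only; not produced by xmllint or xmlstarlet.)

# Newline-delimited: embedded newlines are indistinguishable from delimiters.
mapfile -t by_newline < <(printf '%s\n' $'1\nnewline' $'2\nnewline')

# NUL-delimited: embedded newlines survive inside each element.
mapfile -t -d '' by_nul < <(printf '%s\0' $'1\nnewline' $'2\nnewline')

echo "newline-delimited: ${#by_newline[@]} elements"
echo "NUL-delimited:     ${#by_nul[@]} elements"
```

The newline-delimited read splits each value in two, while the NUL-delimited read keeps each value intact as a single array element.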
@Cyker's merge request for xmllint included the addition of an -xpath0 option that would delimit XPath results by NUL. A new feature request for this functionality was opened as well.
Hopefully, xmllint will gain this feature soon.
xmlstarlet
In the meantime, another XPath command-line tool, xmlstarlet, can be coaxed into achieving this goal now. xmlstarlet does not currently support output of NULs directly, but we can make it output U+FFFF, which, like NUL, is guaranteed to not occur in XML data (source). We then just need to translate U+FFFF to U+0000 and we'll have NUL-delimited XPath results.
In the following examples, I'll use this partial HTML file. It's the same example from the OP's question, except that I added newlines for testing purposes.
cat >data.html <<'EOF'
<textarea name="command" class="setting-input fixed-width" rows="9">1
newline</textarea>
<textarea name="command" class="setting-input fixed-width" rows="5">2
newline</textarea>
EOF
Here is how to use xmlstarlet and sed to delimit the XPath results with NULs:
xmlstarlet fo -H -R data.html \
| xmlstarlet sel -t -m '//textarea[@name="command"]' -v '.' -o $'\uffff' \
| sed s/$'\uFFFF'/\\x00/g
perl could be used instead of sed, if you prefer: perl -CS -0xFFFF -l0 -pe ''
Note: I ran the HTML through xmlstarlet fo -H -R as shown in @TheDudeAbides' answer.
Now that the XPath results are delimited by NULs, we can process the results with the help of xargs -0. Example:
xmlstarlet fo -H -R data.html \
| xmlstarlet sel -t -m '//textarea[@name="command"]' -v '.' -o $'\uffff' \
| sed s/$'\uFFFF'/\\x00/g \
| xargs -0 -n 1 printf '%q\n'
Result:
'1 '$'\n'' newline'
'2 '$'\n'' newline'
Or load the results into a bash array:
mapfile -t -d '' a < <(
xmlstarlet fo -H -R data.html \
| xmlstarlet sel -t -m '//textarea[@name="command"]' -v '.' -o $'\uffff' \
| sed s/$'\uFFFF'/\\x00/g
)
declare -p a
Result:
declare -a a=([0]=$'1 \n newline' [1]=$'2 \n newline')
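mapfile -d needs bash 4.4 or newer. On older bash versions, a while read -d '' loop is a common fallback. The sketch below shows the loop on synthetic NUL-delimited input; in practice the printf stand-in would be replaced by the xmlstarlet pipeline above.

```shell
#!/usr/bin/env bash
a=()
# IFS=   keeps leading/trailing whitespace in each item
# -r     keeps backslashes literal
# -d ''  makes read stop at NUL instead of newline
while IFS= read -r -d '' item; do
  a+=("$item")
done < <(printf '%s\0' $'1 \n newline' $'2 \n newline')  # stand-in for the real pipeline

declare -p a
```

Because the loop reads from a process substitution rather than a pipe, the array a is still populated after the loop finishes.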
saxon
The same technique works with saxon instead of xmlstarlet ($CP is assumed to hold the classpath to the Saxon JAR):
xmllint --html data.html --dropdtd --xmlout \
| java -cp "$CP" net.sf.saxon.Query -s:- -qs:'//textarea[@name="command"]' !method=text !item-separator=$'\uFFFF' \
| sed s/$'\uFFFF'/\\x00/g \
| xargs -0 -n 1 printf '%q\n'
xmllint --xpath '//textarea[@name="command"]/text()' --html f – Fred
xmllint. – Fred
--xpath now (finally) does the thing you expect when dumping XPath nodes. If you are stuck with an old libxml2, see my answer below for an alternative solution using XMLStarlet. – Rabid