xmllint: how to convert UTF-8 numeric references into characters

I'd like xmllint to output numeric character references (for example &#xE9;) as the UTF-8 characters they stand for.

To reproduce:

$ wget http://il.srgssr.ch/integrationlayer/1.0/ue/rts/video/play/4727630.xml
$ xmllint --xpath "/Video/AssetMetadatas/AssetMetadata/title/text()" 4727630.xml && echo
Le jardin apprivois&#xE9; - Entre pierre et bois

I'd like the output to be:

Le jardin apprivoisé - Entre pierre et bois

I've read the man page and tried different options, but nothing worked.

If possible, I'd like to achieve this using options from xmllint or, if that is not possible, with another command-line tool that is commonly found in Linux distributions.

Thanks!

Tartarous answered 4/2, 2015 at 18:0 Comment(0)

I understand that the question is a little outdated, but I came here from Google and want to share a possible answer for future visitors. You need to change the XPath expression slightly and use the string() function instead of text():

$ wget http://il.srgssr.ch/integrationlayer/1.0/ue/rts/video/play/4727630.xml
$ xmllint --xpath "string(/Video/AssetMetadatas/AssetMetadata/title)" 4727630.xml
Le jardin apprivoisé - Entre pierre et bois
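
To try the same query without downloading anything, here is a minimal sketch. The sample file contents and the get_title name are made up for illustration; only the XPath expression comes from the command above. It assumes xmllint is installed.

```shell
#!/usr/bin/env bash
# Stand-in for the downloaded file (made-up sample; the title is raw UTF-8).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<Video><AssetMetadatas><AssetMetadata>
<title>Le jardin apprivoisé - Entre pierre et bois</title>
</AssetMetadata></AssetMetadatas></Video>
EOF

# get_title FILE: hypothetical wrapper that prints the title's string value.
get_title() {
  xmllint --xpath 'string(/Video/AssetMetadatas/AssetMetadata/title)' "$1"
}
```

Calling get_title "$sample" should print the title with the accented character intact, since string() yields a plain string rather than a serialized node.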
Albarran answered 9/9, 2016 at 10:20 Comment(2)
Thank you. It was very useful! - Worrisome
Awesome, that solved my problem too, which had already caused me some headaches! Thanks! - Mucus

I have found another way that I think solves this problem completely. The trick is to use GNU recode to convert the output from HTML character references to UTF-8.

$ wget http://il.srgssr.ch/integrationlayer/1.0/ue/rts/video/play/4727630.xml
$ xmllint --xpath "/Video/AssetMetadatas/AssetMetadata/title/text()" 4727630.xml | recode html..utf8
Le jardin apprivoisé - Entre pierre et bois

On Debian-based systems, recode can be installed using apt-get install recode.

Pringle answered 9/11, 2020 at 15:10 Comment(0)

I'm running xmllint against HTML5 fragments that are not valid documents and have no charset declared in a head element. So I use cat to prepend, on the fly, the one line xmllint needs in order to read the input as UTF-8 and output it correctly as UTF-8:

echo '<meta charset="utf8">' | cat - fileWriteInUTF8.chunk | \
     xmllint --html --xpath 'string(//video/source/@src)' 2>/dev/null -

HTML5 content in fileWriteInUTF8.chunk:

<video>
 <source src="/path/to/content_with_accent-éàü.mp4">
</video>

Output after cat:

<meta charset="utf8">
<video>
 <source src="/path/to/content_with_accent-éàü.mp4">
</video>

I'm using 2>/dev/null to suppress the warnings xmllint prints about the invalid HTML. Use it with care, because it hides every other error message as well.

I know it's an ugly solution, but I haven't found a better one so far.

Eustatius answered 13/3, 2023 at 22:27 Comment(0)

How about good old sed and echo?

$ wget http://il.srgssr.ch/integrationlayer/1.0/ue/rts/video/play/4727630.xml
$ echo -e "$(xmllint --xpath "/Video/AssetMetadatas/AssetMetadata/title/text()" 4727630.xml | sed -E 's/&#x([0-9A-Fa-f]+);/\\u\1/g')"
Le jardin apprivoisé - Entre pierre et bois
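
A pipeline like this relies on the shell's echo -e understanding \u escapes. As an alternative sketch, the bash function below decodes hexadecimal and decimal references itself and leaves unrelated semicolons alone. The decode_ncr name is hypothetical (it is not part of xmllint or sed), and the function does no range validation on the code points.

```shell
#!/usr/bin/env bash
# decode_ncr STRING: hypothetical helper, not part of xmllint or sed.
# Replaces &#xHEX; and &#DEC; references with their UTF-8 bytes and
# leaves all other text (including named entities such as &amp;) untouched.
decode_ncr() {
  local s=$1 out='' ref code esc oct b char
  local -a bytes
  local re='&#(x[0-9A-Fa-f]+|[0-9]+);'
  while [[ $s =~ $re ]]; do
    ref=${BASH_REMATCH[0]}     # the whole reference, e.g. &#xE9;
    code=${BASH_REMATCH[1]}    # its number, e.g. xE9 or 233
    if [[ $code == x* ]]; then code=$((16#${code#x})); else code=$((10#$code)); fi
    # Encode the code point as one to four UTF-8 bytes.
    if   (( code < 0x80 ));    then bytes=( "$code" )
    elif (( code < 0x800 ));   then bytes=( $(( 0xC0 | code >> 6 )) $(( 0x80 | code & 0x3F )) )
    elif (( code < 0x10000 )); then bytes=( $(( 0xE0 | code >> 12 )) $(( 0x80 | code >> 6 & 0x3F )) $(( 0x80 | code & 0x3F )) )
    else bytes=( $(( 0xF0 | code >> 18 )) $(( 0x80 | code >> 12 & 0x3F )) $(( 0x80 | code >> 6 & 0x3F )) $(( 0x80 | code & 0x3F )) )
    fi
    # Turn the byte values into \0NNN octal escapes and let printf %b expand them.
    esc=''
    for b in "${bytes[@]}"; do printf -v oct '\\0%03o' "$b"; esc+=$oct; done
    printf -v char '%b' "$esc"
    out+=${s%%"$ref"*}$char    # text before the reference, then the decoded character
    s=${s#*"$ref"}             # continue with the text after the reference
  done
  printf '%s\n' "$out$s"
}
```

It could be used as decode_ncr "$(xmllint --xpath "/Video/AssetMetadatas/AssetMetadata/title/text()" 4727630.xml)".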
Suicidal answered 7/9, 2021 at 4:5 Comment(2)
You are providing an imperfect solution to a question asked 6.4 years ago which had a decent answer provided 5 years ago. It is unclear how your solution improves on the previous solution. Augmenting your answer with an explanation for why it is better than the previous answer would be useful. - Sanjuana
If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context. - Venator
