How to do command line XPath queries in huge XML files?

I have a collection of XML files, and some of them are pretty big (up to ~50 million element nodes). I am using xmllint for validating those files, which works pretty nicely even for the huge ones thanks to the streaming API.

xmllint --loaddtd --stream --valid /path/to/huge.xml

I recently learned that xmllint is also capable of doing command line XPath queries, which is very handy.

xmllint --loaddtd --xpath '/root/a/b/c/text()' /path/to/small.xml

However, these XPath queries do not work for the huge XML files. I just receive a "Killed" message after some time. I tried to enable the streaming API, but this just leads to no output at all.

xmllint --loaddtd --stream --xpath '/root/a/b/c/text()' /path/to/huge.xml

Is there a way to enable streaming mode when doing XPath queries using xmllint? Are there other/better ways to do command line XPath queries for huge XML files?

Intend answered 18/5, 2015 at 14:21 Comment(7)
Try the --shell option for an interactive session (pass just the XML file path). (Hendricks)
I tried opening the interactive shell for a huge file, but it crashes ("Killed", just as it does without --stream) before I can enter any command. (Intend)
superuser.com/questions/543881/… (Phosphorylase)
Attaching a sample XML file would help. I, for one, have no idea what "large" might mean in your case. (Vining)
Think of something like the dblp XML dump (dblp.dagstuhl.de/xml). I receive the "Killed" error when parsing that file in a non-streaming context. But my question is aimed at essentially any file big enough that you would be ill-advised to build a DOM in main memory and should use a streaming approach instead. (Intend)
What about using XSLT 3.0 streaming functions for that? It could be more predictable and safer. (Floccus)
Internally, libxml2 has some support for streaming XPath expressions, but xmllint (the command-line interface to libxml2) doesn't support the combination of --xpath and --stream. (Moulin)

If your XPath expressions are very simple, try xmlcutty.

From the homepage:

xmlcutty is a simple tool for carving out elements from large XML files, fast. Since it works in a streaming fashion, it uses almost no memory and can process around 1G of XML per minute.
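If you would rather not install another tool, the same streaming idea can be sketched with Python's standard library. This is only a rough illustration, not what xmlcutty does internally: the function name stream_text is made up for this example, it handles only plain absolute paths like /root/a/b/c (no predicates, wildcards, or namespaces), and it returns only the text that precedes an element's first child. It also does not load the DTD, so files that rely on DTD-defined entities (the dblp dump may be one) would probably need extra handling.

```python
import io
import xml.etree.ElementTree as ET


def stream_text(source, path):
    """Yield the text of each element matching a simple absolute path.

    `source` is a filename or a binary file object; `path` looks like
    '/root/a/b/c'. Hypothetical helper for illustration only.
    """
    target = [step for step in path.split("/") if step]
    tags = []   # tag names of the currently open elements, root first
    elems = []  # the matching Element objects, same order as `tags`
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            tags.append(elem.tag)
            elems.append(elem)
            continue
        # "end" event: `elem` is the innermost open element.
        if tags == target and elem.text:
            yield elem.text
        tags.pop()
        elems.pop()
        if elems:
            # Detach the finished element from its parent so the
            # partial tree does not grow with the size of the file.
            elems[-1].remove(elem)


if __name__ == "__main__":
    # Tiny in-memory stand-in for /path/to/huge.xml.
    sample = io.BytesIO(b"<root><a><b><c>one</c><c>two</c></b></a></root>")
    for text in stream_text(sample, "/root/a/b/c"):
        print(text)
```

For a real file you would pass the path instead of the in-memory sample, e.g. stream_text("/path/to/huge.xml", "/root/a/b/c"). Finished elements are detached as soon as they close, so memory use should stay roughly proportional to the nesting depth rather than the file size.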

Cheesy answered 28/10, 2016 at 8:48 Comment(1)
A command like xmllint --loaddtd --xpath '/root/a/b/c/text()' /path/to/small.xml would translate into xmlcutty -path '/root/a/b/c' -rename '\n' /path/to/small.xml, where -rename renames the last enclosing element and thus simulates text(). The syntax is a bit arcane. (Lanark)

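Changing the ulimit settings might work. Try this, which sets the soft virtual-memory limit for the current shell (the value is in kilobytes in bash):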

$ ulimit -Sv 500000
$ xmllint (...your command)
Lopes answered 19/2, 2018 at 12:5 Comment(0)
