Is there any XPath processor for SAX model?
Asked Answered
L

13

61

I'm looking for an XPath evaluator that doesn't rebuild the whole DOM document to look for the nodes of a document: actually the object is to manage a large amount of XML data (ideally over 2Gb) with SAX model, which is very good for memory management, and give the possibility to search for nodes.

Thank you all for the support!

For all those who say it's not possible: I recently, after asked the question, found a project named "saxpath" (http://www.saxpath.org/), but I can't find any implementing project.

Lovemaking answered 7/12, 2009 at 22:5 Comment(3)
What do you mean by 'not rebuild'? Do you want to stream data and have xpaths fire when they match?Lashawnda
You may want to investigate extended VTD-XML, which seems to be well suited for the situation you described: big xml + xpathSpleenful
code.google.com/p/xpath4saxOtis
A
18

My current list (compiled from web search results and the other answers) is:

The next step is to use the examples of XMLDog and compare the performance of all these approaches. Then, the test cases should be extended to the supported XPath expressions.

Arvonio answered 7/12, 2009 at 22:5 Comment(2)
Great list, saved me lots of time. Some remarks about them regarding their activity and ease of inclusion in a project: 1. xpath4sax: Not in maven. Last commit at 7 Feb 2013 google code repo, github repo 2. spex: Not in maven. Last commit at 9 jan 2006 cvs repo I'm also going for XMLDog which is both more active and available at maven repo.Kitsch
@KonstantinosFilios my answer is a community wiki - feel free to add the information directly in the answer.Arvonio
H
12

We regularly parse 1GB+ complex XML files by using a SAX parser which extracts partial DOM trees that can be conveniently queried using XPath. I blogged about it here: http://softwareengineeringcorner.blogspot.com/2012/01/conveniently-processing-large-xml-files.html - Sources are available on github - MIT License.

Hydrography answered 7/1, 2012 at 15:46 Comment(1)
The sources for the blogpost appear to be no longer in Github.Apophasis
A
9

XPath DOES work with SAX, and most XSLT processors (especially Saxon and Apache Xalan) do support executing XPath expressions inside XSLTs on a SAX stream without building the entire dom.

They manage to do this, very roughly, as follows :

  1. Examining the XPath expressions they need to match
  2. Receiving SAX events and testing if that node is needed or will be needed by one of the XPath expressions.
  3. Ignoring the SAX event if it is of no use for the XPath expressions.
  4. Buffering it if it's needed

How they buffer it is also very interesting, cause while some simply create DOM fragments here and there, others use very optimized tables for quick lookup and reduced memory consumption.

How much they manage to optimize largely depends on the kind of XPath queries they find. As the already posted Saxon documentation clearly explain, queries that move "up" and then traverse "horizontally" (sibling by sibling) the document obviously requires the entire document to be there, but most of them require just a few nodes to be kept into RAM at any moment.

I'm pretty sure of this because when I was still making every day webapp using Cocoon, we had the XSLT memory footprint problem each time we used a "//something" expression inside an XSLT, and quite often we had to rework XPath expressions to allow a better SAX optimization.

Ably answered 31/8, 2011 at 15:2 Comment(0)
C
6

SAX is forward-only, while XPath queries can navigate the document in any direction (consider parent::, ancestor::, preceding:: and preceding-sibling:: axis). I don't see how this would be possible in general. The best approximation would be some sort of lazy-loading DOM, but depending on your queries this may or may not give you any benefit - there is always a worst-case query such as //*[. != preceding::*].

Cradlesong answered 7/12, 2009 at 22:25 Comment(4)
I understand the point, I can also agree... but what if I want to avoid going back? I designed a robust RDBMS and I'm exploring the possibility to support XML types: it could be possible to set functions for evaluating forward-only selects. I found the project SAXPath, which is just an API: I'm trying to find an implementation to this.Lovemaking
In theory, you can have a subset of XPath that's forward-only, and make it run on top of SAX. Supposedly "streaming transforms" are part of the upcoming XSLT 2.1, and if I understand them correctly, they include such a subset of XPath. I know that Saxon already has "streaming" XSLT mode, and it also exposes the XPath API separately, so perhaps you should check to see if the latter also supports streaming. I've had a quick glance at Saxon docs, and it's not clear whether XPath API specifically has it; you might have to ask the author, Michael Kay, regarding that.Cradlesong
That doesn't seem impossible, as a simple solution would be to re-stream the file for every XPath expression's evaluation, for instance., and to implement xpath functions that manipulate hierarchy to use a similar approach. However I don't know of a lib doing that (I'm interested as well, obviously, that's why I'm here...). And that'd be dog-slow and awful, but wouldn't die on large files...Purebred
It would have to restream it for every _sub_expression that needs to backtrack... you could easily get O(n^2) here, and possibly worse. Note also that SAX processors generally feed you the stream elements as they parse them, so restreaming would also mean reparsing all that XML every time.Cradlesong
D
4

Sorry, a slightly late answer here - it seems that this is possible for a subset of XPath - in general it's very difficult due to the fact that XPath can match both forwards and backwards from the "current" point. I'm aware of two projects that solve it to some degree using state machines: http://spex.sourceforge.net & http://www.cs.umd.edu/projects/xsq. I haven't looked at them in detail but they seem to use a similar approach.

Ducharme answered 22/1, 2010 at 17:55 Comment(1)
Query rewriting to eliminate back references from XPath was an active research area some time ago (with SPEX and related work)Middlesworth
M
3

I'll toss in a plug for a new project of mine, called AXS. It's at https://code.google.com/p/annotation-xpath-sax/ and the idea is that you annotate methods with (forward-axis-only) XPath statements and they get called when the SAX parser is at a node that matches it. So with a document

<doc>
<nodes>
  <node name="a">text of node 1</node>
  <node name="b">text of node 2</node>
  <node otherattr="I have attributes!">text of node 3</node>
</nodes>
</doc>

you can do things like

@XPath("/nodes/node")
void onNode(String nodeText)
{
  // will be called with "text of node [123]"
}

or

@XPathStart("//node[@name='']")
void onNode3(Attrs node3Attrs) { ... }

or

@XPathEnd("/nodes/node[2]")
void iDontCareAboutNode3() throws SAXExpression
{
  throw new StopParsingExpression();
}

Of course, the library is so new that I haven't even made a release of it yet, but it's MIT licensed, so feel free to give it a try and see if it matches your need. (I wrote it to do HTML screen-scraping with low enough memory requirements that I can run it on old Android devices...) If you find bugs, please let me know by filing them on the googlecode site!

Moorings answered 11/7, 2013 at 22:55 Comment(0)
S
2

There are SAX/StAX based XPath implementations, but they only support a small subset of XPath expressions/axis largely due to SAX/StAX's forward only nature.. the best alternative I am aware of is extended VTD-XML, it supports full xpath, partial document loading via mem-map.. and a max document size of 256GB, but you will need 64-bit JVM to use it to its full potential

Spleenful answered 7/12, 2009 at 22:39 Comment(0)
V
1

What you could do is hook an XSL transformer to a SAX input source. Your processing will be sequential and the XSL preprocessor will make an attempt to catch the input as it comes to fiddle it into whatever result you specified. You can use this to pull a path's value out of the stream. This would come in especially handy if you wanted to produce a bunch of different XPATH results in one pass.

You'll get (typically) an XML document as a result, but you could pull your expected output out of, say, a StreamResult with not too much hassle.

Viccora answered 7/12, 2009 at 22:15 Comment(0)
F
1

Have a look at the streaming mode of the Saxon-SA XSLT-processor.

http://www.saxonica.com/documentation/sourcedocs/serial.html

"The rules that determine whether a path expression can be streamed are:

  • The expression to be streamed starts with a call on the document() or doc() function.
  • The path expression introduced by the call on doc() or document must conform to a subset of XPath defined as follows:

  • any XPath expression is acceptable if it conforms to the rules for path expressions appearing in identity constraints in XML Schema. These rules allow no predicates; the first step (but only the first) can be introduced with "//"; the last step can optionally use the attribute axis; all other steps must be simple Axis Steps using the child axis.

  • In addition, Saxon allows the expression to contain a union, for example doc()/(*/ABC | /XYZ). Unions can also be expressed in abbreviated form, for example the above can be written as doc()//(ABC|XYZ).
  • The expression must either select elements only, or attributes only, or a mixture of elements and attributes.

  • Simple filters (one or more) are also supported. Each filter may apply to the last step or to the expression as a whole, and it must only use downward selection from the context node (the self, child, attribute, descendant, descendant-or-self, or namespace axes). It must not be positional (that is, it must not reference position() or last(), and must not be numeric: in fact, it must be such that Saxon can determine at compile time that it will not be numeric). Filters cannot be applied to unions or to branches of unions. Any violation of these conditions causes the expression to be evaluated without the streaming optimization.

  • These rules apply after other optimization rewrites have been applied to the expression. For example, some FLWOR expressions may be rewritten to a path expression that satisfies these rules.

  • The optimization is enabled only if explicitly requested, either by using the saxon:stream() extension function, or the saxon:read-once attribute on anXSLT xsl:copy-of instruction, or the XQuery pragma saxon:stream. It is available only if the stylesheet or query is processed using Saxon-SA."

Note: It is most likely in the commercial version this facility is available. I've used Saxon extensively earlier, and it is a nice piece of work.

Fossick answered 7/12, 2009 at 23:16 Comment(3)
I don't see Saxonica meeting the requirement of processing docs > 2GB in sizeSpleenful
@Thorbjorn: this sounds promising but I also can't find what the question asker is asking for. Could you be more specific in your answer? Thanks.Nodical
The link is 404, I think, the current (not quite equal version) is available at: saxonica.com/documentation/sourcedocs/streaming/…Arvonio
B
0

Mmh I don't know if I really understand you. As far as I know, the SAX model is event oriented. That means, you do something if a certain node is encountered during the parsing. Yeah, it is better for memory but I don't see how you would like to get XPath into it. As SAX does not build a model, I don't think that this is possible.

Black answered 7/12, 2009 at 22:12 Comment(0)
H
-1

I don't think xpath works with SAX, but you might take a look at StAX which is an extended streaming XML API for Java.

http://en.wikipedia.org/wiki/StAX

Haukom answered 7/12, 2009 at 22:13 Comment(0)
C
-1

The standard javax xpath API technically already works with streams; javax.xml.xpath.XPathExpression can be evaluated against an InputSource, which in turn can be constructed with a Reader. I don't think it constructs a DOM under the covers.

Caryopsis answered 7/12, 2009 at 22:18 Comment(3)
InputSource is just a helper class to aggregate byte stream + encoding + DOCTYPE. When you feed it to an XPath processor, it will still construct a DOM internally.Cradlesong
Hmm, I was afraid of that. I suspect that stream-based XPath processing is simply too difficult a problem to solve.Caryopsis
Paver is right: the InputSource is just a source for retrieving the data, but then the document is built and this takes long long time and a lot of memory usage, not optimal for what I need to do.Lovemaking
C
-1

Did you have tried also QuiXPath https://code.google.com/p/quixpath/ ?

Cosgrave answered 9/2, 2015 at 13:26 Comment(1)
none of the QUIX* projects are really in any buildable form, the dependencies are hard to find, no build files, Why are these recommended if you never put them in a useable form?Tympanites

© 2022 - 2024 — McMap. All rights reserved.