XPath.evaluate performance slows down (absurdly) over multiple calls
Asked Answered
K

6

35

I am trying to use the javax.xml.xpath package to run XPath expressions on a document with multiple namespaces, and I'm having goofy performance problems.

My test document is pulled from a real, production example. It is about 600k of xml. The document is a fairly complex Atom feed.

I realize that what I'm doing with XPath could be done without. However, the same implementation on other, vastly inferior platforms performs absurdly better. Right now, rebuilding my system to not use XPath is beyond the scope of what I can do in the time that I have.

My test code is something like this:



void testXPathPerformance()
{
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder = factory.newDocumentBuilder();

    Document doc = builder.parse(loadTestDocument());

    XPathFactory xpf = XPathFactory.newInstance();
    XPath xp = xpf.newXPath();

    NamespaceContext names = loadTestNamespaces();
    //there are 12 namespaces in names.  In this example code, I'm using
    //'samplens' instead of the actual namespaces that my application uses
    //for simplicity.  In my real code, the queries are different text, but
    //precisely the same complexity.

    xp.setNamespaceContext(names);

    NodeList nodes = (NodeList) xp.evaluate("/atom:feed/atom:entry",
                     doc.getDocumentElement(), XPathConstants.NODESET);


    for(int i=0;i<nodes.getLength();i++)
    {
        printTimestamp(1);
        xp.evaluate("atom:id/text()", nodes.item(i));
        printTimestamp(2);
        xp.evaluate("samplens:fieldA/text()", nodes.item(i));
        printTimestamp(3);
        xp.evaluate("atom:author/atom:uri/text()", nodes.item(i));
        printTimestamp(4);
        xp.evaluate("samplens:fieldA/samplens:fieldB/&at;attrC", nodes.item(i));
        printTimestamp(5);

        //etc.  My real example has 10 of these xp.evaluate lines

     }
}

When I run on a Nexus One, (not in the debugger, but with USB connected), the first time through the loop, each xp.evaluate takes somewhere from 10ms to 20ms. By the 15th time through the loop, each xp.evaluate takes somewhere from 200ms to 300ms. By the end of the loop (there are 150 items in nodes), it takes about 500ms-600ms for each xp.evaluate.

I've tried using xp.compile(). The compiles all take <5ms. I've done xp.reset() (makes no difference). I've done a new XPath object for each evaluate (adds about 4ms).

Memory usage does not appear to spiral out of control during execution.

I'm running this on a single thread in a JUnit test case that doesn't create an activity or anything.

I'm really puzzled.

Does anybody have any idea what else to try?

Thanks!

update

If I run the for loop backwards (for(int i=nodes.getLength()-1;i>=0;i--)), then the first few nodes take the 500ms-600ms, and the last ones go fast 10ms-20ms. So, this seems like it has nothing to do with the number of calls, but instead that expressions whose context is near the end of the document take longer than expressions whose context is near the beginning of the document.

Does anybody have any thoughts on what I can do about this?

Keratogenous answered 23/9, 2010 at 21:19 Comment(6)
@Andrew Shelansky: Did you try to run just one query ussing | union node set oparator? Result node set would be in document order.Antoine
@Andrew Shelansky: My guess would be that the NodeList being return by the XPath expression is evaluated lazily. So each time you do nodes.item(i) it is having to count through i items to find the node. Try storing the node in the variable at the start of the loop and see if that helps.Bookkeeper
@Nick Jones. In my test code, I'm doing lazy eval for nodes.item(i). In my production code, I'm actually iterating through nodes immediately after calling the first xp.evaluate. The resulting nodes get stored in a hashmap from UUID to Node, and evaluated that way. The production code exhibits the same problem. Good thought, though.Keratogenous
I cannot help, but wanted to commiserate that 'absurd' also describes my experience with trying to use the reference javax.xml.xpath in production. The only real solution for us was switching everything to Jaxen. Don't know if that's even possible on android :(January
@Alejandro: no, I didn't do that. I don't actually know ahead of time what the document order is going to be for the items that I want. As far as I can tell, though, the only important consideration for how long it will take to execute is how far down in the document the context node is.Keratogenous
I still haven't figured out why this is going on, other than to be certain that it is purely about how far from the top of the document the context node is. For my purposes, since I'm always working with fairly large documents, and never using XPaths that care about the parents/ancestors, I'm just calling cloneNode() before calling xp.evaluate. It runs about 800% faster. This is a terrible "solution" because I know that some day I'll have an expression that cares about the parent, but for now...Keratogenous
T
65

Try adding this code inside the loop at the top;

Node singleNode = nodes.item(i);
singleNode.getParentNode().removeChild(singleNode);

then run each evaluation using the singleNode variable instead of nodes.item(i); (of course you change the name)

Doing this detaches the node you are working with from the large main document. This will speed up the evaluate methods processing time by a huge amount.

EX:

for(int i=0;i<nodes.getLength();i++)
{
    Node singleNode = nodes.item(i);
    singleNode.getParentNode().removeChild(singleNode);

    printTimestamp(1);
    xp.evaluate("atom:id/text()", singleNode );
    printTimestamp(2);
    xp.evaluate("samplens:fieldA/text()", singleNode );
    printTimestamp(3);
    xp.evaluate("atom:author/atom:uri/text()", singleNode );
    printTimestamp(4);
    xp.evaluate("samplens:fieldA/samplens:fieldB/&at;attrC", singleNode );
    printTimestamp(5);

    //etc.  My real example has 10 of these xp.evaluate lines

 }
Talkie answered 27/1, 2012 at 0:21 Comment(4)
+1 for detachment tip. Improved my code from several minutes to less than 10 secs !Winser
Yes that makes a huge difference.Merat
I can't believe it works, but it does. In my case, rather than removing the node I cloned it and still saw a twenty-fold performance improvement.Inextensible
You sort of got me onto the right track. I did something similar to removing the node, I cloned it. It reduced my processing time from 12 minutes to 10 seconds. I'm not joking.Royall
F
22

Try cloning the node (so you won't have unnecessary references from its ancestors)

Node singleNode = nodes.item(i).cloneNode(true);

If you remove children, you will lose references and only get half of the nodes you want to process.

Frequentative answered 1/2, 2016 at 20:6 Comment(4)
I used this for parsing incoming messages where the obvious way was hopelessly inadequate. The speed increase is as ridiculous as it is unexpected.Cabinetwork
This is the absolute fix!!! I ran a xpath on 45K child nodes and it took 50 minutes! after using clone it only took 5 seconds. This is an absolute bizarre from java implementation.Strath
No changes observed in my case.Selfimprovement
NodeList result = (NodeList) expr.evaluate(inputStr, XPathConstants.NODESET); for (int index = 0; index < result.getLength(); index++) { Node node = result.item(index).cloneNode(true); String value = node.getTextContent(); xpathOut.add(value); } Any thing wrong here?Selfimprovement
J
17

This seems to be another case where using XPath appears to be slow but instead of XPath, the reason is probably caused by DOM method nodelist.item(i)

The default implementation of NodeList in Java has certain features:

  1. It is evaluated lazily
  2. The DOM list is live
  3. It is implemented as a linked list
  4. The list has some caching

When you look at those features separately, you could wonder why should the result object of an XPath expression have a feature like that, but they make more sense when you put them together.

1) Lazy evaluation might blur the the location of a performance bottleneck. Because of it, returning the NodeList appears to be fast but if the task is to always iterate through the list, it more or less just defers the performance cost. Lazy evaluation becomes costly, if the evaluation of the whole list must be processed again every time when the next item in the list is read.

2) NodeList being a "live" list means that it is updated and refers to nodes that are currently in the document tree, not to nodes that were in the tree when the list was initially constructed or to clones of those nodes. This is an important feature to grasp for DOM beginners. For example, if you select a NodeList of sibling elements and try to add one new sibling element to each node, taking a step to item(i+1) will always reach the latest added node and the loop will never finish.

3) The list being live also gives some explanation why it is implemented as a linked list (or AFAIK the actual implementation is a doubly linked list). The effect of this can clearly be seen on your test where accessing the last elements is always the slowest, whether you iterate it through backwards or forwards.

4) Because of the caching, looping over a single list while not causing any changes to the tree should be fairly efficient, if the cache stays clean. In some versions of Java there has been problems with this caching. I have not investigated what all procedures invalidate caching but probably the safest bets would be to advice to keep the evaluated expression the same, make no changes to tree, loop over one list at a time, and always step to next or previous list item.

Real performance wins depend on the use case, of course. Instead of just tweaking the list looping, you should try getting rid of looping a live list altogether - at least for reference. Cloning makes the list not live. Direct access to nodes can be achieved by copying the nodes to an array. If the structure is suitable, you can also use other DOM methods like getNextSibling() which said to give more effective results than looping over a NodeList.

Jolson answered 21/12, 2011 at 20:35 Comment(1)
Great answer. I'd love to see some code examples - how do you clone a node list, what's the quickest way to turn it into an array of nodes, etc?Fatma
T
3

It because nodes.getLength() take long time, just make it out of for loop

int nodes_len=nodes.getLength();
for(int i=0;i<nodes_len;i++)
{
//your code here
}
Talca answered 2/11, 2022 at 7:41 Comment(0)
K
0

This is a little bit late, but I ran into the same situation, but it seemed like my document was so big that none of the other answers really resolved the problem.

Eventually, I found jaxen. Once I used it, the document which previously took 15 seconds to parse took mere milliseconds.

Jaxen is unfortunately rather badly documented, but worked quite well:

DOMXPath myXPath = new DOMXPath("atom:id/text()");
String myContent = myXPath.stringValueOf(myDocument);

The Java Doc can be found here http://jaxen.codehaus.org/apidocs/org/jaxen/dom/DOMXPath.html

Kaunas answered 24/3, 2015 at 0:59 Comment(1)
As of this writing, the links are dead.Kurtzman
G
0

Every time you take a Node from a Nodelist, seems that it keep references to whole structure of xml; for this reason when you navigate the node, the xpath process starts every time from root of xml, and for this reason, when you go down in the trhee it takes more time.

For this reason when you take a node, before navigate it, you have to cast in string by this method:

private String nodeToString(Node node) {
          StringWriter sw = new StringWriter();
          try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.transform(new DOMSource(node), new StreamResult(sw));
          } catch (TransformerException te) {
            System.out.println("nodeToString Transformer Exception");
          }
          return sw.toString();
        }

and then retransform it in a Element / Node:

String xml = nodeToString(node);

Element nodeNew =  DocumentBuilderFactory
        .newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xml.getBytes()))
        .getDocumentElement();

node = nodeNew;

In this way the new Element, lost all references to his ancestors, and will be use as a simple Node and not as a nested Node. Obviously this method is good only if you have to navigate deep into a node.

Goodfellowship answered 12/5, 2016 at 15:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.