Speeding up xpath

Asked 2/3, 2010 at 17:24 Answered 27/2, 2013 at 0:55

I have a 1000 entry document whose format is something like:

<Example>
     <Entry>
          <n1></n1>
          <n2></n2>
      </Entry>
      <Entry>
          <n1></n1>
          <n2></n2>
      </Entry>
      <!--and so on-->

There are more than 1000 Entry nodes here. I am writing a Java program which basically gets all the node one by one and do some analyzing on each node. But the problem is that the retrieval time of the nodes increases with its no. For example it takes 78 millisecond to retrieve the first node 100 ms to retrieve the second and it keeps on increasing. And to retrieve the 999 node it takes more than 5 second. This is extremely slow. We would be plugging this code to XML files which have even more than 1000 entries. Some like millions. The total time to parse the whole document is more than 5 minutes.

I am using this simple code to traverse it. Here nxp is my own class which has all the methods to get nodes from xpath.

nxp.fromXpathToNode("/Example/Entry" + "[" + i  + "]", doc);

and doc is the document for the file. i is the no of node to retrieve.

Also when i try something like this

List<Node> nl = nxp.fromXpathToNodes("/Example/Entry",doc);  
      content = nl.get(i);

I face the same problem.

Anyone has any solution on how to speed up the tretirival of the nodes, so it takes the same amount of time to get the 1st node as well as the 1000 node from the XML file.

Here is the code for xpathtonode.

public Node fromXpathToNode(String expression, Node context)  
{  
    try  
    {  
        return (Node)this.getCachedExpression(expression).evaluate(context, XPathConstants.NODE);  
    }  
    catch (Exception cause)  
    {  
        throw new RuntimeException(cause);  
    }  
}

and here is the code for fromxpathtonodes.

public List<Node> fromXpathToNodes(String expression, Node context)  
{  
    List<Node> nodes = new ArrayList<Node>();  
    NodeList results = null;  
    
    try  
    {  
        results = (NodeList)this.getCachedExpression(expression).evaluate(context, XPathConstants.NODESET);  
          
        for (int index = 0; index < results.getLength(); index++)  
        {  
            nodes.add(results.item(index));  
        }  
    }  
    catch (Exception cause)  
    {  
        throw new RuntimeException(cause);  
    }  
    
    return nodes;  
}

and here is the starting

public class NativeXpathEngine implements XpathEngine  
{      
private final XPathFactory factory;  
  
private final XPath engine;  

/**
 * Cache for previously compiled XPath expressions. {@link XPathExpression#hashCode()}
 * is not reliable or consistent so use the textual representation instead.
 */  
private final Map<String, XPathExpression> cachedExpressions;  
  
public NativeXpathEngine()  
{
    super();  
    
    this.factory = XPathFactory.newInstance();  
    this.engine = factory.newXPath();  
    this.cachedExpressions = new HashMap<String, XPathExpression>();  
}

Cobaltic answered 2/3, 2010 at 17:24 Comment(5)

The code in fromXpathToNode and fromXpathToNodes would seem to be quite relevant here. Can you provide that code? – Bick 2/3, 2010 at 17:32

need to see your code that loads the doc. – Starryeyed 2/3, 2010 at 17:33

If you're going to hit every entry, why use XPath? – Nicaragua 2/3, 2010 at 18:8

what do you mean? What to use theN? – Cobaltic 2/3, 2010 at 18:18

I think he means that in this case, why not just iterate directly over the DOM node child lists. – Kapellmeister 2/3, 2010 at 18:25

Try VTD-XML. It uses less memory than DOM. It is easier to use than SAX and supports XPath. Here is some sample code to help you get started. It applies an XPath to get the Entry elements and then prints out the n1 and n2 child elements.

final VTDGen vg = new VTDGen();
vg.parseFile("/path/to/file.xml", false);

final VTDNav vn = vg.getNav();
final AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/Example/Entry");
int count = 1;
while (ap.evalXPath() != -1) {
    System.out.println("Inside Entry: " + count);

    //move to n1 child
    vn.toElement(VTDNav.FIRST_CHILD, "n1");
    System.out.println("\tn1: " + vn.toNormalizedString(vn.getText()));

    //move to n2 child
    vn.toElement(VTDNav.NEXT_SIBLING, "n2");
    System.out.println("\tn2: " + vn.toNormalizedString(vn.getText()));

    //move back to parent
    vn.toElement(VTDNav.PARENT);
    count++;
}

Strike answered 10/8, 2010 at 12:18 Comment(2)

+1 for mentioning this awesome lib. I faced a similar problem by parsing some xPathExpressions which took near by 1min to complete. VTD-XML does the same job in 2seks. – Excommunication 11/11, 2010 at 8:52

The library mentioned is truly amazing. Depending on the use case, one has to check if shipping a code using this GPL'd library carries legal dependencies, since it builds on US patents 7133857, 7260652, and 7761459. – Dextran 10/2, 2017 at 15:4

The correct solution is to detach the node right after you call item(i), like so:

Node node = results.item(index)
node.getParentNode().removeChild(node)
nodes.add(node)

See XPath.evaluate performance slows down (absurdly) over multiple calls

Pressor answered 27/2, 2013 at 0:55 Comment(0)

I had similar issue with the Xpath Evaluation , I tried using CachedXPathAPI’s which is faster by 100X than the XPathApi’s which was used earlier. more information about this Api is provided here : http://xml.apache.org/xalan-j/apidocs/org/apache/xpath/CachedXPathAPI.html

Hope it helps. Cheers, Madhusudhan

Somato answered 12/3, 2012 at 12:5 Comment(0)

If you need to parse huge but flat documents, SAX is a good alternative. It allows you to handle the XML as a stream instead of building a huge DOM. Your example could be parsed using a ContentHandler like this:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;

public class ExampleHandler extends DefaultHandler2 {

    private StringBuffer chars = new StringBuffer(1000);

    private MyEntry currentEntry;
    private MyEntryHandler myEntryHandler;

    ExampleHandler(MyEntryHandler myEntryHandler) {
        this.myEntryHandler = myEntryHandler;
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        chars.append(ch);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ("Entry".equals(localName)) {
            myEntryHandler.handle(currentEntry);
            currentEntry = null;
        }
        else if ("n1".equals(localName)) {
            currentEntry.setN1(chars.toString());
        }
        else if ("n2".equals(localName)) {
            currentEntry.setN2(chars.toString());
        }
    }


    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        chars.setLength(0);
        if ("Entry".equals(localName)) {
            currentEntry = new MyEntry();
        }
    }
}

If the document has a deeper and more complex structure, you're going to need to use Stacks to keep track of the current path in the document. Then you should consider writing a general purpose ContentHandler to do the dirty work and use with your document type dependent handlers.

Mochun answered 2/3, 2010 at 19:54 Comment(0)

What kind of parser are you using?

DOM pulls the whole document in memory - once you pull the whole document in memory then your operations can be fast but doing so in a web app or a for loop can have an impact.

SAX parser does on demand parsing and loads nodes as and when you request.

So try to use a parser implementation that suits your need.

Colliery answered 2/3, 2010 at 17:33 Comment(2)

If he's planning to use this on a document with millions of entries, SAX is probably the better way to go. IMHO. – Thrombokinase 2/3, 2010 at 17:36

but why is it so slow. It should be fast for all the entry tags – Cobaltic 2/3, 2010 at 17:37

Use the JAXEN library for xpaths: http://jaxen.codehaus.org/

Ductile answered 2/3, 2010 at 17:43 Comment(0)

Recommended topics

Hot tags