SAX Parser characters method doesn't collect all content

I'm using a SAX parser to parse XML, and it is working fine.

I have the tag below in my XML.

<value>•CERTASS >> Certass</value>

Here I expect '•CERTASS >> Certass' as output, but the code below returns only 'Certass'. Is there an issue with the special characters in the value tag?

public void characters(char[] buffer, int start, int length) {
    temp = new String(buffer, start, length);
}
Corney answered 22/7, 2015 at 16:29 Comment(0)

It is not guaranteed that the characters() method will run only once inside an element.

If you store the content in a String and the characters() method happens to run twice, you only get the content from the second run: the second call overwrites what the first call stored in your temp variable.

To remedy this, use a StringBuilder and append() the contents in characters() and then process the contents in endElement(). For example:

 DefaultHandler handler = new DefaultHandler() {
     private StringBuilder stringBuilder;

     @Override
     public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
         // start a fresh buffer for each element
         stringBuilder = new StringBuilder();
     }

     @Override
     public void characters(char[] buffer, int start, int length) {
         // may be called several times per element, so append rather than assign
         stringBuilder.append(buffer, start, length);
     }

     @Override
     public void endElement(String uri, String localName, String qName) throws SAXException {
         // the element's text is complete only here
         System.out.println(stringBuilder.toString());
     }
 };

Parsing the String "<value>•CERTASS >> Certass</value>" with the handler above gives the output:

?CERTASS >> Certass
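
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The class name SaxTextDemo and the helper parseText are made up for illustration, and the XML is read from a String rather than a file, which is an assumption about your setup.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextDemo {

    // Hypothetical helper: returns the text of the most recently started element.
    static String parseText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // reset the buffer whenever a new element starts
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // append every chunk; the parser may deliver the text in pieces
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // wrapped only to keep the sketch short
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // \u2022 is the bullet character from the question
        System.out.println(parseText("<value>\u2022CERTASS >> Certass</value>"));
    }
}
```

The returned String contains the bullet. If it still prints as `?`, that is probably the console's character encoding rather than the parser.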

I hope this helps.

Traynor answered 22/7, 2015 at 17:35 Comment(2)
Thanks for your answer. I have <value>wuch >> such PassPlus, >> Pass Plus, </value>. Will this logic handle it? Also, can I use a DOM parser to avoid such issues? - Corney
Actually, a DOM parser won't even let you parse a document with unescaped special characters. I'm glad I could help. - Traynor

I ran into this problem the other day. It turns out the characters() method is called multiple times when the value contains any of these characters:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;

Also be careful about line breaks within the value! If the XML is line-wrapped outside your control, the characters() method will also be called for each line of the value, and the content will include the line break, which you then have to strip out manually.

A sample handler that takes care of all these problems:

 DefaultHandler handler = new DefaultHandler() {
   private boolean isInMyTag = false;
   private String localName;
   private StringBuilder elementContent;

   @Override
   public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
     if (qName.equalsIgnoreCase("myfield")) {
       isInMyTag = true;
       this.localName = localName;
       this.elementContent = new StringBuilder();
     }
   }

   @Override
   public void characters(char[] buffer, int start, int length) {
     if (isInMyTag) {
       String content = new String(buffer, start, length);
       if (content.startsWith("\n")) {
         // remove leading newline
         elementContent.append(content.substring(1));
       } else {
         elementContent.append(content);
       }
     }
   }

   @Override
   public void endElement(String uri, String localName, String qName) throws SAXException {
     if (qName.equalsIgnoreCase("myfield")) {
       isInMyTag = false;
       // do something with elementContent.toString()
       System.out.println(elementContent.toString());
       this.localName = "";
     }
   }
 };
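
The handler above strips a leading newline from each chunk. Where exactly the parser splits the text is up to the implementation, so another option is to collect every chunk untouched and clean up the whitespace once at the end. A rough sketch, assuming you want all runs of whitespace collapsed to a single space (SaxChunkDemo, chunks and normalized are made-up names):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxChunkDemo {

    // Records each characters() call as its own chunk, to see how the text was split.
    static List<String> chunks(String xml) {
        List<String> parts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                parts.add(new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // wrapped only to keep the sketch short
            throw new IllegalStateException("parse failed", e);
        }
        return parts;
    }

    // Joins the chunks first, then trims and collapses whitespace in one place.
    static String normalized(String xml) {
        return String.join("", chunks(xml)).trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // the number of chunks printed here depends on the parser implementation
        System.out.println(chunks("<value>a &amp; b</value>"));
        System.out.println(normalized("<value>\nline one\nline two\n</value>"));
    }
}
```

Joining before cleaning means it does not matter whether a line break arrives at the start of a chunk, at the end of one, or as a chunk of its own.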
Unpolled answered 24/10, 2019 at 9:6 Comment(0)
