Reading a big chunk of XML data from a socket and parsing it on the fly

I am working on an Android client that reads a continuous stream of XML data from my Java server over a TCP socket. The server sends a '\n' character as a delimiter between consecutive responses. A model of the stream is given below:

<response1>
   <datas>
      <data>
           .....
           .....
      </data>
      <data>
           .....
           .....
      </data>
      ........
      ........
   </datas>
</response1>\n    <!-- the '\n' acts as the delimiter -->
<response2>

   <datas>
      <data>
           .....
           .....
      </data>
      <data>
           .....
           .....
      </data>
      ........
      ........
   </datas>
</response2>\n

Well, I hope the structure is clear now. The responses are transmitted from the server zlib-compressed, so I have to inflate whatever I read from the server, split it into responses on the delimiter, and parse each one. I am using SAX to parse the XML.

Now my main problem is that the XML responses coming from the server can be very large (in the range of 3 to 4 MB). So:

  • To separate responses on the delimiter ('\n'), I have to accumulate each response block in a StringBuilder as it is read from the socket, and on some phones a StringBuilder cannot hold strings in the megabyte range. It throws an OutOfMemory exception, and from threads like this I learned that keeping large strings around (even temporarily) is not a good idea.

  • Next I tried passing the inflater read stream (which in turn reads from the socket input stream) directly as the input stream of the SAX parser (without separating the XML myself, relying on SAX's ability to find the end of the document from the tags). This time one response gets parsed successfully, but on hitting the '\n' delimiter SAX throws an ExpatParserParseException saying junk after document element.

  • After catching that ExpatParserParseException I tried to read again, but on throwing the exception the SAX parser closes the stream, so when I try to read/parse again I get an IOException saying the input stream is closed.

A code snippet of what I have done is given below (all unrelated try/catch blocks removed for clarity).

private Socket clientSocket = null;
private DataInputStream readStream = null;
private DataOutputStream writeStream = null;
private StringBuilder incompleteResponse = null;
private AppContext context = null;

public boolean connectToHost(String ipAddress, int port, AppContext myContext) {
    context = myContext;

    InetAddress serverAddr = InetAddress.getByName(ipAddress);
    clientSocket = new Socket(serverAddr, port);

    // If connected, create the read and write stream objects.
    // The InflaterInputStream inflates the zlib-compressed socket data on the fly.
    readStream = new DataInputStream(new InflaterInputStream(clientSocket.getInputStream()));
    writeStream = new DataOutputStream(clientSocket.getOutputStream());

    Thread readThread = new Thread() {
        @Override
        public void run() {
            ReadFromSocket();
        }
    };
    readThread.start();
    return true;
}


public void ReadFromSocket() {
    while (true) {
        InputSource xmlInputSource = new InputSource(readStream);
        SAXParserFactory spf = SAXParserFactory.newInstance();
        try {
            SAXParser sp = spf.newSAXParser();
            XMLReader xr = sp.getXMLReader();
            // 'website' is a field of the enclosing class (declaration omitted).
            ParseHandler xmlHandler = new ParseHandler(context.getSiteListArray().indexOf(website), context);
            xr.setContentHandler(xmlHandler);
            xr.parse(xmlInputSource);
            // postSuccessfullParsingNotification();
        } catch (SAXException e) {
            e.printStackTrace();
            postSuccessfullParsingNotification();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
            postSocketDisconnectionBroadcast();
            break;
        } catch (IOException e) {
            e.printStackTrace();
            postSocketDisconnectionBroadcast();
            break;
        } catch (Exception e) {
            e.printStackTrace();
            postSocketDisconnectionBroadcast();
            break;
        }
    }
}

And now my questions are:

  1. Is there any way to make the SAX parser ignore the junk characters after one XML response, instead of throwing an exception and closing the stream?
  2. If not, is there any way to avoid the out-of-memory error with the StringBuilder? To be frank, I am not expecting a positive answer on this. Any workaround?
Marielamariele answered 16/8, 2011 at 5:42 Comment(0)
  1. You might be able to use a wrapper around the reader or stream you pass to the parser that detects the newline, then closes the current parse and launches a new parser that continues with the stream. Your stream is NOT valid XML, and you won't be able to parse it as you currently have it implemented. Take a look at http://commons.apache.org/io/api-release/org/apache/commons/io/input/CloseShieldInputStream.html.
  2. No.
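As a sketch of this suggestion, the loop below recreates the parser per document and shields the underlying stream from being closed. To keep the snippet dependency-free it uses a minimal stand-in for commons-io's CloseShieldInputStream; all class and method names here are illustrative. One caveat (noted in the comments): the parser may buffer bytes past the '\n' delimiter, so whether the next document survives depends on the parser's read-ahead behaviour.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RepeatedSaxDemo {

    // Minimal stand-in for commons-io's CloseShieldInputStream: it swallows
    // close(), so a failing parse cannot close the socket stream.
    static class CloseShield extends FilterInputStream {
        CloseShield(InputStream in) { super(in); }
        @Override public void close() { /* deliberately ignored */ }
    }

    // Parse '\n'-delimited documents from 'in' one at a time, surviving the
    // SAXException thrown at each delimiter. Returns the root tags seen.
    public static List<String> parseAll(InputStream in) throws Exception {
        final List<String> roots = new ArrayList<String>();
        PushbackInputStream pin = new PushbackInputStream(in);
        SAXParserFactory spf = SAXParserFactory.newInstance();
        while (true) {
            int b = pin.read();
            while (b == '\n') b = pin.read();   // skip delimiter newlines
            if (b == -1) break;                 // stream exhausted
            pin.unread(b);
            try {
                spf.newSAXParser().parse(new InputSource(new CloseShield(pin)),
                        new DefaultHandler() {
                            boolean seenRoot = false;
                            @Override public void startElement(String uri,
                                    String local, String qName, Attributes attrs) {
                                if (!seenRoot) { roots.add(qName); seenRoot = true; }
                            }
                        });
            } catch (SAXException e) {
                // "junk after document element" at the delimiter; the document
                // before it was already delivered to the handler. Caveat: the
                // parser's internal buffer may have consumed bytes past the
                // delimiter, swallowing the start of the next document.
            }
        }
        return roots;
    }
}
```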
Tattoo answered 16/8, 2011 at 5:54 Comment(2)
Thanks @Femi for the response. CloseShieldInputStream looks promising; let me fool around with it and I'll get back to you. – Marielamariele
Well @Femi, CloseShieldInputStream did the trick for the moment. Now I am able to catch the ExpatParserParseException, ignore it, and read again, so for the moment my code runs well. – Marielamariele

If your SAX parser supports a push model (where you push raw data chunks into it yourself and it fires events as it parses the raw data), then you can simply push your own initial XML tag at the beginning of the SAX session. That becomes the top-level document tag; you can then push the responses as you receive them and they will be second-level tags as far as SAX is concerned. That way, you can push multiple responses in the same SAX session, and in the OnTagOpen event (or whatever you are using) you will know a new response has begun when you detect its tag name at level 1.
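Java's stock SAX parser pulls from a stream rather than accepting pushed chunks, but the same wrapping trick can be approximated by splicing a synthetic root element around the whole session, for example with SequenceInputStream. Each response then arrives as a level-1 element, and the '\n' delimiters become ignorable whitespace between them. A minimal sketch (class and tag names are illustrative; note that parse() only returns at end of stream, though the handler sees each response as it streams in):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class WrappedStreamParser {

    // Parse a stream of concatenated <responseN> documents by wrapping the
    // whole session in a synthetic <stream> root. Returns the level-1 tag
    // names, i.e. one entry per response.
    public static List<String> parseSession(InputStream socketIn) throws Exception {
        List<InputStream> parts = Arrays.<InputStream>asList(
                new ByteArrayInputStream("<stream>".getBytes("UTF-8")),
                socketIn,                                          // the raw session
                new ByteArrayInputStream("</stream>".getBytes("UTF-8")));
        InputStream wrapped = new SequenceInputStream(Collections.enumeration(parts));

        final List<String> responses = new ArrayList<String>();
        SAXParserFactory.newInstance().newSAXParser().parse(wrapped, new DefaultHandler() {
            int depth = 0;
            @Override public void startElement(String uri, String local,
                    String qName, Attributes attrs) {
                if (depth == 1) responses.add(qName);  // a new response begins
                depth++;
            }
            @Override public void endElement(String uri, String local, String qName) {
                depth--;
            }
        });
        return responses;
    }
}
```

The '\n' between responses is reported as character data between level-1 elements and can simply be ignored, exactly as described above.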

Blackball answered 16/8, 2011 at 23:37 Comment(11)
Well @Remy, thanks for the reply. "you will know when a new response begins when you detect its tag name at level 1": in my case the second response never gets started. After SAX has completely parsed the first response, it throws an exception on the second one because of the '\n' delimiter between the two responses. And it also closes the stream. – Marielamariele
That would only be true if you are parsing the responses as top-level documents. XML only supports one tag at level 0; you cannot have multiple top-level tags. Hence my suggestion to push the responses down to level 1 by wrapping them in a custom tag at level 0. I use this technique when pushing streaming XML from TCP through a SAX parser and it works very well. – Blackball
OK, now I am getting what you mean: you create a temporary top-level tag wrapping the whole XML chunk read from the socket, right? But consider the case where one chunk is only about a quarter of a response (the response is that big). When I wrap that chunk with my own tag, the chunk won't close many inner tags (their closing tags only arrive in the next chunk). Won't the XML break then? – Marielamariele
You don't wrap the individual chunks; you push them as-is. You only wrap the entire parsing session. Push a custom open tag at the beginning of the session, then read as many chunks as you need to, pushing each one as-is. Each response will have its own open/close tags as normal. When there are no more responses to read, push the custom close tag to finish the session. – Blackball
But the problem is that in my case there are multiple responses, each that big. So one chunk read from the stream might contain the end of one big response, a delimiter ('\n'), and part of the next response. I cannot safely say that one response ends here. – Marielamariele
You are still not getting it: a push-based SAX parser does not care about those issues at all! It is designed to work with arbitrary chunks of any size covering any portion of the total data. As arbitrary chunks are pushed into the parser, the parser buffers what it needs internally and fires events whenever finished tags exist in the buffer. You will receive events for each individual response, datas, and data tag that is pushed into the parser, regardless of how many chunks it took to receive them from the stream... – Blackball
... and you don't have to worry about the line break at all, because it falls between tags and is thus just whitespace that can be discarded. – Blackball
Well, now (after one whole month) I am starting to get what you mean (I was very new to streamed parsing and SAX back then). But I have a question about your logic: what if each of these individual responses has an XML declaration (<?xml version="1.0" encoding="UTF-8"?>)? I think one XML document can only have one such declaration, and it has to be at the start, right? With your approach, all my responses become part of one big document, so they shouldn't contain declarations inside. Any chance of SAX ignoring the inner declarations? – Marielamariele
If there are multiple XML declarations, then you will have to strip them off manually before pushing the rest of the data into the parser (but only if the charset is always the same, or if the parser lets you change its active charset dynamically). Otherwise, you would have to do your own buffering and parse individual documents again, without using a push parser at all. – Blackball
Hmm, just as I thought then. So back to square one again. I don't have any control over this XML stream, so the only thing I can do is change my parsing code. Thanks Remy for the response. – Marielamariele
@Marielamariele A few years ago I had a C++ project where I didn't have a delimiter between the XML messages I received from a socket (and each message began with an XML declaration), so I delimited one document in the buffer (a vector) and fed that to Xerces-C++ to be parsed in a SAX way (a pointer into the buffer and the number of bytes). After parsing, I just shifted the buffer by the number of bytes of the document that had been parsed and waited for the next document. I think the proper way is for the server, before sending the XML, to send its size (as a fixed number of bytes). – Hose
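The length-prefix framing suggested in the last comment could look like this on the Java side, assuming the server can be changed to write a 4-byte big-endian length before each document (method names are illustrative):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Length-prefixed framing: each message is preceded by its byte length as a
// 4-byte big-endian int, so the reader knows exactly where one document ends
// and the next begins -- no delimiter scanning and no giant StringBuilder.
public class Framing {

    public static void writeFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);   // fixed-size length header
        out.write(payload);
        out.flush();
    }

    public static byte[] readFrame(DataInputStream in) throws IOException {
        int len = in.readInt();         // blocks until the 4 header bytes arrive
        byte[] buf = new byte[len];
        in.readFully(buf);              // blocks until the whole document arrives
        return buf;
    }
}
```

Each frame returned by readFrame can then be wrapped in a ByteArrayInputStream and handed to the SAX parser as an independent, well-formed document, with no risk of the parser's read-ahead crossing a document boundary.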

© 2022 - 2024 — McMap. All rights reserved.