Python 2 SAX parser: best speed and performance for large files?

So I've been using suds with great benefit to consume a web service.

I hit a performance issue: for some data the CPU would spike hard and the request would take more than 60s to complete. The request is served by gunicorn, which calls the web service through suds, and so on.

Looking into it with line_profiler, objgraph, memory_profiler etc., I found the culprit: it takes about 13s to parse a 9.2 MB XML file, which is the response from the web service.

That can't be normal, right? It's just 9.2 MB, and I see 99% of the time is spent parsing it. The parsing is done via "from xml.sax import make_parser", which means the standard Python library?

Are there any faster XML parsers out there for big files?

I'll look into exactly what kind of structure is in the XML, but so far I know it's a "UmResponse" which contains around 7000 "Document" elements, each containing 10-20 lines of elements.

EDIT: Investigating further, I see half of that 13s is spent in the suds Handler in suds/sax/ ... Hm, so it could be a suds problem and not the Python library, of course.

EDIT2: The suds unmarshaller accounted for most of the processing time, about 50s, and parsing with sax was also slow. pysimplesoap, which uses xml.dom.minidom, takes about 13s and a lot of memory. However, lxml.etree finishes in under 2s and objectify is also very fast, fast enough to use it instead of ElementTree (which is faster than cElementTree for this specific XML: 0.5s for one, 0.17s for the other).
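For anyone who wants to repeat this kind of comparison, here is a rough timing sketch. It is written for Python 3 and runs against a synthetic document, not the real response: the "UmResponse" and "Document" names come from the description above, while the FieldN child names and the sizes are made up. lxml is only timed if it happens to be installed, and the numbers will differ from the ones quoted here.

```python
import time
import xml.dom.minidom
import xml.etree.ElementTree as ET
import xml.sax


def build_sample(n_documents=7000, fields_per_document=15):
    """Build a synthetic reply: one UmResponse holding many Document elements."""
    parts = ["<UmResponse>"]
    for i in range(n_documents):
        parts.append("<Document>")
        for j in range(fields_per_document):
            # Field names are invented for the benchmark only.
            parts.append("<Field%d>value %d-%d</Field%d>" % (j, i, j, j))
        parts.append("</Document>")
    parts.append("</UmResponse>")
    return "".join(parts).encode("utf-8")


def time_parsers(data):
    """Return {parser name: seconds} for a single parse of `data` per parser."""
    candidates = {
        "xml.sax": lambda: xml.sax.parseString(data, xml.sax.ContentHandler()),
        "xml.dom.minidom": lambda: xml.dom.minidom.parseString(data),
        "xml.etree.ElementTree": lambda: ET.fromstring(data),
    }
    try:
        from lxml import etree  # optional third-party dependency
        candidates["lxml.etree"] = lambda: etree.fromstring(data)
    except ImportError:
        pass  # lxml not installed; time the stdlib parsers only

    timings = {}
    for name, parse in candidates.items():
        start = time.perf_counter()
        parse()
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    sample = build_sample()
    results = time_parsers(sample)
    for name, seconds in sorted(results.items(), key=lambda item: item[1]):
        print("%-24s %.3fs" % (name, seconds))
```

A single run per parser is noisy, so treat the output as an order-of-magnitude check rather than a benchmark.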

Solution: suds accepts the option retxml=True, which returns the raw XML without parsing and unmarshalling. From there I can process it faster with lxml.
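A minimal sketch of that switch, assuming an already constructed suds client. The helper name call_raw, the WSDL URL and the operation name in the usage comment are placeholders, not part of the real service.

```python
def call_raw(client, operation, *args, **kwargs):
    """Invoke a SOAP operation and return the raw reply instead of suds objects.

    `client` is expected to be a suds Client and `operation` the name of a
    method exposed on client.service (both supplied by the caller).
    """
    # With retxml enabled, suds hands back the reply as received and skips
    # its own SAX parsing and unmarshalling.
    client.set_options(retxml=True)
    return getattr(client.service, operation)(*args, **kwargs)


# Hypothetical usage; URL and operation name are placeholders:
#   from suds.client import Client
#   client = Client("http://example.com/service?wsdl")
#   raw_xml = call_raw(client, "GetDocuments", some_id)
```

The option can also be passed when the client is created, as Client(url, retxml=True), which avoids changing the options on a shared client.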

Parsing with sax inside suds took time, and the unmarshalling method in the suds source (bindings/binding), which uses the class umx/Typed quite a lot, took even more.

Solution, bypass all of that: pass retxml=True to the client so that suds doesn't do any parsing or unmarshalling (an awesome option in suds!). Instead, do it with lxml, which I found to be the fastest, somehow even faster than cElementTree.

from lxml import objectify
from lxml.etree import XMLParser

Another problem was that the XML had huge text nodes, more than 10 MB, so lxml would bail out. The XMLParser needs the flag huge_tree=True to accept and process such a large document. Set it like this; the set_element_class_lookup call is what really matters, because without it you don't get an ObjectifiedElement back.

parser = XMLParser(remove_blank_text=True, huge_tree=True)
parser.set_element_class_lookup(objectify.ObjectifyElementClassLookup())
objectify.set_default_parser(parser)
obj = objectify.fromstring(ret_xml)
# Iterate here and return Body or Body[0] or whatever you need,
# so all code that worked with the suds unmarshaller works with the objectified tree as well.
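To make the "return the Body" step concrete, here is a self-contained sketch under a few assumptions: the reply is a SOAP 1.1 envelope, the payload namespace urn:example:um and the Id/Title children are invented for the sample, and objectify.makeparser() is used as a shorthand for creating an XMLParser that already has the objectify class lookup installed.

```python
from lxml import objectify

# SOAP 1.1 envelope namespace; a SOAP 1.2 service uses a different URI.
SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def soap_payload(raw_xml):
    """Parse a raw SOAP reply (bytes) and return the first element in its Body."""
    # makeparser() returns an XMLParser wired to the objectify element lookup.
    parser = objectify.makeparser(remove_blank_text=True, huge_tree=True)
    envelope = objectify.fromstring(raw_xml, parser)
    body = envelope.find("{%s}Body" % SOAP_ENV_NS)
    if body is None:
        raise ValueError("reply has no SOAP Body element")
    return next(body.iterchildren())


# Made-up sample reply; namespace and child names are illustrative only.
SAMPLE = b"""<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <UmResponse xmlns="urn:example:um">
      <Document><Id>1</Id><Title>first</Title></Document>
      <Document><Id>2</Id><Title>second</Title></Document>
    </UmResponse>
  </soapenv:Body>
</soapenv:Envelope>"""

response = soap_payload(SAMPLE)
# Attribute access walks child elements; iterating yields same-named siblings.
titles = [doc.Title.text for doc in response.Document]
ids = [doc.Id.pyval for doc in response.Document]
```

One thing to watch: objectify attribute lookups inherit the namespace of the parent element, so a child in a different namespace has to be fetched with an explicit "{namespace}tag" name. That is why the sketch picks the payload out of the Body by iterating its children instead of writing body.UmResponse.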

The rest of the code, which looked up elements by property on the suds-unmarshalled result, then worked fine (after returning just the Body of the SOAP envelope); no need to hassle with XPath or iterparse over the XML elements.

objectify does its job in 1-2s, compared to 50-60s for suds unmarshalling.
