Python2 sax parser, best speed and performance for large files?

So I've been using suds to consume a web service, and it has worked very well.

I hit a performance issue: for some data the CPU would spike hard and the request would take more than 60s to complete. The request is served by gunicorn, which calls the web service through suds.

Looking into it with line_profiler, objgraph, memory_profiler, etc., I found the culprit: it takes about 13s to parse a 9.2 MB XML file, which is the response from the web service.

That can't be normal, right? It is only 9.2 MB, yet I see 99% of the time is spent parsing it. The parsing is done via "from xml.sax import make_parser", which means it is the standard Python library?

Are there any faster XML parsers out there for big files?

I'll look into exactly what kind of structure is in the XML, but so far I know it is an "UmResponse" containing around 7000 "Document" elements, each of which contains 10-20 lines of elements.

EDIT: Investigating further, I see that half of that 13s is spent in the suds Handler in suds/sax/ ... so it could be a suds problem and not the Python library, of course.

EDIT2: The suds unmarshaller used most of the processing time, about 50s. Parsing with sax was also slow, and pysimplesoap, which uses xml.dom.minidom, takes about 13s and lots of memory. However, lxml.etree takes less than 2s, and objectify is also very fast, fast enough to use it instead of ElementTree (which is faster than cElementTree for this specific XML: 0.5s for one, 0.17s for the other).
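For anyone who wants to reproduce this kind of comparison, here is a minimal sketch that times the standard-library parsers on a synthetic document. It is not the benchmark from the question: the Field0..FieldN element names and the helper names (build_sample, time_parsers and so on) are made up for illustration, and only the UmResponse/Document shape comes from the description above. lxml can be added to the list of candidates in the same way if it is installed.

```python
import time
import xml.sax
import xml.etree.ElementTree as ET
from xml.dom import minidom


def build_sample(n_docs=7000, fields=15):
    """Build a synthetic UmResponse; the Field* names are invented."""
    parts = ["<UmResponse>"]
    for i in range(n_docs):
        parts.append("<Document>")
        for j in range(fields):
            parts.append("<Field%d>value %d-%d</Field%d>" % (j, i, j, j))
        parts.append("</Document>")
    parts.append("</UmResponse>")
    return "".join(parts).encode("utf-8")


class CountingHandler(xml.sax.ContentHandler):
    """SAX handler that only counts Document start tags."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.documents = 0

    def startElement(self, name, attrs):
        if name == "Document":
            self.documents += 1


def count_with_sax(data):
    handler = CountingHandler()
    xml.sax.parseString(data, handler)
    return handler.documents


def count_with_minidom(data):
    return len(minidom.parseString(data).getElementsByTagName("Document"))


def count_with_etree(data):
    return len(ET.fromstring(data).findall("Document"))


def time_parsers(data):
    """Return {parser name: (Document count, elapsed seconds)}."""
    candidates = [
        ("sax", count_with_sax),
        ("minidom", count_with_minidom),
        ("ElementTree", count_with_etree),
    ]
    results = {}
    for name, func in candidates:
        start = time.time()
        count = func(data)
        results[name] = (count, time.time() - start)
    return results


if __name__ == "__main__":
    for name, (count, seconds) in sorted(time_parsers(build_sample()).items()):
        print("%-12s %d documents in %.3fs" % (name, count, seconds))
```

Absolute numbers will differ by machine and Python version, so treat the output only as a relative comparison on your own data.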

Solution: suds accepts the option retxml=True, which makes it return the raw XML without parsing and unmarshalling it. From there I can process it faster with lxml.
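A rough sketch of how the retxml option might be wired up, assuming suds is installed. The helper name fetch_raw_xml is hypothetical, and the WSDL URL and operation name are placeholders you pass in; nothing here is taken from the actual service in the question.

```python
def fetch_raw_xml(wsdl_url, operation, *args):
    """Call a SOAP operation and return the raw response XML.

    wsdl_url and operation are placeholders supplied by the caller.
    """
    # Imported lazily so the module loads even where suds is missing.
    from suds.client import Client

    # retxml=True asks suds to skip its own parsing and unmarshalling
    # and hand back the response XML as received.
    client = Client(wsdl_url, retxml=True)
    return getattr(client.service, operation)(*args)
```

The returned value can then be handed to whichever parser turns out fastest for the data in question.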

Lapillus answered 4/3, 2014 at 11:48 Comment(6)
You could try using the Python libxml2 binding or lxml. It might be slightly faster, but I don't think it'll make a huge difference. You've stated that it takes over 60 seconds to complete the request and only 13 seconds is spent parsing the file. Assuming that by some miracle you could speed up parsing to take 0 seconds, you're still looking at almost 50 seconds to complete the request. - Hiller
@Hiller that's true. Looking further into this, 13s is for parsing with the sax parser or the minidom parser (they are about the same speed), but then the suds unmarshaller method uses up the 50s. Ugh. So suds takes a lot of time due to the sax parser and its ContentHandler, and then building some kind of suds DOM takes more time than minidom takes. Fastest are ElementTree and objectify. - Lapillus
That's a very nice find. I use suds and should I ever be so (un)fortunate to need to parse a large XML response, your post is helpful. - Hiller
If you found the solution, post it as an answer instead of editing your question. - Jobholder
Glad someone finds this useful, suds is awesome otherwise. - Lapillus
You should accept your answer @Lapillus - Slav

Parsing with sax in suds took time, and the unmarshalling method in the suds source (bindings/binding), which relies heavily on the class umx/Typed, took much more.

Solution: bypass all of that. Pass retxml=True to the client so that suds doesn't do the parsing and unmarshalling (an awesome option in suds!). Instead, do it with lxml, which I found to be the fastest, somehow even faster than cElementTree.

from lxml import objectify
from lxml.etree import XMLParser

Another problem was that the XML had huge text nodes, more than 10 MB, so lxml would bail out; the XMLParser needs the flag huge_tree=True to swallow and process such a large data file. Set it up like this. The set_element_class_lookup call is what is really of great benefit: without it you don't really get an ObjectifiedElement back.

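# huge_tree=True lifts the libxml2 limits that reject very long text content
parser = XMLParser(remove_blank_text=True, huge_tree=True)
# make the parser build objectify elements instead of plain etree elements
parser.set_element_class_lookup(objectify.ObjectifyElementClassLookup())
objectify.set_default_parser(parser)
# ret_xml is the raw response that suds returns when retxml=True
obj = objectify.fromstring(ret_xml)
# iterate here and return Body or Body[0] or whatever you need,
# so all code which worked with the suds unmarshaller works with the objectified tree as well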

After that, the rest of the code, which looked up elements by attribute on the objects suds had unmarshalled, worked fine (just after returning the Body of the SOAP envelope); there is no need to hassle with XPath or to iterparse the XML elements.
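To illustrate what "looking up elements by attribute" looks like on an objectified SOAP response, here is a small self-contained sketch. The envelope is made up: it assumes a SOAP 1.1 namespace, and the Id and Title child elements are hypothetical, since the post does not list the real fields. Note that objectify resolves attribute access in the parent's namespace, which is why the payload under Body is fetched with iterchildren() rather than as an attribute.

```python
from lxml import etree, objectify

# Hypothetical response: only UmResponse/Document come from the post,
# the Id and Title fields and the SOAP 1.1 namespace are assumptions.
ret_xml = b"""<soapenv:Envelope
    xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <UmResponse>
      <Document><Id>1</Id><Title>first</Title></Document>
      <Document><Id>2</Id><Title>second</Title></Document>
    </UmResponse>
  </soapenv:Body>
</soapenv:Envelope>"""

parser = etree.XMLParser(remove_blank_text=True, huge_tree=True)
parser.set_element_class_lookup(objectify.ObjectifyElementClassLookup())

envelope = objectify.fromstring(ret_xml, parser)

# Body shares the envelope's namespace, so plain attribute access works.
body = envelope.Body

# The payload is in no namespace, so take the first child element instead.
response = next(body.iterchildren())

# Iterating over response.Document walks all Document siblings.
documents = [(doc.Id.pyval, doc.Title.text) for doc in response.Document]
print(documents)
```

Passing the parser to fromstring directly, instead of calling set_default_parser, keeps the setting local to this call; either approach should behave the same for a single parse.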

objectify does its job in 1-2s, compared to 50-60s for suds unmarshalling.

Lapillus answered 10/3, 2014 at 8:44 Comment(0)
