A lightweight XML parser efficient for large files?

9

8

I need to parse potentially huge XML files, so I guess this rules out DOM parsers.

Is there any good lightweight SAX parser for C++, comparable to TinyXML in footprint? The structure of the XML is very simple; no advanced features like namespaces or DTDs are needed, just elements, attributes and CDATA.

I know about Xerces, but its sheer size of over 50 MB gives me shivers.

Thanks!

Fiertz answered 17/6, 2009 at 11:53 Comment(1)
7

If you are using C, you can use LibXML from the GNOME project. You can choose between DOM and SAX interfaces to your document, plus lots of additional features that have been developed over the years. If you really want C++, you can use libxml++, which is a C++ OO wrapper around LibXML.

The library has been proven again and again, is high performance, and can be compiled on almost any platform you can find.
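
For a streaming parse, a minimal SAX sketch with libxml2 might look like the following (this assumes the libxml2 development headers are installed and you link with -lxml2; the file name and the printouts are only illustrative):

// Minimal libxml2 SAX sketch: the callbacks fire as the file streams through,
// so memory use stays flat regardless of file size.
#include <libxml/parser.h>
#include <cstdio>
#include <cstring>

static void onStartElement(void * /*ctx*/, const xmlChar *name, const xmlChar **attrs)
{
  std::printf("start: %s\n", reinterpret_cast<const char *>(name));
  // attrs is a NULL-terminated list of name/value pairs
  for (const xmlChar **a = attrs; a && *a; a += 2)
    std::printf("  attr %s=\"%s\"\n",
                reinterpret_cast<const char *>(a[0]),
                reinterpret_cast<const char *>(a[1]));
}

static void onCharacters(void * /*ctx*/, const xmlChar *ch, int len)
{
  std::printf("text: %.*s\n", len, reinterpret_cast<const char *>(ch));
}

int main()
{
  xmlSAXHandler handler;
  std::memset(&handler, 0, sizeof(handler));
  handler.startElement = onStartElement;
  handler.characters   = onCharacters;

  // Streams the file through the callbacks; the whole document is never
  // held in memory at once.
  int rc = xmlSAXUserParseFile(&handler, nullptr, "huge.xml");
  xmlCleanupParser();
  return rc == 0 ? 0 : 1;
}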

Enyo answered 17/6, 2009 at 11:59 Comment(4)
Thanks for the answer. Is LibXML lightweight? How many kilobytes does it add to the executable? – Fiertz
If you're using a dynamic library (UNIX shared lib / Windows DLL), then the answer is "none". A quick check on my Linux box shows that the shared lib is 1.2M and the static library (to be compiled into programs) is 1.5M. So if you did a static compile you'd be adding 1.5M-ish to your exe. – Enyo
My whole .exe is around 350 KB, so I'd rather find something more lightweight, but thanks anyway. – Fiertz
If you're truly worried about size, try Expat at expat.sourceforge.net. Its shared library size on my Linux box is 133K; I'm guessing that a statically compiled .a linked into your code would add about that much. – Enyo
6

I like Expat:
http://expat.sourceforge.net/

It is C-based, but there are several C++ wrappers around to help.
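
A rough sketch of Expat's streaming API (this assumes the Expat development headers are installed and you link with -lexpat; the file name is just a placeholder):

// Minimal Expat sketch: feed the document in fixed-size chunks so memory
// use stays flat no matter how large the file is.
#include <expat.h>
#include <cstdio>

static void XMLCALL onStart(void * /*userData*/, const XML_Char *name, const XML_Char **atts)
{
  std::printf("start: %s\n", name);
  for (int i = 0; atts[i]; i += 2)                    // name/value pairs
    std::printf("  attr %s=\"%s\"\n", atts[i], atts[i + 1]);
}

static void XMLCALL onEnd(void * /*userData*/, const XML_Char *name)
{
  std::printf("end: %s\n", name);
}

static void XMLCALL onText(void * /*userData*/, const XML_Char *s, int len)
{
  std::printf("text: %.*s\n", len, s);
}

int main()
{
  XML_Parser parser = XML_ParserCreate(nullptr);
  XML_SetElementHandler(parser, onStart, onEnd);
  XML_SetCharacterDataHandler(parser, onText);

  std::FILE *f = std::fopen("huge.xml", "rb");
  if (!f) return 1;

  char buf[16 * 1024];
  size_t n;
  while ((n = std::fread(buf, 1, sizeof(buf), f)) > 0)
    if (XML_Parse(parser, buf, static_cast<int>(n), 0) == XML_STATUS_ERROR)
      break;
  XML_Parse(parser, nullptr, 0, 1);                   // signal end of document

  std::fclose(f);
  XML_ParserFree(parser);
  return 0;
}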

Zermatt answered 17/6, 2009 at 17:1 Comment(0)
3

RapidXML is a very fast XML parser written in C++.
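
A small sketch of the typical usage (this assumes rapidxml.hpp is on your include path; the file and element names are placeholders). Note that RapidXML parses in situ, so the whole document has to be loaded into a writable buffer that outlives the document object:

// Minimal RapidXML sketch: load the file into a mutable, zero-terminated
// buffer, then parse it in place and walk the resulting DOM.
#include "rapidxml.hpp"
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main()
{
  std::ifstream in("config.xml", std::ios::binary);
  std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                        std::istreambuf_iterator<char>());
  buf.push_back('\0');

  rapidxml::xml_document<> doc;
  doc.parse<0>(buf.data());                 // modifies buf in place

  rapidxml::xml_node<> *root = doc.first_node();
  if (!root) return 1;
  for (rapidxml::xml_node<> *rec = root->first_node("record");
       rec; rec = rec->next_sibling("record"))
  {
    rapidxml::xml_attribute<> *id = rec->first_attribute("id");
    std::cout << (id ? id->value() : "") << ": " << rec->value() << '\n';
  }
  return 0;
}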

Critique answered 23/1, 2010 at 21:44 Comment(2)
Crashes on Android; can't use exceptions. – Tiler
This is a DOM parser, but it parses "in situ", i.e. it modifies the source XML data, so you have to load all of the data. – Tol
2

http://sourceforge.net/projects/wsdlpull is a straight C++ port of the Java XmlPull API (http://www.xmlpull.org/).

I would highly recommend this parser. I had to customize it for use on my embedded device (no STL support), but I have found it to be very fast with very little overhead. I had to write my own string and vector classes, and even with those it compiles to about 60 KB on Windows.

I think that pull parsing is a lot more intuitive than something like SAX. The code much more closely mirrors the XML document, making it easy to correlate the two.

The one downside is that it is forward-only, meaning that you need to parse the elements as they come. We have a fairly messed-up design for reading our config files: I need to parse a whole subtree, make some checks, set some defaults, then parse again. With this parser the only real way to handle something like that is to make a copy of the state, parse with that, then continue on with the original. It still ends up being a big win in terms of resources versus our old DOM parser.

Sopor answered 17/6, 2009 at 18:37 Comment(1)
It parses a character at a time and uses an int for the character. For element and attribute names it has a rather restrictive definition of what a valid identifier is (basically ASCII), but it probably wouldn't take much to change that. It comes with a project that does a parse/serialize test, so it is pretty easy to run it across some representative data to try it out. – Sopor
1

If your XML structure is very simple, you can consider building a simple lexer/scanner based on lex/yacc (flex/bison). The sources at the W3C may inspire you: http://www.w3.org/XML/9707/parser.y and http://www.w3.org/XML/9707/scanner.l.

See also the SAX2 interface in libxml.

Carangid answered 17/6, 2009 at 12:1 Comment(0)
1

firstobject's CMarkup is a C++ class that works as a lightweight pull parser for huge files (I recommend a pull parser rather than SAX), and as a huge XML file writer too. It adds about 250 KB to your executable. When used in memory, it has 1/3 the footprint of TinyXML according to one user's report. When used on a huge file, it only holds a small buffer (around 16 KB) in memory. CMarkup is currently a commercial product, so it is supported, documented, and designed to be easy to add to your project with a single .cpp and .h file.

The easiest way to try it out is with a script in the free firstobject XML editor such as this:

ParseHugeXmlFile()
{
  CMarkup xml;
  xml.Open( "HugeFile.xml", MDF_READFILE );
  while ( xml.FindElem("//record") )
  {
    // process record...
    str sRecordId = xml.GetAttrib( "id" );
    xml.IntoElem();
    xml.FindElem( "description" );
    str sDescription = xml.GetData();
  }
  xml.Close();
}

From the File menu, select New Program, paste this in and modify it for your elements and attributes, press F9 to run it or F10 to step through it line by line.

Greenland answered 28/9, 2009 at 17:3 Comment(0)
1

You can try https://github.com/thinlizzy/die-xml. It seems to be very small and easy to use.

It is a recently written, open-source C++0x SAX XML parser, and the author welcomes feedback.

It parses an input stream and generates events through callbacks compatible with std::function.

The stack machine uses finite automata as a backend, and some events (start tags and text nodes) expose iterators in order to minimize buffering, making it pretty lightweight.

Acrodont answered 23/11, 2011 at 14:40 Comment(0)
0

I'd look at tools that generate a DTD/Schema-specific parser if you want small and fast. These are very good for huge documents.

Feriga answered 4/9, 2009 at 3:46 Comment(0)
0

I highly recommend pugixml

pugixml is a light-weight C++ XML processing library.

"pugixml is a C++ XML processing library, which consists of a DOM-like interface with rich traversal/modification capabilities, an extremely fast XML parser which constructs the DOM tree from an XML file/buffer, and an XPath 1.0 implementation for complex data-driven tree queries. Full Unicode support is also available, with Unicode interface variants and conversions between different Unicode encodings."

I have tested a few XML parsers including a few expensive ones before choosing and using pugixml in a commercial product.

pugixml was not only the fastest parser but also had the most mature and friendly API. I highly recommend it. It is a very stable product! I started using it at version 0.8, and it is now at 1.7.

The great bonus in this parser is its XPath 1.0 implementation! For more complex tree queries XPath is a godsend.

The DOM-like interface with rich traversal/modification capabilities is extremely useful for tackling real-life "heavy" XML files.

It is a small, fast parser. It is a good choice even for an iOS or Android app if you do not mind linking C++ code.

Benchmarks can tell a lot. See: http://pugixml.org/benchmark.html

A few examples (x86):

- pugixml is more than 38 times faster than TinyXML,
- 4.1 times faster than CMarkup,
- 2.7 times faster than Expat or libxml.

On x64, pugixml is the fastest parser I know of.

Also check how much memory your XML parser uses. Some parsers just gobble up precious memory!
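
A minimal sketch of both the DOM traversal and an XPath query (this assumes pugixml is linked in; the file name and the records/record/@id layout are only examples):

// Minimal pugixml sketch: load the document, walk it with the DOM API,
// then run an XPath 1.0 query over the same tree.
#include "pugixml.hpp"
#include <iostream>

int main()
{
  pugi::xml_document doc;
  pugi::xml_parse_result result = doc.load_file("data.xml");
  if (!result)
  {
    std::cerr << "parse error: " << result.description() << '\n';
    return 1;
  }

  // DOM-style traversal
  for (pugi::xml_node rec : doc.child("records").children("record"))
    std::cout << rec.attribute("id").value() << ": "
              << rec.child_value("description") << '\n';

  // XPath 1.0 query
  for (pugi::xpath_node hit : doc.select_nodes("//record[@id='42']"))
    std::cout << "matched: " << hit.node().child_value("description") << '\n';

  return 0;
}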

Wawro answered 9/6, 2016 at 0:46 Comment(1)
The question asked for a SAX parser. It is not really viable to load extremely large XML files into a DOM structure. – Intercommunicate
