How to combine large XML files using MSXML SAX in Delphi

Edit: My (incomplete and very rough) XmlLite header translation is available on GitHub

What is the best way to do a simple combine of massive XML documents in Delphi with MSXML without using DOM? Should I use the COM components SAXReader and XMLWriter and are there any good examples?

The transformation simply combines all the Contents elements under the root (Container) element of many big files (60 MB+ each) into one huge file (~1 GB).

<Container>
    <Contents />
    <Contents />
    <Contents />
</Container>

I have it working in the following C# code using an XmlWriter and XmlReaders, but it needs to happen in a native Delphi process:

var files = new string[] { @"c:\bigFile1.xml", @"c:\bigFile2.xml", @"c:\bigFile3.xml", @"c:\bigFile4.xml", @"c:\bigFile5.xml", @"c:\bigFile6.xml" };

using (var writer = XmlWriter.Create(@"c:\HugeOutput.xml", new XmlWriterSettings{ Indent = true }))
{
    writer.WriteStartElement("Container");

    foreach (var inputFile in files)
        using (var reader = XmlReader.Create(inputFile))
        {
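            reader.MoveToContent(); // position on the root Container element
            reader.Read();          // step inside the root
            // WriteNode already advances the reader past the copied element,
            // so only call Read() when nothing was copied. Otherwise adjacent
            // Contents elements with no whitespace between them get skipped.
            while (!reader.EOF)
                if (reader.IsStartElement("Contents"))
                    writer.WriteNode(reader, true);
                else
                    reader.Read();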
        }

    writer.WriteEndElement(); //End the Container element
}

We already use MSXML DOM in other parts of the system and I do not want to add new components if possible.
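To make the reader-to-writer idea concrete, here is a minimal sketch of the SAX pipeline in Python rather than Delphi. It is not MSXML code: `xml.sax` and `XMLGenerator` stand in for the SAXXMLReader and MXXMLWriter COM components, on the assumption that the writer can be driven by forwarded content-handler events in the same way. The names `ContentsForwarder` and `combine` are hypothetical and exist only for this illustration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class ContentsForwarder(xml.sax.ContentHandler):
    """Forward only <Contents> subtrees (direct children of the root) to a writer."""

    def __init__(self, writer):
        super().__init__()
        self.writer = writer   # any SAX ContentHandler; here an XMLGenerator
        self.depth = 0         # current element nesting depth (root is 1)
        self.copying = False   # True while inside a top-level <Contents>

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 2 and name == "Contents":
            self.copying = True
        if self.copying:
            self.writer.startElement(name, attrs)

    def endElement(self, name):
        if self.copying:
            self.writer.endElement(name)
        if self.depth == 2:
            # Leaving a direct child of the root: stop copying.
            self.copying = False
        self.depth -= 1

    def characters(self, content):
        if self.copying:
            self.writer.characters(content)


def combine(sources, out):
    """Stream every top-level <Contents> from each source into one <Container>."""
    writer = XMLGenerator(out, encoding="utf-8")
    writer.startDocument()
    writer.startElement("Container", {})
    for source in sources:
        # Each source is parsed event by event; no tree is built in memory.
        xml.sax.parse(source, ContentsForwarder(writer))
    writer.endElement("Container")
    writer.endDocument()
```

Calling `combine(["bigFile1.xml", "bigFile2.xml"], open("HugeOutput.xml", "w", encoding="utf-8"))` would be the file-based usage. Other direct children of the root, such as a header element, are dropped because they never switch `copying` on.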

Misprint answered 4/8, 2011 at 14:13 Comment(10)
So you want to use SAX to avoid consuming a few gigs of RAM? Does this SAX-with-MSXML demo help? keith-wood.name/DelphiXML/BookCode/Chapter%2013/index.html – Anselme
Yes, Delphi compiles 32-bit only and the DOM-based TXMLDocument wrapper for MSXML chokes with EOutOfMemory when documents reach ~100MB. – Misprint
My opinion is drop MSXML completely, and go with OmniXML. :-) You should be able to load a 1 gig XML file into a 32 bit process, in any sanely designed XML engine. – Anselme
This is a big enterprise system and we already use MSXML. Adding/switching components is a whole new problem in terms of dependencies, testing, and training... That is if I can convince our architect to buy in. – Misprint
I've always preferred to build a working solution and then later let the people who think they are in control of this find a way to rationalize the fact that the crap we had sucked, and the new stuff is boss, and then rewrite their internal bikeshed documentation to match reality. Enterprise = Lots of panties in a knot over how bad it would be if anything bad happens. :-) – Anselme
@warren SAX is the way to go for large data. DOM blows chunks for large data in 32 bit address space. – Genseric
I tried OmniXML, but it also chokes very quickly. – Misprint
Okay, I hope you can find some stable SAX code. I would have thought MSXML SAX would be just as broken as MSXML (and I'm guessing it is?) – Anselme
Updated XMLLite declarations: github.com/the-Arioch/Delphi-XmlLite/commit/… – Epimenides
I don't know if kluug's semi-commercial OXML would do better, but he does not answer emails, so it is not an option anyway. OmniXML is problematic for somewhat large files (I added a pseudo-answer below). For small XML files I usually use the SuperObject lib; it is easy for lazy use :-) – Epimenides

XmlLite is a native C++ port of the XML reader and writer from System.Xml, and it provides the same pull-parsing programming model. It ships with Windows Server 2003 SP2, Windows XP SP3, and later. Once you have a Delphi header translation, the C# code maps almost one-to-one to Delphi.
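For readers unfamiliar with the pull model, the following is a rough Python sketch of a pull-style loop, in which the consumer asks for the next event instead of receiving callbacks. It does not use XmlLite: `xml.etree.ElementTree.iterparse` stands in for the reader, and the function name `combine_pull` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def combine_pull(sources, out):
    """Pull <Contents> children from each source and append them to one binary output."""
    out.write(b"<Container>")
    for source in sources:
        depth = 0  # nesting depth of the element currently open (root is 1)
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "start":
                depth += 1
                continue
            depth -= 1
            if depth == 1:
                # elem is a direct child of the root and is now complete.
                if elem.tag == "Contents":
                    elem.tail = None  # do not copy whitespace that follows the element
                    out.write(ET.tostring(elem))
                # Release the subtree; only an empty shell stays attached to the root.
                elem.clear()
    out.write(b"</Container>")
```

Unlike a pure streaming reader, this sketch holds one complete `Contents` subtree in memory at a time, so it assumes individual `Contents` elements are small compared with the whole file.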

Leahy answered 7/8, 2011 at 0:41 Comment(5)
The Delphi/Object Pascal persistence framework tiOPF (wiki.freepascal.org/tiOPF) supports XmlLite, so I guess this open source project already includes the header translations. – Hyposensitize
Thanks Samuel, MS XmlLite works well! tiOPF seems to have something else called XmlLite (or I could not find the unit), so I wrote my own header translation for the bits I needed. – Misprint
@carlmon: maybe you could share your header translation? – Fridell
@Smasher It is very rough, but I created a repo: github.com/GenasysTechnologies/Delphi-XmlLite – Misprint
@Misprint I fixed some declarations there; hopefully it is Win64-ready now. Additionally, I am thinking about no longer supporting pre-2010 Delphi and pre-2.6.0 FPC. See comments at github.com/the-Arioch/Delphi-XmlLite/commit/… – Epimenides

I'd just use regular file I/O: writeln a <Container> to a text file, writeln each of the contents as a string, and finally writeln </Container>. If the data were a more reasonable size, I'd assemble everything in a stringlist and then stream that to disk. But if you're into GB territory, that would be risky.

Malamud answered 4/8, 2011 at 14:23 Comment(3)
Surely the Delphi SAX-with-MSXML thing is functional though? – Anselme
I may resort to this, but I forgot to mention one variable-sized header element in the files that needs to be ignored for the output. It makes a straight filestream a bit hacky... – Misprint
Resorting to this rather than using a tested working SAX parser would be silly. (I won't use new components, unless I invent them from scratch?) – Anselme

libxml with the Delphi wrapper Libxml2 might be an option (found here), it has some SAX support and seems to be very solid - the web page mentions that libxml2 passed all 1800+ tests from the OASIS XML Tests Suite. See also: Is there a SAX Parser for Delphi and Free Pascal?

Hyposensitize answered 4/8, 2011 at 14:57 Comment(2)
I wrote my own LibXML wrapper for Delphi 5 a few years ago, but we standardized on MSXML in newer Delphi to avoid bloat & dependencies - we were linking or shipping 3 different XML engines at one stage o_O. – Misprint
So now you're down to 1 and it's the buggiest one and it's part of the OS instead of shipping a known good version with your app. :-) – Anselme

Posting this as answer because it needs some space and formatting.

I've got one baaad data file for tests; see the commit message at https://github.com/the-Arioch/omnixml/commit/d1a544048e86921983fced67c772944f12cb1427

Here OmniXML kind of sucks in XE2 debug build:

  • About 25% more memory use than TXmlDocument/MSXML. Maybe even more after fixing the .NextSibling issue; I did not re-test.
  • Longer file loading time (OTOH significantly faster reading of node properties: they are already Delphi-typed variables, so there is no crossing of the MSXML/Delphi boundary).
  • Absolutely no support for namespaces, which makes recognizing tags much harder.
  • XPath is in an embryonic state and, yet again, lacks namespace support.

https://docs.google.com/spreadsheets/d/1QcFVwh3fFfaDyRmv2b-n4Rq4_u5p42UfNbR_FZgZizY/edit?usp=sharing

Falstaffian answered 4/10, 2016 at 12:16 Comment(0)
