Load XDocument asynchronously

I want to load large XML documents into XDocument objects. The simple synchronous approach using XDocument.Load(path, loadOptions) works great, but blocks for an uncomfortably long time in a GUI context when loading large files (particularly from network storage).

I wrote this async version with the intention of improving responsiveness in document loading, particularly when loading files over the network.

    public static async Task<XDocument> LoadAsync(String path, LoadOptions loadOptions = LoadOptions.PreserveWhitespace)
    {
        String xml;

        using (var stream = File.OpenText(path))
        {
            xml = await stream.ReadToEndAsync();
        }

        return XDocument.Parse(xml, loadOptions);
    }

However, on a 200 MB raw XML file loaded from local disk, the synchronous version completes in a few seconds. The asynchronous version (running in a 32-bit context) instead throws an OutOfMemoryException:

   at System.Text.StringBuilder.ToString()
   at System.IO.StreamReader.<ReadToEndAsyncInternal>d__62.MoveNext()

I imagine this is because of the temporary string variable used to hold the raw XML in memory for parsing by the XDocument. Presumably in the synchronous scenario, XDocument.Load() is able to stream through the source file, and never needs to create a single huge String to hold the entire file.

Is there any way to get the best of both worlds? Load the XDocument with fully asynchronous I/O, and without needing to create a large temporary string?

Destefano answered 24/4, 2017 at 14:8 Comment(7)
Perhaps you should use XDocument.Load(stream)? - Eloyelreath
How would that make the load operation asynchronous? - Destefano
Well that in itself wouldn't, but it would eliminate the string variable you have here and hopefully the OOM exception. - Eloyelreath
@Eloyelreath Which is what the OP said they already did. But they need to do the operation asynchronously, not synchronously. - Spellbind
Wait for this or try to do it yourself. - Parasympathetic
What I'm doing in the meantime is just calling XDocument.Load(String path, LoadOptions options) in a background Task using await Task.Run(). It's not true asynchronous IO since it uses a thread pool thread to run the loading process, possibly with a lot of waiting for IO under the hood, rather than being driven by IO events. Might be Good Enough though. - Destefano
Based on that stack trace, it might be possible for you to load the whole thing into memory using a MemoryStream, then set MemoryStream.Position to 0 and load it (synchronously) with XDocument. That way you avoid needing to make a 200 MB string (which probably actually becomes 400 MB in .NET's UTF-16 encoding, since the file is likely mostly ASCII and takes 200 MB as UTF-8). However, the accepted answer lets you avoid building a separate buffer entirely, which in this environment makes it the best choice even though it blocks. - Mervin
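
The MemoryStream idea from the last comment could look roughly like the sketch below. It is only an illustration: the class and method names (XDocumentBufferedLoader, LoadViaMemoryStreamAsync) and the 81920-byte buffer size are assumptions, not anything from the thread. The file bytes are copied asynchronously into memory, and only the parse is synchronous.

```csharp
using System.IO;
using System.Threading.Tasks;
using System.Xml.Linq;

// Hypothetical helper, named here for illustration only.
public static class XDocumentBufferedLoader
{
    public static async Task<XDocument> LoadViaMemoryStreamAsync(
        string path,
        LoadOptions loadOptions = LoadOptions.PreserveWhitespace)
    {
        // useAsync: true asks the OS for overlapped (asynchronous) file I/O.
        using (var file = new FileStream(path, FileMode.Open, FileAccess.Read,
                                         FileShare.Read, 81920, useAsync: true))
        // Pre-size the buffer to the file length so it does not have to grow by doubling.
        using (var buffer = new MemoryStream(checked((int)file.Length)))
        {
            // Asynchronous part: copy the raw bytes (no UTF-16 string is created).
            await file.CopyToAsync(buffer).ConfigureAwait(false);

            // Synchronous part: parse from memory. Because of ConfigureAwait(false),
            // this continuation is not forced back onto the caller's (UI) context.
            buffer.Position = 0;
            return XDocument.Load(buffer, loadOptions);
        }
    }
}
```

This still holds the whole file in memory next to the resulting DOM, so in a 32-bit process it may remain tight for very large files.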

XDocument.LoadAsync() is available in .NET Core 2.0: https://learn.microsoft.com/en-us/dotnet/api/system.xml.linq.xdocument.loadasync?view=netcore-2.0
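
A minimal sketch of how this might be called for a file path, assuming a runtime that provides XDocument.LoadAsync (it is not part of .NET Framework). The wrapper class and method names (XDocumentFileLoader, LoadFromFileAsync) and the 4096-byte buffer size are made up for illustration. The FileStream is opened with useAsync: true so that the underlying reads can be asynchronous as well.

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Xml.Linq;

// Hypothetical wrapper, named here for illustration only.
public static class XDocumentFileLoader
{
    public static async Task<XDocument> LoadFromFileAsync(
        string path,
        LoadOptions loadOptions = LoadOptions.PreserveWhitespace,
        CancellationToken cancellationToken = default(CancellationToken))
    {
        // useAsync: true requests asynchronous file I/O from the operating system.
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, bufferSize: 4096, useAsync: true))
        {
            // The document is built while streaming; no intermediate string is created.
            return await XDocument.LoadAsync(stream, loadOptions, cancellationToken)
                                  .ConfigureAwait(false);
        }
    }
}
```

From a GUI event handler this would be awaited directly, for example `var doc = await XDocumentFileLoader.LoadFromFileAsync(path);`.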

Showdown answered 20/2, 2018 at 4:53 Comment(0)

Late answer, but I needed the async read as well on a "legacy" .NET Framework version, so I worked out a way to read the content truly asynchronously without resorting to buffering the XML data in memory.

The writer returned by XDocument.CreateWriter() does not support async writing, so XmlWriter.WriteNodeAsync() fails. The code therefore performs async reads and turns them into sync writes on the XDocument writer, following the way XmlWriter.WriteNodeAsync() itself works. Because the writer only builds an in-memory DOM, the sync writes involve no I/O, so this is arguably even better than doing async writes.

public static async Task<XDocument> LoadAsync(Stream stream, LoadOptions loadOptions) {
    using (var reader = XmlReader.Create(stream, new XmlReaderSettings() {
            DtdProcessing = DtdProcessing.Ignore,
            IgnoreWhitespace = (loadOptions&LoadOptions.PreserveWhitespace) == LoadOptions.None,
            XmlResolver = null,
            CloseInput = false,
            Async = true
    })) {
        var result = new XDocument();
        using (var writer = result.CreateWriter()) {
            do {
                switch (reader.NodeType) {
                case XmlNodeType.Element:
                    writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                    writer.WriteAttributes(reader, true);
                    if (reader.IsEmptyElement) {
                        writer.WriteEndElement();
                    }
                    break;
                case XmlNodeType.Text:
                    writer.WriteString(await reader.GetValueAsync().ConfigureAwait(false));
                    break;
                case XmlNodeType.CDATA:
                    writer.WriteCData(reader.Value);
                    break;
                case XmlNodeType.EntityReference:
                    writer.WriteEntityRef(reader.Name);
                    break;
                case XmlNodeType.ProcessingInstruction:
                case XmlNodeType.XmlDeclaration:
                    writer.WriteProcessingInstruction(reader.Name, reader.Value);
                    break;
                case XmlNodeType.Comment:
                    writer.WriteComment(reader.Value);
                    break;
                case XmlNodeType.DocumentType:
                    writer.WriteDocType(reader.Name, reader.GetAttribute("PUBLIC"), reader.GetAttribute("SYSTEM"), reader.Value);
                    break;
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    writer.WriteWhitespace(await reader.GetValueAsync().ConfigureAwait(false));
                    break;
                case XmlNodeType.EndElement:
                    writer.WriteFullEndElement();
                    break;
                }
            } while (await reader.ReadAsync().ConfigureAwait(false));
        }
        return result;
    }
}

Darkish answered 21/1, 2021 at 12:36 Comment(0)

First of all, the task is not being run asynchronously. You would need to use either a built-in async IO command or spin up a task on the thread pool yourself. For example:

public static Task<XDocument> LoadAsync(
    String path,
    LoadOptions loadOptions = LoadOptions.PreserveWhitespace)
{
    return Task.Run(() =>
    {
        using (var stream = File.OpenText(path))
        {
            return XDocument.Load(stream, loadOptions);
        }
    });
}

If you use the stream (or TextReader) overload of XDocument.Load rather than XDocument.Parse, you don't get a temporary string.

Taintless answered 25/4, 2017 at 7:35 Comment(6)
Ok. This is what I outlined in my final comment on the question. So this will be using a thread-pool thread to drive the implicitly required I/O, as the XDocument chews its way through the stream. And that I/O will itself be sporadically blocking the Task's worker thread. Looks like this is the best that can be done, in the absence of a true XDocument.LoadAsync() implementation which uses proper Async I/O instructions under the hood. I don't see any advantage to explicitly calling File.OpenText though. May as well just call XDocument.Load(path). - Destefano
If you were reading 10s of thousands of XDocuments on a server in parallel you might be worried about stealing a thread from the thread pool rather than using true async IO, but is this really a concern? - Taintless
Probably not. Hence my comment that it's probably good enough. I upvoted and accepted anyway. - Destefano
Hey whatever happened to the 'await' in the 'return await Task.Run'? - Maestoso
@Maestoso There is no need for an await here because the task being returned is the last processing in the method. However, because there is no await keyword, there should not be an async keyword in the method definition either. - Aerobe
@Aerobe fixed - Taintless
