How can I query a Word docx in an ASP.NET app?
Asked Answered
E

3

5

I would like to upload a Word 2007 or greater docx file to my web server and convert the table of contents to a simple xml structure. Doing this on the desktop with traditional VBA seems like it would have been easy. Looking at the WordprocessingML XML data used to create the docx file is confusing. Is there a way (without COM) to navigate the document in more of an object-oriented fashion?

Exarate answered 18/8, 2009 at 21:50 Comment(0)
D
4

I highly recommend looking into the Open XML SDK 2.0. It's a CTP, but I've found it extremely useful in manipulating xmlx files without having to deal with COM at all. The documentation is a bit sketchy, but the key thing to look for is the DocumentFormat.OpenXml.Packaging.WordprocessingDocument class. You can pick apart the .docx document if you rename the extension to .zip and dig into the XML files there. From doing that, it looks like a Table of Contents is contained in a "Structured Document" tag and that things like the headings are in a hyperlink from there. Putzing around with it a bit, I found that something like this should work (or at least give you a starting point).

WordprocessingDocument wordDoc = WordprocessingDocument.Open(Filename, false);
SdtBlock contents = wordDoc.MainDocumentPart.Document.Descendants<SdtBlock>().First();
List<string> contentList = new List<string>();
foreach (Hyperlink section in contents.Descendants<Hyperlink>())
{
    contentList.Add(section.Descendants<Text>().First().Text);
}
Departmentalize answered 19/8, 2009 at 1:4 Comment(0)
O
3

Here is a blog post on querying Open XML WordprocessingML documents using LINQ to XML. Using that code, you can write a query as follows:

using (WordprocessingDocument doc =
    WordprocessingDocument.Open(filename, false))
{
    foreach (var p in doc.MainDocumentPart.Paragraphs())
    {
        Console.WriteLine("Style: {0}   Text: >{1}<",
            p.StyleName.PadRight(16), p.Text);
        foreach (var c in p.Comments())
            Console.WriteLine(
              "  Comment Author:{0}  Text:>{1}<",
              c.Author, c.Text);
    }
}

Blog post: Open XML SDK and LINQ to XML

-Eric

Othello answered 25/4, 2011 at 1:56 Comment(0)
O
0

See XML Documents and Data as a starting point. In particular, you'll want to use LINQ to XML.

In general, you do not want to use COM in a .NET application.

Oliva answered 18/8, 2009 at 21:58 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.