How to parse mathML in output of WordOpenXML?

I want to read only the XML that describes an equation, which I obtained using Paragraph.Range.WordOpenXML. However, the part of the output that describes the equation is not MathML, even though I had read that Microsoft Word equations are stored as MathML.

Do I need a special converter to get the desired XML, or is there another method?

Flatways answered 26/5, 2013 at 12:11 Comment(0)

You could use the OMML2MML.XSL file (located under %ProgramFiles%\Microsoft Office\Office15 for Office 2013; the folder name depends on the Office version) to transform the Office Math Markup Language (OMML) equations in a Word document into MathML.

The code below shows how to transform the equations in a Word document into MathML using the following steps:

  1. Open the Word document using the Open XML SDK (version 2.5).
  2. Create an XslCompiledTransform and load the OMML2MML.XSL file.
  3. Transform the Word document by calling the Transform() method on the created XslCompiledTransform instance.
  4. Output the result of the transform (e.g. print it to the console or write it to a file).

I've tested the code below with a simple Word document containing two equations, text and pictures.

using System.IO;
using System.Text; // Needed for Encoding.UTF8 below.
using System.Xml;
using System.Xml.Xsl;
using DocumentFormat.OpenXml.Packaging;

public string GetWordDocumentAsMathML(string docFilePath, string officeVersion = "14")
{
    string officeML = string.Empty;
    using (WordprocessingDocument doc = WordprocessingDocument.Open(docFilePath, false))
    {
        string wordDocXml = doc.MainDocumentPart.Document.OuterXml;

        XslCompiledTransform xslTransform = new XslCompiledTransform();

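        // OMML2MML.XSL ships with Office. The folder name depends on the
        // Office version: Office14 for Office 2010, Office15 for Office 2013.
        // Adjust the path if Office is installed in a different location.
        xslTransform.Load(@"c:\Program Files\Microsoft Office\Office" + officeVersion + @"\OMML2MML.XSL");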

        using (TextReader tr = new StringReader(wordDocXml))
        {
            // Load the xml of your main document part.
            using (XmlReader reader = XmlReader.Create(tr))
            {
                using (MemoryStream ms = new MemoryStream())
                {
                    XmlWriterSettings settings = xslTransform.OutputSettings.Clone();

                    // Configure xml writer to omit xml declaration.
                    settings.ConformanceLevel = ConformanceLevel.Fragment;
                    settings.OmitXmlDeclaration = true;

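                    XmlWriter xw = XmlWriter.Create(ms, settings);

                    // Transform the Office Math markup (OMML) to MathML.
                    xslTransform.Transform(reader, xw);

                    // Make sure all buffered output reaches the stream before reading it back.
                    xw.Flush();
                    ms.Seek(0, SeekOrigin.Begin);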

                    using (StreamReader sr = new StreamReader(ms, Encoding.UTF8))
                    {
                        officeML = sr.ReadToEnd();
                        // Console.Out.WriteLine(officeML);
                    }
                }
            }
        }
    }
    return officeML;
}
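The method above hard-codes the location of OMML2MML.XSL, which varies between Office versions and between 32-bit and 64-bit installs. As a rough sketch, the path could be probed instead. `OmmlStylesheetLocator` and `Find` are hypothetical names, not part of any library, and the `root\Office<version>` layout is an assumption about Click-to-Run installs that you should verify on your machine.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OmmlStylesheetLocator
{
    // Hypothetical helper: returns the first existing OMML2MML.XSL path for the
    // given Office version numbers (e.g. "15", "14"), or null if none is found.
    public static string Find(IEnumerable<string> officeVersions)
    {
        string[] roots =
        {
            Environment.GetFolderPath(Environment.SpecialFolder.ProgramFiles),
            Environment.GetFolderPath(Environment.SpecialFolder.ProgramFilesX86)
        };

        foreach (string version in officeVersions)
        {
            foreach (string root in roots)
            {
                if (string.IsNullOrEmpty(root))
                {
                    continue;
                }

                string[] candidates =
                {
                    // Layout mentioned in the answer above.
                    Path.Combine(root, "Microsoft Office", "Office" + version, "OMML2MML.XSL"),
                    // Assumed layout for Click-to-Run installs; verify locally.
                    Path.Combine(root, "Microsoft Office", "root", "Office" + version, "OMML2MML.XSL")
                };

                foreach (string candidate in candidates)
                {
                    if (File.Exists(candidate))
                    {
                        return candidate;
                    }
                }
            }
        }

        return null;
    }
}
```

The result, if not null, can be passed to XslCompiledTransform.Load() in place of the concatenated path.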

To convert a single equation rather than the whole Word document, query for the desired Office Math paragraph (m:oMathPara) and use the OuterXml property of that node. The code below shows how to query for the first math paragraph:

// Requires "using System.Linq;" for First().
string mathParagraphXml = 
      doc.MainDocumentPart.Document.Descendants<DocumentFormat.OpenXml.Math.Paragraph>().First().OuterXml;

Pass the returned XML to the StringReader in place of wordDocXml.
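As a minimal sketch of that last step, the transform can be factored into a small helper that takes the XML of one equation and an already loaded stylesheet. `OmmlConverter` and its `Transform` method are hypothetical names introduced here for illustration; only XslCompiledTransform, XmlReader and XmlWriter come from the .NET base class library. The helper assumes the stylesheet has already been loaded, for example from OMML2MML.XSL.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class OmmlConverter
{
    // Hypothetical helper: applies an already loaded stylesheet
    // (for example OMML2MML.XSL) to a single XML fragment.
    public static string Transform(string ommlXml, XslCompiledTransform xsl)
    {
        // Start from the stylesheet's own output settings, as in the answer's code.
        XmlWriterSettings settings = xsl.OutputSettings.Clone();
        settings.ConformanceLevel = ConformanceLevel.Fragment;
        settings.OmitXmlDeclaration = true;

        using (XmlReader reader = XmlReader.Create(new StringReader(ommlXml)))
        using (StringWriter sw = new StringWriter())
        {
            // Disposing the XmlWriter flushes its output into the StringWriter.
            using (XmlWriter xw = XmlWriter.Create(sw, settings))
            {
                xsl.Transform(reader, xw);
            }
            return sw.ToString();
        }
    }
}
```

With the stylesheet loaded into an XslCompiledTransform named xslTransform, a call would look like OmmlConverter.Transform(mathParagraphXml, xslTransform). Writing to a StringWriter avoids the byte-to-string decoding step of the MemoryStream version.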

Rawden answered 27/5, 2013 at 16:56 Comment(6)
Thanks @Rawden, this is also a very good solution. I was also wondering whether I can read the text of the paragraph and the inline objects as well? - Flatways
@Rawden, it works perfectly for the whole document, but using only a specific oMath node doesn't work. How should it be done? - Flatways
@Rawden, could you please tell me what the WordprocessingDocument reference is? Thanks. - Gleanings
@JeremyThompson: WordprocessingDocument is a class that represents a Word document. It is included in the Open XML SDK from Microsoft. - Rawden
Note that if you are working within Word itself, you can't substitute Document.Content.XML. You must save the document, then close it or make a copy (it can't be read while it is open in Word), and then open the file with the Open XML SDK. The equations in Document.Content.XML are escaped and embedded in a shape object. - Screech
Also, when processing only equations, converting only equation paragraphs (Descendants<DocumentFormat.OpenXml.Math.Paragraph>) will miss any equation that has surrounding text. Iterate through the OpenXml.Math.OfficeMath objects instead. You will sometimes see the same equation more than once because of the way Word stores duplicates, but you won't miss any. - Screech
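The suggestion in the last comment could be sketched roughly as follows. `EquationFinder` and `GetEquationXml` are hypothetical names; the sketch assumes the Open XML SDK is referenced and that dropping byte-for-byte identical fragments with Distinct() is acceptable, which would also collapse equations that legitimately occur more than once in the document.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using M = DocumentFormat.OpenXml.Math;

public static class EquationFinder
{
    // Hypothetical helper: returns the XML of every m:oMath element in the
    // main document part, including equations that sit inline with text.
    public static List<string> GetEquationXml(string docFilePath)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docFilePath, false))
        {
            return doc.MainDocumentPart.Document
                .Descendants<M.OfficeMath>()
                .Select(equation => equation.OuterXml)
                // Assumption: identical XML means a duplicate. Remove this call
                // to keep every occurrence.
                .Distinct()
                .ToList();
        }
    }
}
```

Each returned string can then be transformed on its own, in the same way as the math paragraph XML in the answer.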
