Simplify/clean up the XML of a DOCX Word document

I have a Microsoft Word Document (docx) and I use Open XML SDK 2.0 Productivity Tool to generate C# code from it.

I want to programmatically insert some database values into the document. To do this, I typed simple text such as [[place holder 1]] at the points where my program should replace the placeholders with database values.

Unfortunately, the XML output is something of a mess. For example, I have a table with two neighboring cells that should not differ from each other except for their placeholder text. But one of the placeholders is split into several runs.

[[good place holder]]

<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:tcPr>
    <w:tcW w:w="1798" w:type="dxa" />
    <w:shd w:val="clear" w:color="auto" w:fill="auto" />
  </w:tcPr>
  <w:p w:rsidRPr="008C2E16" w:rsidR="001F54BF" w:rsidP="000D7B67" w:rsidRDefault="0009453E">
    <w:pPr>
      <w:spacing w:after="0" w:line="240" w:lineRule="auto" />
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="0009453E">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t>[[good place holder]]</w:t>
    </w:r>
  </w:p>
</w:tc>

versus [[bad place holder]]

<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:tcPr>
    <w:tcW w:w="1799" w:type="dxa" />
    <w:shd w:val="clear" w:color="auto" w:fill="auto" />
  </w:tcPr>
  <w:p w:rsidRPr="008C2E16" w:rsidR="001F54BF" w:rsidP="000D7B67" w:rsidRDefault="00EA211A">
    <w:pPr>
      <w:spacing w:after="0" w:line="240" w:lineRule="auto" />
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="00EA211A">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t>[[</w:t>
    </w:r>
    <w:proofErr w:type="spellStart" />
    <w:r w:rsidRPr="00EA211A">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t>bad</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd" />
    <w:r w:rsidRPr="00EA211A">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t xml:space="preserve"> place holder]]</w:t>
    </w:r>
  </w:p>
</w:tc>

Is there any way to have Microsoft Word clean up my document so that all placeholders are easy to identify in the generated XML?

Wira answered 13/10, 2011 at 10:45 Comment(1)
Looks like your content is being chopped up by a "spelling error" marker... I'm curious as to why the word "bad" was identified as a spelling issue (is the document not set to English, maybe?), but never mind that; as amurra specified, you'll need to come up with a placeholder that doesn't feature in the target text but also isn't considered to be multiple words. - Entomostracan
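To make the effect of the spelling markers concrete, here is a minimal sketch (not taken from the thread, and not how any particular library implements it) that removes the `w:proofErr` elements from a paragraph and then merges neighboring runs whose `w:rPr` formatting is identical. It uses only `System.Xml.Linq` on the raw paragraph XML; the names `RunMerger` and `MergeRuns` are hypothetical, and the sketch assumes the XML was parsed without preserving insignificant whitespace.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class RunMerger
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Removes w:proofErr markers that are direct children of the paragraph,
    // then merges adjacent runs that carry identical w:rPr formatting.
    // Run-level attributes (such as w:rsidRPr) of the first run are kept.
    public static void MergeRuns(XElement paragraph)
    {
        paragraph.Elements(W + "proofErr").Remove();

        XElement previous = null;
        foreach (XElement child in paragraph.Elements().ToList())
        {
            bool simple = IsSimpleTextRun(child);
            if (simple && previous != null && SameFormatting(previous, child))
            {
                // Append this run's text to the previous run and drop this run.
                XElement target = previous.Element(W + "t");
                target.Value += child.Element(W + "t").Value;
                // Keep leading/trailing spaces in the merged text.
                target.SetAttributeValue(XNamespace.Xml + "space", "preserve");
                child.Remove();
            }
            else
            {
                // Anything that is not a plain text run breaks the merge chain.
                previous = simple ? child : null;
            }
        }
    }

    // A "simple" run contains exactly one w:t and nothing else except w:rPr.
    private static bool IsSimpleTextRun(XElement e)
    {
        return e.Name == W + "r"
            && e.Elements(W + "t").Count() == 1
            && e.Elements().All(c => c.Name == W + "rPr" || c.Name == W + "t");
    }

    // Two runs match when their w:rPr subtrees are deeply equal (or both absent).
    private static bool SameFormatting(XElement a, XElement b)
    {
        return XNode.DeepEquals(a.Element(W + "rPr"), b.Element(W + "rPr"));
    }
}
```

Applied to the `w:p` element of the "bad" cell in the question, this should leave a single run whose text is `[[bad place holder]]`. The sketch is deliberately conservative: runs that contain tabs, breaks, or fields are left untouched.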

I have found a solution: the Open XML PowerTools Markup Simplifier.

I followed the steps described at http://ericwhite.com/blog/2011/03/09/getting-started-with-open-xml-powertools-markup-simplifier/, but it didn't work 1:1 (maybe because it is now version 2.2 of Power Tools?). So, I compiled PowerTools 2.2 in "Release" mode and made a reference to the OpenXmlPowerTools.dll in my TestMarkupSimplifier.csproj. In the Program.cs I only changed the path to my DOCX file. I ran the program once and my document seems to be fairly clean now.

Code quoted from Eric's blog in the link above:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using OpenXmlPowerTools;
using DocumentFormat.OpenXml.Packaging;

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
        {
            SimplifyMarkupSettings settings = new SimplifyMarkupSettings
            {
                RemoveComments = true,
                RemoveContentControls = true,
                RemoveEndAndFootNotes = true,
                RemoveFieldCodes = false,
                RemoveLastRenderedPageBreak = true,
                RemovePermissions = true,
                RemoveProof = true,
                RemoveRsidInfo = true,
                RemoveSmartTags = true,
                RemoveSoftHyphens = true,
                ReplaceTabsWithSpaces = true,
            };
            MarkupSimplifier.SimplifyMarkup(doc, settings);
        }
    }
}
Wira answered 14/10, 2011 at 13:16 Comment(0)

You need to get rid of the Rsid information. According to this page Rsid information

enables merging of two documents that have forked.

You need to install OpenXmlPowerTools in order to run the sample code below. The easiest way to do that is to run the following in the Package Manager Console:

Install-Package OpenXmlPowerTools

Then you will be all set to run the following code. (This assumes that you already have a "Test.docx" file added to your project. If you are using Visual Studio, make sure that a copy of the file is in either the Debug or the Release folder, depending on your build mode.)

// Sample code to remove Rsid information from a "Test.docx" document.
// Requires: using DocumentFormat.OpenXml.Packaging; using OpenXmlPowerTools;

using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
{
    SimplifyMarkupSettings settings = new SimplifyMarkupSettings
    {
        RemoveRsidInfo = true
    };
    MarkupSimplifier.SimplifyMarkup(doc, settings);
}

This will remove Rsid information that can get in the way when manipulating Word files.

Khamsin answered 4/1, 2016 at 19:57 Comment(0)

I do not know of a way to clean up the XML, but I've always used #placeholder for my placeholder text, and that seems to stay in one run more reliably than any other placeholder text I've tried in the past. It seems that the longer the placeholder text, the more likely it is to be split into multiple runs.

Erastianism answered 13/10, 2011 at 11:21 Comment(1)
That didn't work for me. It just reverted my manual changes... Thanks anyway. - Wira
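If placeholders keep getting split no matter how they are typed, another option is to make the replacement step tolerant of split runs instead of cleaning the document first. The following is only a sketch under stated assumptions, not something proposed in the thread: it works on the raw WordprocessingML with `System.Xml.Linq`, joins the text of all runs in a paragraph, performs the replacement on the joined string, and writes the result into the first `w:t`. The names `PlaceholderReplacer` and `Replace` are hypothetical, and the whole paragraph text ends up with the formatting of its first run, which is acceptable only when all runs are formatted alike (as in the question).

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class PlaceholderReplacer
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces 'placeholder' with 'value' in every w:p at or below 'root',
    // even when the placeholder is spread over several runs.
    // Returns the number of paragraphs that were changed.
    public static int Replace(XElement root, string placeholder, string value)
    {
        int changed = 0;
        // Snapshot the paragraphs so the tree is not modified while enumerating it.
        foreach (XElement paragraph in root.DescendantsAndSelf(W + "p").ToList())
        {
            List<XElement> texts =
                paragraph.Elements(W + "r").Elements(W + "t").ToList();
            if (texts.Count == 0) continue;

            string joined = string.Concat(texts.Select(t => t.Value));
            if (!joined.Contains(placeholder)) continue;

            // Put the whole replaced text into the first w:t ...
            texts[0].Value = joined.Replace(placeholder, value);
            texts[0].SetAttributeValue(XNamespace.Xml + "space", "preserve");
            // ... and empty the remaining ones so nothing is duplicated.
            foreach (XElement t in texts.Skip(1)) t.Value = string.Empty;
            changed++;
        }
        return changed;
    }
}
```

The emptied runs and any `w:proofErr` markers stay in the paragraph; they are harmless for rendering, but a cleanup pass can still remove them if a tidy XML file matters.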

For those looking for a manual, non-programmatic solution:

http://www.translationtribulations.com/2010/06/cleaning-up-superfluous-tags-in-docx.html

I have verified that the free trial of memoQ 2014 can indeed be used as a clunky workaround for cleaning up Word's spelling tags.

I am still looking for an easier tool that works out of the box.

Lauder answered 6/10, 2014 at 12:9 Comment(0)
