Open XML - find and replace multiple placeholders in document template [duplicate]
I know there are many posts on SO about this topic, but none seems to treat this particular issue. I'm trying to make a small generic document generator POC. I'm using Open XML.

The code goes like this:

    private static void ReplacePlaceholders<T>(string templateDocumentPath, T templateObject)
        where T : class
    {
        using (var templateDocument = WordprocessingDocument.Open(templateDocumentPath, true))
        {
            string templateDocumentText = null;
            using (var streamReader = new StreamReader(templateDocument.MainDocumentPart.GetStream()))
            {
                templateDocumentText = streamReader.ReadToEnd();
            }

            var props = templateObject.GetType().GetProperties();
            foreach (var prop in props)
            {
                var regexText = new Regex($"{prop.Name}");
                templateDocumentText =
                    regexText.Replace(templateDocumentText, prop.GetValue(templateObject).ToString());
            }

            using var streamWriter = new StreamWriter(templateDocument.MainDocumentPart.GetStream(FileMode.Create));
            streamWriter.Write(templateDocumentText);
        }
    }

The code works as intended, except for the following problem:


The text returned by StreamReader.ReadToEnd() has my placeholders split between tags, so my Replace call only replaces the words that don't get split.

In this case, my code will search for the word "Firstname" but will find "irstname" instead, so it won't replace it.

Is there any way to scan the whole .docx word by word and replace them?


(edit) A partial solution / workaround I found: I noticed that you have to write the placeholder in the .docx in one go (without re-editing it). For example, if I write "firstname", then come back and modify it to "Firstname", Word will split it into "F" and "irstname". Without editing, it stays in one piece.

Mourning answered 12/12, 2019 at 11:32 Comment(7)
Hello Cindy Meister, and thank you. Unfortunately not really, the reason why, is explained by Thomas Barnekow in the comments: "This answer is not correct if you want this to work with documents produced by or edited with Microsoft Word, for example. While the standard allows a w:r element (Run instance) to contain more than one w:t element (Text instance), a w:r element typically contains at most one w:t element. Thus, if a text is split across multiple w:t elements, it is most likely also split across multiple w:r elements. Have a look at markup produced by Microsoft Word to confirm this." - Mourning
@petelids, this answer has the same conceptual flaw as the one linked by Cindy. - Eldoneldora
@ThomasBarnekow I don't believe it does. It's looking at paragraphs, not runs. - Methylene
@petelids, yes, you are right, it is not the same conceptual flaw. However, it is still flawed, because the answer ignores run formatting (w:rPr), symbols (w:sym), fields (e.g., REF), and content controls (w:sdt), for example. - Eldoneldora
@ThomasBarnekow - The answer is not supposed to be production code. It says towards the end "The only downside with the above approach is that any styles you may have had will be lost.". Granted having "the only" in that sentence is perhaps not the most accurate (so I'll edit it) but I think that points to the fact you will lose some things. It does however find the text you are looking for whereas the one linked by Cindy does not. - Methylene
@petelids, I agree that your answer is better than the one linked by Cindy because your approach works in more use cases, e.g., for paragraphs that only contain text runs without any run-specific formatting. However, many real-life Word documents contain run-specific formatting or other elements that will get lost in your approach. - Eldoneldora
To whoever closed this question and linked the other "answers": Those answers are not correct or at least have significant limitations in practice. What is the best way to deal with this on stackoverflow.com? - Eldoneldora

TLDR

In short, the solution to your problem is to use the OpenXmlRegex utility class of the Open-Xml-PowerTools as demonstrated in the unit test further below.

WHY?

Using Open XML, you can represent the same text in multiple ways. If Microsoft Word is involved in creating that Open XML markup, the edits made to produce that text will play an important part. This is because Word keeps track of which edits were made in which editing session. So, for example, the w:p (Paragraph) elements shown in the following extreme scenarios represent precisely the same text. And anything between those two examples is possible, so any real solution must be able to deal with that.

Extreme Scenario 1: Single w:r and w:t Element

The following markup is nice and easy:

<w:p>
  <w:r>
    <w:t>Firstname</w:t>
  </w:r>
</w:p>

Extreme Scenario 2: Single-Character w:r and w:t Elements

While you typically won't find the following markup, it represents the theoretical extreme in which each character has its own w:r and w:t element.

<w:p>
  <w:r><w:t>F</w:t></w:r>
  <w:r><w:t>i</w:t></w:r>
  <w:r><w:t>r</w:t></w:r>
  <w:r><w:t>s</w:t></w:r>
  <w:r><w:t>t</w:t></w:r>
  <w:r><w:t>n</w:t></w:r>
  <w:r><w:t>a</w:t></w:r>
  <w:r><w:t>m</w:t></w:r>
  <w:r><w:t>e</w:t></w:r>
</w:p>

Why did I use this extreme example if it does not occur in practice, you might ask? The answer is that it plays an essential role in the solution in case you want to roll your own.
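
To make the difference between the two scenarios tangible, here is a minimal sketch that uses only LINQ to XML (no Open XML SDK). The names SplitRunDemo, Ns, CreateParagraph, FoundInRawMarkup, and FoundInVisibleText are made up for this illustration. It assumes that searching the serialized markup is roughly what the code in the question does, and it compares that with searching the concatenated w:t values.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class SplitRunDemo
{
    // Hypothetical helper namespace field; this is the WordprocessingML main namespace.
    private static readonly XNamespace Ns =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds a w:p with one w:r/w:t pair per given string.
    public static XElement CreateParagraph(params string[] runTexts) =>
        new XElement(Ns + "p",
            new XAttribute(XNamespace.Xmlns + "w", Ns.NamespaceName),
            runTexts.Select(text =>
                new XElement(Ns + "r", new XElement(Ns + "t", text))));

    // Searches the serialized markup, tags included.
    public static bool FoundInRawMarkup(XElement paragraph, string placeholder) =>
        Regex.IsMatch(
            paragraph.ToString(SaveOptions.DisableFormatting),
            Regex.Escape(placeholder));

    // Searches the text a reader would see, i.e., the concatenated w:t values.
    public static bool FoundInVisibleText(XElement paragraph, string placeholder) =>
        Regex.IsMatch(
            string.Concat(paragraph.Descendants(Ns + "t").Select(t => t.Value)),
            Regex.Escape(placeholder));
}
```

For SplitRunDemo.CreateParagraph("F", "irstname"), the raw-markup search does not find "Firstname" because closing and opening tags sit between "F" and "irstname", whereas the visible-text search does. Finding the text is only half the job, though: the concatenated string no longer tells you which runs, and therefore which formatting, the match belongs to.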

HOW TO ROLL YOUR OWN?

To do it right, you must:

  1. transform the runs (w:r) of your paragraph (w:p) into single-character runs (i.e., w:r elements with one single-character w:t or one w:sym each), retaining the run properties (w:rPr);
  2. perform the search-and-replace operation on those single-character runs (using a few other tricks); and
  3. considering the potentially different run properties (w:rPr) of the runs resulting from the search-and-replace action, transform such resulting runs back into the fewest number of "coalesced" runs required to represent the text and its formatting.

When replacing text, you should not lose or alter the formatting of the text that is unaffected by your replacement. You should also not remove unaffected fields or content controls (w:sdt). Ah, and by the way, don't forget revision markup such as w:ins and w:del ...
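
The three steps can be sketched with plain LINQ to XML as follows. This is a deliberately simplified illustration and not the Open-Xml-PowerTools implementation: NaiveRunReplacer, Ns, and CreateRun are hypothetical names, and the sketch assumes that every text run is a direct child of the paragraph and contains nothing but an optional w:rPr and a single w:t. Non-text runs such as w:sym are left untouched and stand in as U+FFFC during matching so they cannot become part of a match. Fields, content controls, and revision markup are not handled at all.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class NaiveRunReplacer
{
    private static readonly XNamespace Ns =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static void Replace(XElement paragraph, Regex regex, string replacement)
    {
        // Step 1: split each text run into single-character runs,
        // each carrying a copy of the original run properties.
        foreach (XElement run in paragraph.Elements(Ns + "r").ToList())
        {
            XElement t = run.Element(Ns + "t");
            if (t == null) continue; // leave w:sym and other non-text runs alone
            XElement rPr = run.Element(Ns + "rPr");
            run.ReplaceWith(t.Value.Select(c => CreateRun(rPr, c.ToString())).ToList());
        }

        // Step 2: match against one character per run, then swap the matched
        // runs for a single run that inherits the first matched run's properties.
        List<XElement> runs = paragraph.Elements(Ns + "r").ToList();
        string allText = string.Concat(
            runs.Select(r => r.Element(Ns + "t")?.Value ?? "\uFFFC"));
        foreach (Match match in regex.Matches(allText).Cast<Match>().Reverse())
        {
            if (match.Length == 0) continue;
            List<XElement> matched = runs.GetRange(match.Index, match.Length);
            matched[0].ReplaceWith(
                CreateRun(matched[0].Element(Ns + "rPr"), replacement));
            matched.Skip(1).Remove();
        }

        // Step 3: coalesce directly adjacent text runs with identical properties.
        XElement previous = null;
        foreach (XElement run in paragraph.Elements(Ns + "r").ToList())
        {
            XElement t = run.Element(Ns + "t");
            if (t != null && previous != null && run.PreviousNode == previous &&
                XNode.DeepEquals(previous.Element(Ns + "rPr"), run.Element(Ns + "rPr")))
            {
                previous.Element(Ns + "t").Value += t.Value;
                run.Remove();
            }
            else
            {
                previous = t != null ? run : null;
            }
        }
    }

    // Creates a text run; xml:space="preserve" is always set to keep whitespace.
    private static XElement CreateRun(XElement rPr, string text) =>
        new XElement(Ns + "r",
            rPr == null ? null : new XElement(rPr),
            new XElement(Ns + "t",
                new XAttribute(XNamespace.Xml + "space", "preserve"),
                text));
}
```

Processing the matches from last to first keeps the earlier indexes valid while runs are being removed. Even in this reduced form, the amount of bookkeeping hints at why a complete implementation is considerably more involved.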

WHY NOT ROLL YOUR OWN?

The good news is that you don't have to roll your own. The OpenXmlRegex utility class of Eric White's Open-Xml-PowerTools implements the above algorithm (and more). I've successfully used it in large-scale RFP and contracting scenarios and also contributed back to it.

HOW TO USE THE OPEN-XML-POWERTOOLS?

In this section, I'm going to demonstrate how to use the Open-Xml-PowerTools to replace the placeholder text "Firstname" (as in the question) with various first names (using "Bernie" in the sample output document).

Sample Input Document

Let's first look at the following sample document, which is created by the unit test shown a little later. Note that we have formatted runs and a symbol. As in the question, the placeholder "Firstname" is split into two runs, i.e., "F" and "irstname".

<?xml version="1.0" encoding="utf-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t xml:space="preserve">Hello </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>F</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>irstname</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:r>
        <w:sym w:font="Wingdings" w:char="F04A" />
      </w:r>
    </w:p>
  </w:body>
</w:document>

Desired Output Document

The following is the document resulting from replacing "Firstname" with "Bernie" if you do it right. Note that the formatting is retained and that we did not lose our symbol.

<?xml version="1.0" encoding="utf-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t xml:space="preserve">Hello </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>Bernie</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:r>
        <w:sym w:font="Wingdings" w:char="F04A" />
      </w:r>
    </w:p>
  </w:body>
</w:document>

Sample Usage

Next, here's a full unit test that demonstrates how to use the OpenXmlRegex.Replace() method, noting that the example only shows one of the multiple overloads. The unit test also demonstrates that this works:

  • regardless of how the placeholder (e.g., "Firstname") is split across one or more runs;
  • while retaining the formatting of the placeholder;
  • without losing the formatting of other runs; and
  • without losing symbols (or any other markup such as fields or content controls).

[Theory]
[InlineData("1 Run", "Firstname", new[] { "Firstname" }, "Albert")]
[InlineData("2 Runs", "Firstname", new[] { "F", "irstname" }, "Bernie")]
[InlineData("9 Runs", "Firstname", new[] { "F", "i", "r", "s", "t", "n", "a", "m", "e" }, "Charly")]
public void Replace_PlaceholderInOneOrMoreRuns_SuccessfullyReplaced(
    string example,
    string propName,
    IEnumerable<string> runTexts,
    string replacement)
{
    // Create a test WordprocessingDocument on a MemoryStream.
    using MemoryStream stream = CreateWordprocessingDocument(runTexts);

    // Save the Word document before replacing the placeholder.
    // You can use this to inspect the input Word document.
    File.WriteAllBytes($"{example} before Replacing.docx", stream.ToArray());

    // Replace the placeholder identified by propName with the replacement text.
    using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, true))
    {
        // Read the root element, a w:document in this case.
        // Note that GetXElement() is a shortcut for GetXDocument().Root.
        // This caches the root element and we can later write it back
        // to the main document part, using the PutXDocument() method.
        XElement document = wordDocument.MainDocumentPart.GetXElement();

        // Specify the parameters of the OpenXmlRegex.Replace() method,
        // noting that the replacement is given as a parameter.
        IEnumerable<XElement> content = document.Descendants(W.p);
        var regex = new Regex(propName);

        // Perform the replacement, thereby modifying the root element.
        OpenXmlRegex.Replace(content, regex, replacement, null);

        // Write the changed root element back to the main document part.
        wordDocument.MainDocumentPart.PutXDocument();
    }

    // Assert that we have done it right.
    AssertReplacementWasSuccessful(stream, replacement);

    // Save the Word document after having replaced the placeholder.
    // You can use this to inspect the output Word document.
    File.WriteAllBytes($"{example} after Replacing.docx", stream.ToArray());
}

private static MemoryStream CreateWordprocessingDocument(IEnumerable<string> runTexts)
{
    var stream = new MemoryStream();
    const WordprocessingDocumentType type = WordprocessingDocumentType.Document;

    using (WordprocessingDocument wordDocument = WordprocessingDocument.Create(stream, type))
    {
        MainDocumentPart mainDocumentPart = wordDocument.AddMainDocumentPart();
        mainDocumentPart.PutXDocument(new XDocument(CreateDocument(runTexts)));
    }

    return stream;
}

private static XElement CreateDocument(IEnumerable<string> runTexts)
{
    // Produce a w:document with a single w:p that contains:
    // (1) one italic run with some lead-in, i.e., "Hello " in this example;
    // (2) one or more bold runs for the placeholder, which might or might not be split;
    // (3) one run with just a space; and
    // (4) one run with a symbol (i.e., a Wingdings smiley face).
    return new XElement(W.document,
        new XAttribute(XNamespace.Xmlns + "w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"),
        new XElement(W.body,
            new XElement(W.p,
                new XElement(W.r,
                    new XElement(W.rPr,
                        new XElement(W.i)),
                    new XElement(W.t,
                        new XAttribute(XNamespace.Xml + "space", "preserve"),
                        "Hello ")),
                runTexts.Select(rt =>
                    new XElement(W.r,
                        new XElement(W.rPr,
                            new XElement(W.b)),
                        new XElement(W.t, rt))),
                new XElement(W.r,
                    new XElement(W.t,
                        new XAttribute(XNamespace.Xml + "space", "preserve"),
                        " ")),
                new XElement(W.r,
                    new XElement(W.sym,
                        new XAttribute(W.font, "Wingdings"),
                        new XAttribute(W._char, "F04A"))))));
}

private static void AssertReplacementWasSuccessful(MemoryStream stream, string replacement)
{
    using WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false);

    XElement document = wordDocument.MainDocumentPart.GetXElement();
    XElement paragraph = document.Descendants(W.p).Single();
    List<XElement> runs = paragraph.Elements(W.r).ToList();

    // We have the expected number of runs, i.e., the lead-in, the first name,
    // a space character, and the symbol.
    Assert.Equal(4, runs.Count);

    // We still have the lead-in "Hello " and it is still formatted in italics.
    Assert.True(runs[0].Value == "Hello " && runs[0].Elements(W.rPr).Elements(W.i).Any());

    // We have successfully replaced our "Firstname" placeholder and the
    // concrete first name is formatted in bold, exactly like the placeholder.
    Assert.True(runs[1].Value == replacement && runs[1].Elements(W.rPr).Elements(W.b).Any());

    // We still have the space between the first name and the symbol and it
    // is unformatted.
    Assert.True(runs[2].Value == " " && !runs[2].Elements(W.rPr).Any());

    // Finally, we still have our smiley face symbol run.
    Assert.True(IsSymbolRun(runs[3], "Wingdings", "F04A"));
}

private static bool IsSymbolRun(XElement run, string fontValue, string charValue)
{
    XElement sym = run.Elements(W.sym).FirstOrDefault();
    if (sym == null) return false;

    return (string) sym.Attribute(W.font) == fontValue &&
           (string) sym.Attribute(W._char) == charValue;
}
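
Applied to the generic method from the question, the same calls could be combined roughly as follows. This is a sketch under assumptions rather than tested production code: TemplateFiller is a made-up class name, the NuGet packages for the Open XML SDK and the Open-Xml-PowerTools are assumed to be referenced, and properties whose value is null are simply skipped. It uses GetXDocument().Root, for which GetXElement() is described as a shortcut above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

public static class TemplateFiller
{
    public static void ReplacePlaceholders<T>(string templateDocumentPath, T templateObject)
        where T : class
    {
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(templateDocumentPath, true))
        {
            // Read the cached root element (w:document) of the main document part.
            XElement document = wordDocument.MainDocumentPart.GetXDocument().Root;

            foreach (var prop in templateObject.GetType().GetProperties())
            {
                // Assumption: a property without a value leaves its placeholder as is.
                object value = prop.GetValue(templateObject);
                if (value == null) continue;

                // Escape the property name so that it is matched literally.
                var regex = new Regex(Regex.Escape(prop.Name));

                // Re-query the paragraphs for each property because the
                // previous replacement modified the tree.
                List<XElement> content = document.Descendants(W.p).ToList();
                OpenXmlRegex.Replace(content, regex, value.ToString(), null);
            }

            // Write the changed root element back to the main document part.
            wordDocument.MainDocumentPart.PutXDocument();
        }
    }
}
```

Like the code in the question, this matches bare property names, so a property called "Name" would also match inside "Firstname". Wrapping placeholders in delimiters in the template and in the regular expression would be one way to avoid that.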

WHY IS INNERTEXT NOT THE SOLUTION?

While it might be tempting to use the InnerText property of the Paragraph class (or other subclasses of the OpenXmlElement class), the problem is that you will be ignoring any markup other than text (w:t) elements. For example, if your paragraph contains symbols (w:sym elements, e.g., the smiley face used in the example above), those will be lost because they are not considered by the InnerText property. The following unit test demonstrates that:

[Theory]
[InlineData("Hello Firstname ", new[] { "Firstname" })]
[InlineData("Hello Firstname ", new[] { "F", "irstname" })]
[InlineData("Hello Firstname ", new[] { "F", "i", "r", "s", "t", "n", "a", "m", "e" })]
public void InnerText_ParagraphWithSymbols_SymbolIgnored(string expectedInnerText, IEnumerable<string> runTexts)
{
    // Create Word document with smiley face symbol at the end.
    using MemoryStream stream = CreateWordprocessingDocument(runTexts);
    using WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false);

    Document document = wordDocument.MainDocumentPart.Document;
    Paragraph paragraph = document.Descendants<Paragraph>().Single();

    string innerText = paragraph.InnerText;

    // Note that the innerText does not contain the smiley face symbol.
    Assert.Equal(expectedInnerText, innerText);
}

Note that you might not need to consider all of the above in simple use cases. But if you must deal with real-life documents or the markup changes made by Microsoft Word, chances are you can't ignore the complexity. And wait until you need to deal with revision markup ...

As always, the full source code can be found in my CodeSnippets GitHub repository. Look for the OpenXmlRegexTests class.

Eldoneldora answered 13/12, 2019 at 19:16 Comment(6)
Dang! Very helpful, saving me a bunch of time as I get going. - Animalist
Sorry, but this is way too much info for something that should be really simple to do. Open XML may be a powerful tool, and I still need to use it, but I wish I did not have to read all that extra information just to make one function work. Just give me the essentials. I do not need to know that Word is made of markup, etc.; just get to the point of which code to use. - Pavel
@ThunderSpark, you could have stopped reading right after the TLDR. You would not understand why you should be doing what it suggests. But you would know what to do. - Eldoneldora
Is there another method, as I cannot use .NET-related tools? - Mellicent
@Mahan, this depends on your tech stack. If you want to process Open XML markup contained in a .docx file, you'll need to find a tool or library that lets you do that on your tech stack. - Eldoneldora
I wrote the library (not completely, but the parts I needed). Now I'm using my TOoxml class with joy! - Mellicent
