Open XML - find and replace multiple placeholders in document template [duplicate]

I know there are many posts on SO about this topic, but none seems to treat this particular issue. I'm trying to make a small generic document generator POC. I'm using Open XML.

The code goes like this:

   private static void ReplacePlaceholders<T>(string templateDocumentPath, T templateObject)
        where T : class
    {

        using (var templateDocument = WordprocessingDocument.Open(templateDocumentPath, true))
        {
            string templateDocumentText = null;
            using (var streamReader = new StreamReader(templateDocument.MainDocumentPart.GetStream()))
            {
                templateDocumentText = streamReader.ReadToEnd();
            }

            var props = templateObject.GetType().GetProperties();
            foreach (var prop in props)
            {
                var regexText = new Regex($"{prop.Name}");
                templateDocumentText =
                    regexText.Replace(templateDocumentText, prop.GetValue(templateObject).ToString());
            }

            using var streamWriter = new StreamWriter(templateDocument.MainDocumentPart.GetStream(FileMode.Create));
                streamWriter.Write(templateDocumentText);
        }
    }

The code works as intended. Problem is the following:

StreamReader.ReadToEnd() splits my placeholders between tags, so my Replace method, replaces only the words which won't get split.

In this case, my code will search for the word "Firstname" but will find "irstname" instead, so it won't replace it.

Is there any way to scan the whole .docx word by word and replace them?

(edit) A partial solution / workaround I found: - I noticed that you have to write the placeholder in the .docx at once (without reediting it). For example if I write "firstname", then come back and modify it to "Firstname" it will split the word into "F" "irstname". Without editng it will be unsplitted.

TLDR

In short words, the solution to your problem is to use the OpenXmlRegex utility class of the Open-Xml-PowerTools as demonstrated in the unit test further below.

WHY?

Using Open XML, you can represent the same text in multiple ways. If Microsoft Word is involved in creating that Open XML markup, the edits made to produce that text will play an important part. This is because Word keeps track of which edits were made in which editing session. So, for example, the w:p (Paragraph) elements shown in the following extreme scenarios represent precisely the same text. And anything between those two examples is possible, so any real solution must be able to deal with that.

Extreme Scenario 1: Single `w:r` and `w:t` Element

The following markup is nice and easy:

<w:p>
  <w:r>
    <w:t>Firstname</w:t>
  </w:r>
</w:p>

Extreme Scenario 2: Single-Character `w:r` and `w:t` Elements

While you typically won't find the following markup, it represents the theoretical extreme in which each character has its own w:r and w:t element.

<w:p>
  <w:r>
    <w:t>F</w:t>
    <w:t>i</w:t>
    <w:t>r</w:t>
    <w:t>s</w:t>
    <w:t>t</w:t>
    <w:t>n</w:t>
    <w:t>a</w:t>
    <w:t>m</w:t>
    <w:t>e</w:t>
  </w:r>
</w:p>

Why did I use this extreme example if it does not occur in practice, you might ask? The answer is that it plays an essential role in the solution in case you want to roll your own.

HOW TO ROLL YOUR OWN?

To do it right, you must:

transform the runs (w:r) of your paragraph (w:p) into single-character runs (i.e., w:r elements with one single-character w:t or one w:sym each), retaining the run properties (w:rPr);
perform the search-and-replace operation on those single-character runs (using a few other tricks); and
considering the potentially different run properties (w:rPr) of the runs resulting from the search-and-replace action, transform such resulting runs back into the fewest number of "coalesced" runs required to represent the text and its formatting.

When replacing text, you should not lose or alter the formatting of the text that is unaffected by your replacement. You should also not remove unaffected fields or content controls (w:sdt). Ah, and by the way, don't forget revision markup such as w:ins and w:del ...

WHY NOT ROLL YOUR OWN?

The good news is that you don't have to roll your own. The OpenXmlRegex utility class of Eric White's Open-Xml-PowerTools implements the above algorithm (and more). I've successfully used it in large-scale RFP and contracting scenarios and also contributed back to it.

HOW TO USE THE OPEN-XML-POWERTOOLS?

In this section, I'm going to demonstrate how to use the Open-Xml-PowerTools to replace the placeholder text "Firstname" (as in the question) with various first names (using "Bernie" in the sample output document).

Sample Input Document

Let's first look at the following sample document, which is created by the unit test shown a little later. Note that we have formatted runs and a symbol. As in the question, the placeholder "Firstname" is split into two runs, i.e., "F" and "irstname".

<?xml version="1.0" encoding="utf-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t xml:space="preserve">Hello </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>F</w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>irstname</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:r>
        <w:sym w:font="Wingdings" w:char="F04A" />
      </w:r>
    </w:p>
  </w:body>
</w:document>

Desired Output Document

The following is the document resulting from replacing "Firstname" with "Bernie" if you do it right. Note that the formatting is retained and that we did not lose our symbol.

<?xml version="1.0" encoding="utf-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:i />
        </w:rPr>
        <w:t xml:space="preserve">Hello </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b />
        </w:rPr>
        <w:t>Bernie</w:t>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> </w:t>
      </w:r>
      <w:r>
        <w:sym w:font="Wingdings" w:char="F04A" />
      </w:r>
    </w:p>
  </w:body>
</w:document>

Sample Usage

Next, here's a full unit test that demonstrates how to use the OpenXmlRegex.Replace() method, noting that the example only shows one of the multiple overloads. The unit test also demonstrates that this works:

regardless of how the placeholder (e.g., "Firstname") is split across one or more runs;
while retaining the formatting of the placeholder;
without losing the formatting of other runs; and
without losing symbols (or any other markup such as fields or content controls).

[Theory]
[InlineData("1 Run", "Firstname", new[] { "Firstname" }, "Albert")]
[InlineData("2 Runs", "Firstname", new[] { "F", "irstname" }, "Bernie")]
[InlineData("9 Runs", "Firstname", new[] { "F", "i", "r", "s", "t", "n", "a", "m", "e" }, "Charly")]
public void Replace_PlaceholderInOneOrMoreRuns_SuccessfullyReplaced(
    string example,
    string propName,
    IEnumerable<string> runTexts,
    string replacement)
{
    // Create a test WordprocessingDocument on a MemoryStream.
    using MemoryStream stream = CreateWordprocessingDocument(runTexts);

    // Save the Word document before replacing the placeholder.
    // You can use this to inspect the input Word document.
    File.WriteAllBytes($"{example} before Replacing.docx", stream.ToArray());

    // Replace the placeholder identified by propName with the replacement text.
    using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, true))
    {
        // Read the root element, a w:document in this case.
        // Note that GetXElement() is a shortcut for GetXDocument().Root.
        // This caches the root element and we can later write it back
        // to the main document part, using the PutXDocument() method.
        XElement document = wordDocument.MainDocumentPart.GetXElement();

        // Specify the parameters of the OpenXmlRegex.Replace() method,
        // noting that the replacement is given as a parameter.
        IEnumerable<XElement> content = document.Descendants(W.p);
        var regex = new Regex(propName);

        // Perform the replacement, thereby modifying the root element.
        OpenXmlRegex.Replace(content, regex, replacement, null);

        // Write the changed root element back to the main document part.
        wordDocument.MainDocumentPart.PutXDocument();
    }

    // Assert that we have done it right.
    AssertReplacementWasSuccessful(stream, replacement);

    // Save the Word document after having replaced the placeholder.
    // You can use this to inspect the output Word document.
    File.WriteAllBytes($"{example} after Replacing.docx", stream.ToArray());
}

private static MemoryStream CreateWordprocessingDocument(IEnumerable<string> runTexts)
{
    var stream = new MemoryStream();
    const WordprocessingDocumentType type = WordprocessingDocumentType.Document;

    using (WordprocessingDocument wordDocument = WordprocessingDocument.Create(stream, type))
    {
        MainDocumentPart mainDocumentPart = wordDocument.AddMainDocumentPart();
        mainDocumentPart.PutXDocument(new XDocument(CreateDocument(runTexts)));
    }

    return stream;
}

private static XElement CreateDocument(IEnumerable<string> runTexts)
{
    // Produce a w:document with a single w:p that contains:
    // (1) one italic run with some lead-in, i.e., "Hello " in this example;
    // (2) one or more bold runs for the placeholder, which might or might not be split;
    // (3) one run with just a space; and
    // (4) one run with a symbol (i.e., a Wingdings smiley face).
    return new XElement(W.document,
        new XAttribute(XNamespace.Xmlns + "w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"),
        new XElement(W.body,
            new XElement(W.p,
                new XElement(W.r,
                    new XElement(W.rPr,
                        new XElement(W.i)),
                    new XElement(W.t,
                        new XAttribute(XNamespace.Xml + "space", "preserve"),
                        "Hello ")),
                runTexts.Select(rt =>
                    new XElement(W.r,
                        new XElement(W.rPr,
                            new XElement(W.b)),
                        new XElement(W.t, rt))),
                new XElement(W.r,
                    new XElement(W.t,
                        new XAttribute(XNamespace.Xml + "space", "preserve"),
                        " ")),
                new XElement(W.r,
                    new XElement(W.sym,
                        new XAttribute(W.font, "Wingdings"),
                        new XAttribute(W._char, "F04A"))))));
}

private static void AssertReplacementWasSuccessful(MemoryStream stream, string replacement)
{
    using WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false);

    XElement document = wordDocument.MainDocumentPart.GetXElement();
    XElement paragraph = document.Descendants(W.p).Single();
    List<XElement> runs = paragraph.Elements(W.r).ToList();

    // We have the expected number of runs, i.e., the lead-in, the first name,
    // a space character, and the symbol.
    Assert.Equal(4, runs.Count);

    // We still have the lead-in "Hello " and it is still formatted in italics.
    Assert.True(runs[0].Value == "Hello " && runs[0].Elements(W.rPr).Elements(W.i).Any());

    // We have successfully replaced our "Firstname" placeholder and the
    // concrete first name is formatted in bold, exactly like the placeholder.
    Assert.True(runs[1].Value == replacement && runs[1].Elements(W.rPr).Elements(W.b).Any());

    // We still have the space between the first name and the symbol and it
    // is unformatted.
    Assert.True(runs[2].Value == " " && !runs[2].Elements(W.rPr).Any());

    // Finally, we still have our smiley face symbol run.
    Assert.True(IsSymbolRun(runs[3], "Wingdings", "F04A"));
}

private static bool IsSymbolRun(XElement run, string fontValue, string charValue)
{
    XElement sym = run.Elements(W.sym).FirstOrDefault();
    if (sym == null) return false;

    return (string) sym.Attribute(W.font) == fontValue &&
           (string) sym.Attribute(W._char) == charValue;
}

WHY IS INNERTEXT NOT THE SOLUTION?

While it might be tempting to use the InnerText property of the Paragraph class (or other subclasses of the OpenXmlElement class), the problem is that you will be ignoring any non-text (w:t) markup. For example, if your paragraph contains symbols (w:sym elements, e.g., the smiley face used in the example above), those will be lost because they are not considered by the InnerText property. The following unit test demonstrates that:

[Theory]
[InlineData("Hello Firstname ", new[] { "Firstname" })]
[InlineData("Hello Firstname ", new[] { "F", "irstname" })]
[InlineData("Hello Firstname ", new[] { "F", "i", "r", "s", "t", "n", "a", "m", "e" })]
public void InnerText_ParagraphWithSymbols_SymbolIgnored(string expectedInnerText, IEnumerable<string> runTexts)
{
    // Create Word document with smiley face symbol at the end.
    using MemoryStream stream = CreateWordprocessingDocument(runTexts);
    using WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false);

    Document document = wordDocument.MainDocumentPart.Document;
    Paragraph paragraph = document.Descendants<Paragraph>().Single();

    string innerText = paragraph.InnerText;

    // Note that the innerText does not contain the smiley face symbol.
    Assert.Equal(expectedInnerText, innerText);
}

Note that you might not need to consider all of the above in simple use cases. But if you must deal with real-life documents or the markup changes made by Microsoft Word, chances are you can't ignore the complexity. And wait until you need to deal with revision markup ...

As always, the full source code can be found in my CodeSnippets GitHub repository. Look for the OpenXmlRegexTests class.

TLDR

WHY?

Extreme Scenario 1: Single `w:r` and `w:t` Element

Extreme Scenario 2: Single-Character `w:r` and `w:t` Elements

HOW TO ROLL YOUR OWN?

WHY NOT ROLL YOUR OWN?

HOW TO USE THE OPEN-XML-POWERTOOLS?

Sample Input Document

Desired Output Document

Sample Usage

WHY IS INNERTEXT NOT THE SOLUTION?

Recommended topics

Hot tags

TLDR

WHY?

Extreme Scenario 1: Single w:r and w:t Element

Extreme Scenario 2: Single-Character w:r and w:t Elements

HOW TO ROLL YOUR OWN?

WHY NOT ROLL YOUR OWN?

HOW TO USE THE OPEN-XML-POWERTOOLS?

Sample Input Document

Desired Output Document

Sample Usage

WHY IS INNERTEXT NOT THE SOLUTION?

Recommended topics

Hot tags

Extreme Scenario 1: Single `w:r` and `w:t` Element

Extreme Scenario 2: Single-Character `w:r` and `w:t` Elements