TL;DR
In short, the solution to your problem is to use the OpenXmlRegex utility class of the Open-Xml-PowerTools, as demonstrated in the unit test further below.
WHY?
Using Open XML, you can represent the same text in multiple ways. If Microsoft Word is involved in creating that Open XML markup, the edits made to produce that text will play an important part. This is because Word keeps track of which edits were made in which editing session. So, for example, the w:p (Paragraph) elements shown in the following extreme scenarios represent precisely the same text. And anything between those two examples is possible, so any real solution must be able to deal with that.
Extreme Scenario 1: Single w:r and w:t Element
The following markup is nice and easy:
    <w:p>
      <w:r>
        <w:t>Firstname</w:t>
      </w:r>
    </w:p>
Extreme Scenario 2: Single-Character w:r and w:t Elements
While you typically won't find the following markup, it represents the theoretical extreme in which each character has its own w:r and w:t element.
    <w:p>
      <w:r><w:t>F</w:t></w:r>
      <w:r><w:t>i</w:t></w:r>
      <w:r><w:t>r</w:t></w:r>
      <w:r><w:t>s</w:t></w:r>
      <w:r><w:t>t</w:t></w:r>
      <w:r><w:t>n</w:t></w:r>
      <w:r><w:t>a</w:t></w:r>
      <w:r><w:t>m</w:t></w:r>
      <w:r><w:t>e</w:t></w:r>
    </w:p>
Why did I use this extreme example if it does not occur in practice, you might ask? The answer is that it plays an essential role in the solution in case you want to roll your own.
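To make the difference between the two scenarios concrete, here is a minimal sketch that uses only System.Xml.Linq (it is not part of the Open-Xml-PowerTools). It shows that both extremes yield the same paragraph text, while a naive search that looks at one w:t element at a time finds the placeholder only when it sits in a single run. The RunTextDemo class, its method names, and the WordNs field are hypothetical names chosen for this illustration.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class RunTextDemo
{
    // WordprocessingML main namespace (the "w" prefix in the markup above).
    private static readonly XNamespace WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds a w:p with one unformatted w:r/w:t pair per given text.
    public static XElement CreateParagraph(params string[] runTexts) =>
        new XElement(WordNs + "p",
            runTexts.Select(rt =>
                new XElement(WordNs + "r",
                    new XElement(WordNs + "t", rt))));

    // Concatenates the values of all w:t descendants of the paragraph.
    public static string GetText(XElement paragraph) =>
        string.Concat(paragraph.Descendants(WordNs + "t").Select(t => t.Value));

    // Naive search: is the placeholder fully contained in any single w:t?
    public static bool AnySingleTextElementContains(XElement paragraph, string placeholder) =>
        paragraph.Descendants(WordNs + "t").Any(t => t.Value.Contains(placeholder));
}
```

For CreateParagraph("Firstname"), both GetText and the naive search find the placeholder. For CreateParagraph("F", "irstname"), GetText still returns "Firstname", but AnySingleTextElementContains returns false, which is why a per-element search is not enough.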
HOW TO ROLL YOUR OWN?
To do it right, you must:
- transform the runs (w:r) of your paragraph (w:p) into single-character runs (i.e., w:r elements with one single-character w:t or one w:sym each), retaining the run properties (w:rPr);
- perform the search-and-replace operation on those single-character runs (using a few other tricks); and
- considering the potentially different run properties (w:rPr) of the runs resulting from the search-and-replace action, transform such resulting runs back into the fewest number of "coalesced" runs required to represent the text and its formatting.
When replacing text, you should not lose or alter the formatting of the text that is unaffected by your replacement. You should also not remove unaffected fields or content controls (w:sdt). Ah, and by the way, don't forget revision markup such as w:ins and w:del ...
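The first and third steps of that list can be sketched as follows, under strong simplifying assumptions: every run is a direct child of the paragraph and holds at most one w:t element, and runs with other content (e.g., w:sym) are left untouched. This is an illustration based on System.Xml.Linq only, not the OpenXmlRegex implementation, and it ignores fields, content controls, and revision markup. RunSketch and its members are hypothetical names.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class RunSketch
{
    private static readonly XNamespace WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Step 1: replace every text run with one run per character (UTF-16
    // code unit), copying the original run properties (w:rPr) into each.
    public static void SplitIntoSingleCharacterRuns(XElement paragraph)
    {
        foreach (XElement run in paragraph.Elements(WordNs + "r").ToList())
        {
            XElement text = run.Element(WordNs + "t");
            if (text == null || text.Value.Length == 0) continue; // e.g., w:sym runs

            XElement runProperties = run.Element(WordNs + "rPr");
            var singleCharacterRuns = text.Value
                .Select(c => new XElement(WordNs + "r",
                    runProperties == null ? null : new XElement(runProperties),
                    CreateText(c.ToString())))
                .ToList();

            run.ReplaceWith(singleCharacterRuns);
        }
    }

    // Step 3: merge directly adjacent text runs with identical run properties.
    public static void CoalesceRuns(XElement paragraph)
    {
        XElement previous = null;
        foreach (XElement run in paragraph.Elements(WordNs + "r").ToList())
        {
            bool canMerge = previous != null
                && run.ElementsBeforeSelf().LastOrDefault() == previous
                && IsTextRun(previous) && IsTextRun(run)
                && XNode.DeepEquals(previous.Element(WordNs + "rPr"), run.Element(WordNs + "rPr"));

            if (canMerge)
            {
                // Append this run's text to the previous run and drop this run.
                XElement previousText = previous.Element(WordNs + "t");
                previousText.ReplaceWith(
                    CreateText(previousText.Value + run.Element(WordNs + "t").Value));
                run.Remove();
            }
            else
            {
                previous = run;
            }
        }
    }

    // A "text run" here has a w:t and nothing but w:rPr and w:t children.
    private static bool IsTextRun(XElement run) =>
        run.Element(WordNs + "t") != null &&
        run.Elements().All(e => e.Name == WordNs + "rPr" || e.Name == WordNs + "t");

    // Creates a w:t, adding xml:space="preserve" for leading/trailing whitespace.
    private static XElement CreateText(string value) =>
        new XElement(WordNs + "t",
            value.Length > 0 &&
            (char.IsWhiteSpace(value[0]) || char.IsWhiteSpace(value[value.Length - 1]))
                ? new XAttribute(XNamespace.Xml + "space", "preserve")
                : null,
            value);
}
```

The second step, i.e., the actual search-and-replace across the single-character runs, is deliberately left out. It is where most of the "other tricks" live, which is one more reason to prefer an existing implementation.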
WHY NOT ROLL YOUR OWN?
The good news is that you don't have to roll your own. The OpenXmlRegex utility class of Eric White's Open-Xml-PowerTools implements the above algorithm (and more). I've successfully used it in large-scale RFP and contracting scenarios and also contributed back to it.
HOW TO USE THE OPEN-XML-POWERTOOLS?
In this section, I'm going to demonstrate how to use the Open-Xml-PowerTools to replace the placeholder text "Firstname" (as in the question) with various first names (using "Bernie" in the sample output document).
Sample Input Document
Let's first look at the following sample document, which is created by the unit test shown a little later. Note that we have formatted runs and a symbol. As in the question, the placeholder "Firstname" is split into two runs, i.e., "F" and "irstname".
    <?xml version="1.0" encoding="utf-8"?>
    <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:body>
        <w:p>
          <w:r>
            <w:rPr>
              <w:i />
            </w:rPr>
            <w:t xml:space="preserve">Hello </w:t>
          </w:r>
          <w:r>
            <w:rPr>
              <w:b />
            </w:rPr>
            <w:t>F</w:t>
          </w:r>
          <w:r>
            <w:rPr>
              <w:b />
            </w:rPr>
            <w:t>irstname</w:t>
          </w:r>
          <w:r>
            <w:t xml:space="preserve"> </w:t>
          </w:r>
          <w:r>
            <w:sym w:font="Wingdings" w:char="F04A" />
          </w:r>
        </w:p>
      </w:body>
    </w:document>
Desired Output Document
The following is the document resulting from replacing "Firstname" with "Bernie" if you do it right. Note that the formatting is retained and that we did not lose our symbol.
    <?xml version="1.0" encoding="utf-8"?>
    <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:body>
        <w:p>
          <w:r>
            <w:rPr>
              <w:i />
            </w:rPr>
            <w:t xml:space="preserve">Hello </w:t>
          </w:r>
          <w:r>
            <w:rPr>
              <w:b />
            </w:rPr>
            <w:t>Bernie</w:t>
          </w:r>
          <w:r>
            <w:t xml:space="preserve"> </w:t>
          </w:r>
          <w:r>
            <w:sym w:font="Wingdings" w:char="F04A" />
          </w:r>
        </w:p>
      </w:body>
    </w:document>
Sample Usage
Next, here's a full unit test that demonstrates how to use the OpenXmlRegex.Replace() method, noting that the example only shows one of the multiple overloads. The unit test also demonstrates that this works:
- regardless of how the placeholder (e.g., "Firstname") is split across one or more runs;
- while retaining the formatting of the placeholder;
- without losing the formatting of other runs; and
- without losing symbols (or any other markup such as fields or content controls).
    // Namespaces required by the test methods in this section
    // (the methods are members of an xUnit test class).
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Xml.Linq;
    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Packaging;
    using OpenXmlPowerTools;
    using Xunit;

    [Theory]
    [InlineData("1 Run", "Firstname", new[] { "Firstname" }, "Albert")]
    [InlineData("2 Runs", "Firstname", new[] { "F", "irstname" }, "Bernie")]
    [InlineData("9 Runs", "Firstname", new[] { "F", "i", "r", "s", "t", "n", "a", "m", "e" }, "Charly")]
    public void Replace_PlaceholderInOneOrMoreRuns_SuccessfullyReplaced(
        string example,
        string propName,
        IEnumerable<string> runTexts,
        string replacement)
    {
        // Create a test WordprocessingDocument on a MemoryStream.
        using MemoryStream stream = CreateWordprocessingDocument(runTexts);

        // Save the Word document before replacing the placeholder.
        // You can use this to inspect the input Word document.
        File.WriteAllBytes($"{example} before Replacing.docx", stream.ToArray());

        // Replace the placeholder identified by propName with the replacement text.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, true))
        {
            // Read the root element, a w:document in this case.
            // Note that GetXElement() is a shortcut for GetXDocument().Root.
            // This caches the root element and we can later write it back
            // to the main document part, using the PutXDocument() method.
            XElement document = wordDocument.MainDocumentPart.GetXElement();

            // Specify the parameters of the OpenXmlRegex.Replace() method,
            // noting that the replacement is given as a parameter.
            IEnumerable<XElement> content = document.Descendants(W.p);
            var regex = new Regex(propName);

            // Perform the replacement, thereby modifying the root element.
            OpenXmlRegex.Replace(content, regex, replacement, null);

            // Write the changed root element back to the main document part.
            wordDocument.MainDocumentPart.PutXDocument();
        }

        // Assert that we have done it right.
        AssertReplacementWasSuccessful(stream, replacement);

        // Save the Word document after having replaced the placeholder.
        // You can use this to inspect the output Word document.
        File.WriteAllBytes($"{example} after Replacing.docx", stream.ToArray());
    }

    private static MemoryStream CreateWordprocessingDocument(IEnumerable<string> runTexts)
    {
        var stream = new MemoryStream();
        const WordprocessingDocumentType type = WordprocessingDocumentType.Document;
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Create(stream, type))
        {
            MainDocumentPart mainDocumentPart = wordDocument.AddMainDocumentPart();
            mainDocumentPart.PutXDocument(new XDocument(CreateDocument(runTexts)));
        }
        return stream;
    }

    private static XElement CreateDocument(IEnumerable<string> runTexts)
    {
        // Produce a w:document with a single w:p that contains:
        // (1) one italic run with some lead-in, i.e., "Hello " in this example;
        // (2) one or more bold runs for the placeholder, which might or might not be split;
        // (3) one run with just a space; and
        // (4) one run with a symbol (i.e., a Wingdings smiley face).
        return new XElement(W.document,
            new XAttribute(XNamespace.Xmlns + "w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main"),
            new XElement(W.body,
                new XElement(W.p,
                    new XElement(W.r,
                        new XElement(W.rPr,
                            new XElement(W.i)),
                        new XElement(W.t,
                            new XAttribute(XNamespace.Xml + "space", "preserve"),
                            "Hello ")),
                    runTexts.Select(rt =>
                        new XElement(W.r,
                            new XElement(W.rPr,
                                new XElement(W.b)),
                            new XElement(W.t, rt))),
                    new XElement(W.r,
                        new XElement(W.t,
                            new XAttribute(XNamespace.Xml + "space", "preserve"),
                            " ")),
                    new XElement(W.r,
                        new XElement(W.sym,
                            new XAttribute(W.font, "Wingdings"),
                            new XAttribute(W._char, "F04A"))))));
    }

    private static void AssertReplacementWasSuccessful(MemoryStream stream, string replacement)
    {
        using WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false);
        XElement document = wordDocument.MainDocumentPart.GetXElement();
        XElement paragraph = document.Descendants(W.p).Single();
        List<XElement> runs = paragraph.Elements(W.r).ToList();

        // We have the expected number of runs, i.e., the lead-in, the first name,
        // a space character, and the symbol.
        Assert.Equal(4, runs.Count);

        // We still have the lead-in "Hello " and it is still formatted in italics.
        Assert.True(runs[0].Value == "Hello " && runs[0].Elements(W.rPr).Elements(W.i).Any());

        // We have successfully replaced our "Firstname" placeholder and the
        // concrete first name is formatted in bold, exactly like the placeholder.
        Assert.True(runs[1].Value == replacement && runs[1].Elements(W.rPr).Elements(W.b).Any());

        // We still have the space between the first name and the symbol and it
        // is unformatted.
        Assert.True(runs[2].Value == " " && !runs[2].Elements(W.rPr).Any());

        // Finally, we still have our smiley face symbol run.
        Assert.True(IsSymbolRun(runs[3], "Wingdings", "F04A"));
    }

    private static bool IsSymbolRun(XElement run, string fontValue, string charValue)
    {
        XElement sym = run.Elements(W.sym).FirstOrDefault();
        if (sym == null) return false;
        return (string) sym.Attribute(W.font) == fontValue &&
               (string) sym.Attribute(W._char) == charValue;
    }
WHY IS INNERTEXT NOT THE SOLUTION?
While it might be tempting to use the InnerText property of the Paragraph class (or other subclasses of the OpenXmlElement class), the problem is that you will be ignoring any markup other than text (w:t). For example, if your paragraph contains symbols (w:sym elements, e.g., the smiley face used in the example above), those will be lost because they are not considered by the InnerText property. The following unit test demonstrates that:
    // In addition to the namespaces used by the previous test, this test needs
    // DocumentFormat.OpenXml.Wordprocessing for the Document and Paragraph classes.
    [Theory]
    [InlineData("Hello Firstname ", new[] { "Firstname" })]
    [InlineData("Hello Firstname ", new[] { "F", "irstname" })]
    [InlineData("Hello Firstname ", new[] { "F", "i", "r", "s", "t", "n", "a", "m", "e" })]
    public void InnerText_ParagraphWithSymbols_SymbolIgnored(string expectedInnerText, IEnumerable<string> runTexts)
    {
        // Create Word document with smiley face symbol at the end.
        using MemoryStream stream = CreateWordprocessingDocument(runTexts);
        using WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false);

        Document document = wordDocument.MainDocumentPart.Document;
        Paragraph paragraph = document.Descendants<Paragraph>().Single();
        string innerText = paragraph.InnerText;

        // Note that the innerText does not contain the smiley face symbol.
        Assert.Equal(expectedInnerText, innerText);
    }
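If you do need the symbols when extracting text, one option is to walk the markup yourself. The following sketch uses System.Xml.Linq only and is not part of the Open XML SDK or the Open-Xml-PowerTools. It appends the value of each w:t element and renders each w:sym as the character whose code is given by its w:char attribute. SymbolAwareText and GetText are hypothetical names, and the sketch assumes that w:char holds a four-digit hexadecimal value such as F04A.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class SymbolAwareText
{
    private static readonly XNamespace WordNs =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of the paragraph in document order, including
    // one character for each w:sym element.
    public static string GetText(XElement paragraph)
    {
        var builder = new StringBuilder();
        foreach (XElement element in paragraph.Descendants())
        {
            if (element.Name == WordNs + "t")
            {
                builder.Append(element.Value);
            }
            else if (element.Name == WordNs + "sym")
            {
                // Assumption: w:char is a hexadecimal code such as "F04A".
                string hex = (string) element.Attribute(WordNs + "char");
                if (hex != null) builder.Append((char) Convert.ToInt32(hex, 16));
            }
        }
        return builder.ToString();
    }
}
```

Keep in mind that a code such as F04A lies in the Unicode Private Use Area, so the resulting character is only meaningful together with the font named in the w:font attribute (Wingdings in the example).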
Note that you might not need to consider all of the above in simple use cases. But if you must deal with real-life documents or the markup changes made by Microsoft Word, chances are you can't ignore the complexity. And wait until you need to deal with revision markup ...
As always, the full source code can be found in my CodeSnippets GitHub repository. Look for the OpenXmlRegexTests class.