OpenXML: replace text throughout the whole document
M

3

11

I have the piece of code below. I'd like to replace the text "Text1" with "NewText", and that works. But when I place the text "Text1" inside a table, the occurrence inside the table is no longer replaced.

I'd like to make this replacement throughout the whole document.

using (WordprocessingDocument doc = WordprocessingDocument.Open(String.Format("c:\\temp\\filename.docx"), true))
{
    var body = doc.MainDocumentPart.Document.Body;

    foreach (var para in body.Elements<Paragraph>())
    {
        foreach (var run in para.Elements<Run>())
        {
            foreach (var text in run.Elements<Text>())
            {
                if (text.Text.Contains("##Text1##"))
                    text.Text = text.Text.Replace("##Text1##", "NewText");
            }
        }
    }
}
Merozoite answered 30/9, 2013 at 12:32 Comment(0)
S
16

Your code does not work because body.Elements<Paragraph>() returns only the paragraphs (w:p) that are direct children of the body. A table element (w:tbl) is a sibling of those paragraphs, not a child of one, so the paragraphs inside its cells are never visited. See the following MSDN article for more information.

The Text class (serialized as w:t) usually represents literal text within a Run element in a Word document. So you could simply search for all w:t elements (Text class) and replace your tag wherever a text element (w:t) contains it:

using (WordprocessingDocument doc = WordprocessingDocument.Open("yourdoc.docx", true))
{
  var body = doc.MainDocumentPart.Document.Body;

  foreach (var text in body.Descendants<Text>())
  {
    if (text.Text.Contains("##Text1##"))
    {
      text.Text = text.Text.Replace("##Text1##", "NewText");
    }
  }
}
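
For completeness, here is a self-contained sketch of the same approach. It includes the using directives the snippet needs (the Text class lives in DocumentFormat.OpenXml.Wordprocessing) and, as an assumption about what "the whole document" should cover, applies the same loop to headers and footers, which are stored in separate parts and are not descendants of the body. The names TagReplacer, ReplaceIn and ReplaceEverywhere are made up for illustration and are not part of the SDK.

```csharp
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class TagReplacer   // hypothetical helper, not part of the SDK
{
    // Replaces 'tag' in every w:t element below 'root', at any nesting depth.
    // Returns the number of Text elements that were changed.
    public static int ReplaceIn(OpenXmlElement root, string tag, string value)
    {
        int changed = 0;
        foreach (var text in root.Descendants<Text>())
        {
            if (text.Text.Contains(tag))
            {
                text.Text = text.Text.Replace(tag, value);
                changed++;
            }
        }
        return changed;
    }

    // Applies ReplaceIn to the body and to every header and footer part.
    public static int ReplaceEverywhere(string path, string tag, string value)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(path, true))
        {
            MainDocumentPart main = doc.MainDocumentPart;
            int changed = ReplaceIn(main.Document.Body, tag, value);

            foreach (HeaderPart headerPart in main.HeaderParts)
                changed += ReplaceIn(headerPart.Header, tag, value);

            foreach (FooterPart footerPart in main.FooterParts)
                changed += ReplaceIn(footerPart.Footer, tag, value);

            return changed;
        }
    }
}
```

Like the snippet above, this only finds a tag that sits entirely inside one w:t element.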
Saied answered 30/9, 2013 at 17:23 Comment(4)
Note that this answer, and every other answer that just grabs a <Text> block, mostly works but is not very reliable. Many things in OpenXml can break the text up: formatting applied to part of a word, bookmarks, and so on. The code at msdn.microsoft.com/en-us/library/… supposedly fixes it, but I haven't made it work yet, so I can't report success or failure. In my particular sample documents, about 1 word out of 100-200 gets broken up. - Nanette
@WadeHatler: Thank you for your comment. I will have a look at the code provided in your link. - Saied
Happy to help. I've mostly concluded that I hate OpenXml. I found some code that almost works at blogs.msdn.com/b/ericwhite/archive/2008/07/09/…, blogs.msdn.com/b/ericwhite/archive/2008/03/14/… and blogs.msdn.com/b/ericwhite/archive/2009/02/16/…. It's still unreliable because it can't figure out when to put in spaces, so I get fragments. I'll post an answer if I get it working right. - Nanette
How can I solve the error "type or namespace 'Text' could not be found" on the line var text in body.Descendants<Text>()? - Skydive
S
12

Borrowing from other answers in various places, here is a solution that addresses four main requirements:

  1. Remove any characters from your replacement string that cannot be encoded in XML and that Word cannot read (for example, from bad user input).
  2. Find the search string even when it spans multiple runs or text elements within a paragraph (Word will often break up a single sentence into several text runs).
  3. Allow line breaks in the replacement text, so that multi-line text can be inserted into the document.
  4. Accept any node as the starting point of the search, so that the search can be restricted to that part of the document (such as the body, the header, the footer, a specific table, table row, or table cell).

I am sure that advanced scenarios such as bookmarks or complex nesting will need further modifications, but this works for the kinds of basic Word documents I have run into so far. It is also much more helpful to me than disregarding runs altogether, or running a RegEx over the entire file with no ability to target a specific TableCell or document part.
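
To check whether point 2 affects a given document at all, a small diagnostic can list the paragraphs in which the tag exists only in split form. This is a sketch under assumptions: SplitTagFinder and FindSplitTags are made-up names, and a paragraph that contains the tag both whole and split is not reported.

```csharp
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class SplitTagFinder   // hypothetical helper, not part of the SDK
{
    // Returns the paragraphs below 'root' whose concatenated text contains 'tag'
    // although no single Text element contains it, i.e. the tag is spread
    // over several w:t elements.
    public static List<Paragraph> FindSplitTags(OpenXmlElement root, string tag)
    {
        var result = new List<Paragraph>();
        foreach (var paragraph in root.Descendants<Paragraph>())
        {
            List<Text> texts = paragraph.Descendants<Text>().ToList();

            bool inOneElement = texts.Any(t => t.Text.Contains(tag));
            string joined = string.Concat(texts.Select(t => t.Text));

            if (!inOneElement && joined.Contains(tag))
                result.Add(paragraph);
        }
        return result;
    }
}
```

If the returned list is empty for your template, replacing per Text element is enough; otherwise a run-spanning search like the one below is needed.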

Example Usage:

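 var body = document.MainDocumentPart.Document.Body;
 string clean = OpenXmlTools.RemoveInvalidXMLChars(replaceWith);
 // ReplaceText takes a Paragraph, so call it for every paragraph below the starting node.
 foreach (var paragraph in body.Descendants<Paragraph>().ToList())
     WordTools.ReplaceText(paragraph, find, clean);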

The code:

using System;
using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

namespace My.Web.Api.OpenXml
{
    public static class WordTools
    {


        /// <summary>
        /// Find/replace within the specified paragraph.
        /// </summary>
        /// <param name="paragraph"></param>
        /// <param name="find"></param>
        /// <param name="replaceWith"></param>
        public static void ReplaceText(Paragraph paragraph, string find, string replaceWith)
        {
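            if (string.IsNullOrEmpty(find)) return; // IsMatch indexes into 'find', so an empty search string would throw
            var texts = paragraph.Descendants<Text>();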
            for (int t = 0; t < texts.Count(); t++)
            {   // figure out which Text element within the paragraph contains the starting point of the search string
                Text txt = texts.ElementAt(t);
                for (int c = 0; c < txt.Text.Length; c++)
                {
                    var match = IsMatch(texts, t, c, find);
                    if (match != null)
                    {   // now replace the text
                        string[] lines = replaceWith.Replace(Environment.NewLine, "\r").Split('\n', '\r'); // handle any lone n/r returns, plus newline.

                        int skip = lines[lines.Length - 1].Length - 1; // will jump to end of the replacement text, it has been processed.

                        if (c > 0)
                            lines[0] = txt.Text.Substring(0, c) + lines[0];  // has a prefix
                        if (match.EndCharIndex + 1 < texts.ElementAt(match.EndElementIndex).Text.Length)
                            lines[lines.Length - 1] = lines[lines.Length - 1] + texts.ElementAt(match.EndElementIndex).Text.Substring(match.EndCharIndex + 1);

                        txt.Space = new EnumValue<SpaceProcessingModeValues>(SpaceProcessingModeValues.Preserve); // in case your value starts/ends with whitespace
                        txt.Text = lines[0];

                        // remove any extra texts.
                        for (int i = t + 1; i <= match.EndElementIndex; i++)
                        {
                            texts.ElementAt(i).Text = string.Empty; // clear the text
                        }

                        // if 'with' contained line breaks we need to add breaks back...
                        if (lines.Count() > 1)
                        {
                            OpenXmlElement currEl = txt;
                            Break br;

                            // append more lines
                            var run = txt.Parent as Run;
                            for (int i = 1; i < lines.Count(); i++)
                            {
                                br = new Break();
                                run.InsertAfter<Break>(br, currEl);
                                currEl = br;
                                txt = new Text(lines[i]);
                                run.InsertAfter<Text>(txt, currEl);
                                t++; // skip to this next text element
                                currEl = txt;
                            }
                            c = skip; // new line
                        }
                        else
                        {   // continue to process same line
                            c += skip;
                        }
                    }
                }
            }
        }



        /// <summary>
        /// Determine if the texts (starting at element t, char c) exactly contain the find text
        /// </summary>
        /// <param name="texts"></param>
        /// <param name="t"></param>
        /// <param name="c"></param>
        /// <param name="find"></param>
        /// <returns>null or the result info</returns>
        static Match IsMatch(IEnumerable<Text> texts, int t, int c, string find)
        {
            int ix = 0;
            for (int i = t; i < texts.Count(); i++)
            {
                for (int j = c; j < texts.ElementAt(i).Text.Length; j++)
                {
                    if (find[ix] != texts.ElementAt(i).Text[j])
                    {
                        return null; // element mismatch
                    }
                    ix++; // match; go to next character
                    if (ix == find.Length)
                        return new Match() { EndElementIndex = i, EndCharIndex = j }; // full match with no issues
                }
                c = 0; // reset char index for next text element
            }
            return null; // ran out of text, not a string match
        }

        /// <summary>
        /// Defines a match result
        /// </summary>
        class Match
        {
            /// <summary>
            /// Last matching element index containing part of the search text
            /// </summary>
            public int EndElementIndex { get; set; }
            /// <summary>
            /// Last matching char index of the search text in last matching element
            /// </summary>
            public int EndCharIndex { get; set; }
        }

     }   // class
}  // namespace


// Add "using System.Text.RegularExpressions;" to the using directives for this class.
public static class OpenXmlTools
{
    // Matches unpaired surrogates and control characters; properly formed surrogate pairs are left alone.
    private static readonly Regex _invalidXMLChars = new Regex(
        @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]",
        RegexOptions.Compiled);

    /// <summary>
    /// Removes characters that cannot be encoded in XML and would cause an exception on save.
    /// </summary>
    public static string RemoveInvalidXMLChars(string text)
    {
        if (string.IsNullOrEmpty(text)) return "";
        return _invalidXMLChars.Replace(text, "");
    }
}
Susannasusannah answered 27/4, 2015 at 20:1 Comment(4)
Excuse me, I get an error in Visual Studio: document.MainDocumentPart.Document.Body is of type Body, but public static void ReplaceText(Paragraph paragraph, string find, string replaceWith) requires a Paragraph, so the code does not compile. - Humboldt
@Ozeta, add this: var body = doc.MainDocumentPart.Document.Body; var paragraphs = body.Elements<Paragraph>(); foreach (var p in paragraphs) { ReplaceText(p, find, replaceWith); } Apart from that, this solution works perfectly, thanks @Amos. - Sandarac
Thanks for the paragraph approach, it helped me a lot. - Superstructure
This answer appears to be the breakthrough I needed. - Turmoil
R
6

Maybe this solution is easier:

using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
    string docText = null;

    // 1. Read the whole main document part (raw XML) into a string
    using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
        docText = sr.ReadToEnd();

    // 2. Use a regular expression to replace all matches ('find' is treated as a regex pattern)
    Regex regexText = new Regex(find);
    docText = regexText.Replace(docText, replace);

    // 3. Write the changed string back into the part
    using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
        sw.Write(docText);
}
Richmound answered 17/6, 2014 at 8:26 Comment(1)
Warning: this also replaces XML tags. Finding "<" and replacing it with string.Empty ends up with a corrupt document. - Purity
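
One way to reduce the risk described in the warning above is to treat both strings as literal text and XML-escape them before touching the raw markup. The sketch below is an assumption-laden illustration, not a full fix: RawXmlReplace and its methods are made-up names, it escapes only &, < and >, a plain search term such as "w" can still match inside tag or attribute names, and a tag that Word has split across runs is still not found.

```csharp
using System;

public static class RawXmlReplace   // hypothetical helper, not part of any library
{
    // Escapes the characters that are written as entities in XML text content.
    // "&" must be handled first so the other entities are not escaped twice.
    static string EscapeForXmlText(string s)
    {
        return s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // Literal (non-regex) replacement on the raw XML of a document part.
    // Both strings are escaped, so a search for "<" looks for "&lt;" and
    // never matches the angle brackets of the markup itself.
    public static string ReplaceLiteral(string docXml, string find, string replaceWith)
    {
        if (string.IsNullOrEmpty(find)) return docXml;
        return docXml.Replace(EscapeForXmlText(find), EscapeForXmlText(replaceWith ?? ""));
    }
}
```

In the snippet above, step 2 would then become docText = RawXmlReplace.ReplaceLiteral(docText, find, replace). A distinctive placeholder such as ##Text1## keeps accidental matches in the markup unlikely.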
