Python docx Replace string in paragraph while keeping style
Asked Answered
W

10

21

I need help replacing a string in a word document while keeping the formatting of the entire document.

I'm using python-docx, after reading the documentation, it works with entire paragraphs, so I loose formatting like words that are in bold or italics. Including the text to replace is in bold, and I would like to keep it that way. I'm using this code:

from docx import Document
def replace_string2(filename):
    doc = Document(filename)
    for p in doc.paragraphs:
        if 'Text to find and replace' in p.text:
            print 'SEARCH FOUND!!'
            text = p.text.replace('Text to find and replace', 'new text')
            style = p.style
            p.text = text
            p.style = style
    # doc.save(filename)
    doc.save('test.docx')
    return 1

So if I implement it and want something like (the paragraph containing the string to be replaced loses its formatting):

This is paragraph 1, and this is a text in bold.

This is paragraph 2, and I will replace old text

The current result is:

This is paragraph 1, and this is a text in bold.

This is paragraph 2, and I will replace new text

Willman answered 14/1, 2016 at 0:39 Comment(3)
You could try using indices. Ex for p in range(len(doc.paragraphs)): . . ., and then set the paragraph back by doc.paragraphs[p] = text, assuming the doc.paragraphs returns a list like the documentation says.Maltreat
I think this gives the same output I'm currently getting. The file keeps its formatting, except the paragraph that contains the string to be replaced. Please correct me if I'm wrong.Willman
Text formatting is not saved when using assignment. "Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed."Maltreat
W
27

I posted this question (even though I saw a few identical ones on here), because none of those (to my knowledge) solved the issue. There was one using a oodocx library, which I tried, but did not work. So I found a workaround.

The code is very similar, but the logic is: when I find the paragraph that contains the string I wish to replace, add another loop using runs. (this will only work if the string I wish to replace has the same formatting).

def replace_string(filename):
    doc = Document(filename)
    for p in doc.paragraphs:
        if 'old text' in p.text:
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if 'old text' in inline[i].text:
                    text = inline[i].text.replace('old text', 'new text')
                    inline[i].text = text
            print p.text

    doc.save('dest1.docx')
    return 1
Willman answered 14/1, 2016 at 6:40 Comment(4)
I don't think this can possible work in all cases, the inline list refers to the internal items of the paragraph, hence the paragrah text is compound of them you won't find the full replacement text in one itemFloria
The thing is that if you find the word you want to replace in a line, you will still replace the whole line by doing this instead of just the one word you want to replace. It's still better than replacing the whole paragraph though.Eyelid
The run creation is very unpredictable and uncontrollable. I'm creating "tags" in the document to replace them and the runs are breaking apart my tag. For Ex: Tag -> ${cma-tag-one}, Runs -> [${, cma-tag, one, }]. Tag -> cmatagthree, Runs -> [cma, tag, thr, ee]. Tag -> mytag, Runs -> [mytag]. Tag -> ImTryingToFindYou, Runs -> [Im, Trying, To, Find, You]. Some using runs could not be the best approach because you may be unable to find your tagsRuffianism
If search string is mixed with characters and symbols, the string will breaks into parts and the text cannot be found and replaced. I found code in this link works.Sublimate
F
16

According to the architecture of the DOCX document:

  1. Text: docx>Paragraphs>runs>run
  2. Text table: docx>tables>rows>cells>Paragraphs>runs>run
  3. Header: docx>sections>header>Paragraphs>runs>run
  4. Header tables: docx>sections>header>tables>row>cells>Paragraphs>runs>run Document Structure

The footer is the same as the header, we can directly traverse the paragraph to find and replace our keywords, but this will cause the text format to be reset, so we can only traverse the words in the run and replace them. However, as our keywords may exceed the length range of the run, we cannot replace them successfully.

Therefore, I provide an idea here: firstly, take paragraph as unit, and mark the position of every character in paragraph through list; then, mark the position of every character in run through list; find keywords in paragraph, delete and replace them by character as unit by corresponding relation.

#!/usr/bin/env python 3.9
# -*- coding: utf-8 -*-
# @Time    : 2022/12/6 18:02 update
# @Author  : ZCG
# @File    : WordReplace.py
# @Software: PyCharm
# @Notice  :


from docx import Document
import os



class Execute:
    '''
        Execute Paragraphs KeyWords Replace
        paragraph: docx paragraph
    '''

    def __init__(self, paragraph):
        self.paragraph = paragraph


    def p_replace(self, x:int, key:str, value:str):
        '''
        paragraph replace
        The reason why you do not replace the text in a paragraph directly is that it will cause the original format to
        change. Replacing the text in runs will not cause the original format to change
        :param x:       paragraph id
        :param key:     Keywords that need to be replaced
        :param value:   The replaced keywords
        :return:
        '''
        # Gets the coordinate index values of all the characters in this paragraph [{run_index , char_index}]
        p_maps = [{"run": y, "char": z} for y, run in enumerate(self.paragraph.runs) for z, char in enumerate(list(run.text))]
        # Handle the number of times key occurs in this paragraph, and record the starting position in the list.
        # Here, while self.text.find(key) >= 0, the {"ab":"abc"} term will enter an endless loop
        # Takes a single paragraph as an independent body and gets an index list of key positions within the paragraph, or if the paragraph contains multiple keys, there are multiple index values
        k_idx = [s for s in range(len(self.paragraph.text)) if self.paragraph.text.find(key, s, len(self.paragraph.text)) == s]
        for i, start_idx in enumerate(reversed(k_idx)):       # Reverse order iteration
            end_idx = start_idx + len(key)                    # The end position of the keyword in this paragraph
            k_maps = p_maps[start_idx:end_idx]                # Map Slice List A list of dictionaries for sections that contain keywords in a paragraph
            self.r_replace(k_maps, value)
            print(f"\t |Paragraph {x+1: >3}, object {i+1: >3} replaced successfully! | {key} ===> {value}")


    def r_replace(self, k_maps:list, value:str):
        '''
        :param k_maps: The list of indexed dictionaries containing keywords, e.g:[{"run":15, "char":3},{"run":15, "char":4},{"run":16, "char":0}]
        :param value:
        :return:
        Accept arguments, removing the characters in k_maps from back to front, leaving the first one to replace with value
        Note: Must be removed in reverse order, otherwise the list length change will cause IndedxError: string index out of range
        '''
        for i, position in enumerate(reversed(k_maps), start=1):
            y, z = position["run"], position["char"]
            run:object = self.paragraph.runs[y]         # "k_maps" may contain multiple run ids, which need to be separated
            # Pit: Instead of the replace() method, str is converted to list after a single word to prevent run.text from making an error in some cases (e.g., a single run contains a duplicate word)
            thisrun = list(run.text)
            if i < len(k_maps):
                thisrun.pop(z)          # Deleting a corresponding word
            if i == len(k_maps):        # The last iteration (first word), that is, the number of iterations is equal to the length of k_maps
                thisrun[z] = value      # Replace the word in the corresponding position with the new content
            run.text = ''.join(thisrun) # Recover



class WordReplace:
    '''
        file: Microsoft Office word file,only support .docx type file
    '''

    def __init__(self, file):
        self.docx = Document(file)

    def body_content(self, replace_dict:dict):
        print("\t☺Processing keywords in the body...")
        for key, value in replace_dict.items():
            for x, paragraph in enumerate(self.docx.paragraphs):
                Execute(paragraph).p_replace(x, key, value)
        print("\t |Body keywords in the text are replaced!")


    def body_tables(self,replace_dict:dict):
        print("\t☺Processing keywords in the body'tables...")
        for key, value in replace_dict.items():
            for table in self.docx.tables:
                for row in table.rows:
                    for cell in row.cells:
                        for x, paragraph in enumerate(cell.paragraphs):
                            Execute(paragraph).p_replace(x, key, value)
        print("\t |Body'tables keywords in the text are replaced!")


    def header_content(self,replace_dict:dict):
        print("\t☺Processing keywords in the header'body ...")
        for key, value in replace_dict.items():
            for section in self.docx.sections:
                for x, paragraph in enumerate(section.header.paragraphs):
                    Execute(paragraph).p_replace(x, key, value)
        print("\t |Header'body keywords in the text are replaced!")


    def header_tables(self,replace_dict:dict):
        print("\t☺Processing keywords in the header'tables ...")
        for key, value in replace_dict.items():
            for section in self.docx.sections:
                for table in section.header.tables:
                    for row in table.rows:
                        for cell in row.cells:
                            for x, paragraph in enumerate(cell.paragraphs):
                                Execute(paragraph).p_replace(x, key, value)
        print("\t |Header'tables keywords in the text are replaced!")


    def footer_content(self, replace_dict:dict):
        print("\t☺Processing keywords in the footer'body ...")
        for key, value in replace_dict.items():
            for section in self.docx.sections:
                for x, paragraph in enumerate(section.footer.paragraphs):
                    Execute(paragraph).p_replace(x, key, value)
        print("\t |Footer'body keywords in the text are replaced!")


    def footer_tables(self, replace_dict:dict):
        print("\t☺Processing keywords in the footer'tables ...")
        for key, value in replace_dict.items():
            for section in self.docx.sections:
                for table in section.footer.tables:
                    for row in table.rows:
                        for cell in row.cells:
                            for x, paragraph in enumerate(cell.paragraphs):
                                Execute(paragraph).p_replace(x, key, value)
        print("\t |Footer'tables keywords in the text are replaced!")


    def save(self, filepath:str):
        '''
        :param filepath: File saving path
        :return:
        '''
        self.docx.save(filepath)


    @staticmethod
    def docx_list(dirPath):
        '''
        :param dirPath:
        :return: List of docx files in the current directory
        '''
        fileList = []
        for roots, dirs, files in os.walk(dirPath):
            for file in files:
                if file.endswith("docx") and file[0] != "~":  # Find the docx document and exclude temporary files
                    fileRoot = os.path.join(roots, file)
                    fileList.append(fileRoot)
        print("This directory finds a total of {0} related files!".format(len(fileList)))
        return fileList



def main():
    '''
    To use: Modify the values in replace dict and filedir
    replace_dict :key:to be replaced, value:new content
    filedir :Directory where docx files are stored. Subdirectories are supported
    '''
    # input section
    replace_dict = {
        "aaa":"bbb",
        "ccc":"ddd",
        }
    filedir = r"D:\Working Files\svn"

    # Call processing section
    for i, file in enumerate(WordReplace.docx_list(filedir),start=1):
        print(f"{i}、Processing file:{file}")
        wordreplace = WordReplace(file)
        wordreplace.header_content(replace_dict)
        wordreplace.header_tables(replace_dict)
        wordreplace.body_content(replace_dict)
        wordreplace.body_tables(replace_dict)
        wordreplace.footer_content(replace_dict)
        wordreplace.footer_tables(replace_dict)
        wordreplace.save(file)
        print(f'\t☻The document processing is complete!\n')



if __name__ == "__main__":
    main()
    print("All complete!")
Franglais answered 21/4, 2021 at 2:58 Comment(6)
char = self.runs[y].text[z] IndexError: string index out of range, do you know the reason?Antemeridian
@RickyWu Thank you for your reminder, this is a bug, that is, if I directly replace the content of run.text, when the same word exists in run.text at the same time, it will be replaced together, so an IndexError will be raised, so I use delete The index method of run.text solves the problem, please refer to my re-edited content for the new answer.Franglais
Really powerful script!Lesh
@Zhang Thank you very much for a great explanation and source code. It worked like a charm for my case. I have a DOCX containing table and images and your code just replaced the text I wanted without modifying anything else. Great job :)Barcelona
I think this will not replace text in hyperlinks, e.g. the table of content.Detailed
Great job but I'd like to say one thing about performance. Invoking the entire methods, is super heavy duty. For example running it on 5 docx files that each file is 2-3 pages takes 3 minutes. I've even created different threads for each doc for multi-threading env.Bonze
G
12

This is what works for me to retain the text style when replacing text.

Based on Alo's answer and the fact the search text can be split over several runs, here's what worked for me to replace placeholder text in a template docx file. It checks all the document paragraphs and any table cell contents for the placeholders.

Once the search text is found in a paragraph it loops through it's runs identifying which runs contains the partial text of the search text, after which it inserts the replacement text in the first run then blanks out the remaining search text characters in the remaining runs.

I hope this helps someone. Here's the gist if anyone wants to improve it

Edit: I have subsequently discovered python-docx-template which allows jinja2 style templating within a docx template. Here's a link to the documentation

def docx_replace(doc, data):
    paragraphs = list(doc.paragraphs)
    for t in doc.tables:
        for row in t.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    paragraphs.append(paragraph)
    for p in paragraphs:
        for key, val in data.items():
            key_name = '${{{}}}'.format(key) # I'm using placeholders in the form ${PlaceholderName}
            if key_name in p.text:
                inline = p.runs
                # Replace strings and retain the same style.
                # The text to be replaced can be split over several runs so
                # search through, identify which runs need to have text replaced
                # then replace the text in those identified
                started = False
                key_index = 0
                # found_runs is a list of (inline index, index of match, length of match)
                found_runs = list()
                found_all = False
                replace_done = False
                for i in range(len(inline)):

                    # case 1: found in single run so short circuit the replace
                    if key_name in inline[i].text and not started:
                        found_runs.append((i, inline[i].text.find(key_name), len(key_name)))
                        text = inline[i].text.replace(key_name, str(val))
                        inline[i].text = text
                        replace_done = True
                        found_all = True
                        break

                    if key_name[key_index] not in inline[i].text and not started:
                        # keep looking ...
                        continue

                    # case 2: search for partial text, find first run
                    if key_name[key_index] in inline[i].text and inline[i].text[-1] in key_name and not started:
                        # check sequence
                        start_index = inline[i].text.find(key_name[key_index])
                        check_length = len(inline[i].text)
                        for text_index in range(start_index, check_length):
                            if inline[i].text[text_index] != key_name[key_index]:
                                # no match so must be false positive
                                break
                        if key_index == 0:
                            started = True
                        chars_found = check_length - start_index
                        key_index += chars_found
                        found_runs.append((i, start_index, chars_found))
                        if key_index != len(key_name):
                            continue
                        else:
                            # found all chars in key_name
                            found_all = True
                            break

                    # case 2: search for partial text, find subsequent run
                    if key_name[key_index] in inline[i].text and started and not found_all:
                        # check sequence
                        chars_found = 0
                        check_length = len(inline[i].text)
                        for text_index in range(0, check_length):
                            if inline[i].text[text_index] == key_name[key_index]:
                                key_index += 1
                                chars_found += 1
                            else:
                                break
                        # no match so must be end
                        found_runs.append((i, 0, chars_found))
                        if key_index == len(key_name):
                            found_all = True
                            break

                if found_all and not replace_done:
                    for i, item in enumerate(found_runs):
                        index, start, length = [t for t in item]
                        if i == 0:
                            text = inline[index].text.replace(inline[index].text[start:start + length], str(val))
                            inline[index].text = text
                        else:
                            text = inline[index].text.replace(inline[index].text[start:start + length], '')
                            inline[index].text = text
                # print(p.text)

# usage

doc = docx.Document('path/to/template.docx')
docx_replace(doc, dict(ItemOne='replacement text', ItemTwo="Some replacement text\nand some more")
doc.save('path/to/destination.docx')
Gev answered 17/4, 2019 at 17:25 Comment(0)
M
5

After analyzing a lot of suggestions and solutions across the web, I developed a python package based on all information I could find, plus a new strategy to keep all the format you have.

You can find more information at: https://github.com/ivanbicalho/python-docx-replace

Or, if you prefer, just install the package and use it, here's an example:

pip3 install python-docx-replace 
from python_docx_replace.docx_replace import docx_replace

# get your document using python-docx
doc = Document("document.docx")

# call the replace function with your key value pairs
docx_replace(doc, name="Ivan", phone="+55123456789")

# do whatever you want after that, usually save the document
doc.save("replaced.docx")
Markova answered 16/6, 2022 at 2:34 Comment(5)
This should get more ups, this is the only solution that works wellOnionskin
This is more a word template solution. Your document.docx must contain keys in the format ${name} and ${phone} to be replaced by your code. But the question here asks for a general search and replace not limited to a specific key format.Detailed
I've really tried to use your library but it doesn't do the replacement, simply doesn't work. Tried the dictionary API and the one word expression API, doesn't work. What am I doing wrong ?Bonze
Can you share a piece of your code here? I can take a look for you.Markova
Solução perfeita e precisa, meu amigo! Good Job! Worked for me, thanks!Trimaran
W
4
from docx import Document

document = Document('old.docx')

dic = {
    '{{FULLNAME}}':'First Last',
    '{{FIRST}}':'First',
    '{{LAST}}' : 'Last',
}
for p in document.paragraphs:
    inline = p.runs
    for i in range(len(inline)):
        text = inline[i].text
        for key in dic.keys():
            if key in text:
                 text=text.replace(key,dic[key])
                 inline[i].text = text


document.save('new.docx')
Weepy answered 5/8, 2019 at 19:52 Comment(3)
Welcome Zain. Could you, please, add some context to your answer, explaining why the code you posted fixes the issue on the original question?Longcloth
The run creation is very unpredictable and uncontrollable. I'm creating "tags" in the document to replace them and the runs are breaking apart my tag. For Ex: Tag -> ${cma-tag-one}, Runs -> [${, cma-tag, one, }]. Tag -> cmatagthree, Runs -> [cma, tag, thr, ee]. Tag -> mytag, Runs -> [mytag]. Tag -> ImTryingToFindYou, Runs -> [Im, Trying, To, Find, You]. Some using runs could not be the best approach because you may be unable to find your tagsRuffianism
One flaw in your method is that it won't work if you need to replace text that spans two “run” or goes beyond the scope of 'run'Franglais
T
1

https://gist.github.com/heimoshuiyu/671a4dfbd13f7c279e85224a5b6726c0

This use a "shuttle" so it can find the key which cross multiple runs. This is similar to the "Replace All" behavior in MS Word

def shuttle_text(shuttle):
    t = ''
    for i in shuttle:
        t += i.text
    return t

def docx_replace(doc, data):
    for key in data:

        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    if key in cell.text:
                        cell.text = cell.text.replace(key, data[key])

        for p in doc.paragraphs:

            begin = 0
            for end in range(len(p.runs)):

                shuttle = p.runs[begin:end+1]

                full_text = shuttle_text(shuttle)
                if key in full_text:
                    # print('Replace:', key, '->', data[key])
                    # print([i.text for i in shuttle])

                    # find the begin
                    index = full_text.index(key)
                    # print('full_text length', len(full_text), 'index:', index)
                    while index >= len(p.runs[begin].text):
                        index -= len(p.runs[begin].text)
                        begin += 1

                    shuttle = p.runs[begin:end+1]

                    # do replace
                    # print('before replace', [i.text for i in shuttle])
                    if key in shuttle[0].text:
                        shuttle[0].text = shuttle[0].text.replace(key, data[key])
                    else:
                        replace_begin_index = shuttle_text(shuttle).index(key)
                        replace_end_index = replace_begin_index + len(key)
                        replace_end_index_in_last_run = replace_end_index - len(shuttle_text(shuttle[:-1]))
                        shuttle[0].text = shuttle[0].text[:replace_begin_index] + data[key]

                        # clear middle runs
                        for i in shuttle[1:-1]:
                            i.text = ''

                        # keep last run
                        shuttle[-1].text = shuttle[-1].text[replace_end_index_in_last_run:]

                    # print('after replace', [i.text for i in shuttle])

                    # set begin to next
                    begin = end

# usage

doc = docx.Document('path/to/template.docx')
docx_replace(doc, dict(ItemOne='replacement text', ItemTwo="Some replacement text\nand some more")
doc.save('path/to/destination.docx')
Tonsillitis answered 25/1, 2022 at 3:22 Comment(1)
This methods seems to work as well as that of https://mcmap.net/q/302849/-python-docx-replace-string-in-paragraph-while-keeping-style. Simply replace p_replace by the shuttle method above.Fjeld
S
1

In case it serves to somebody, this worked for me. It does not keep the same format as in the opened document, but puts the format you want in case of replacing words going paragraph by paragraph:

from docx import Document
from docx.shared import Pt

word = "word to be replaced"
replacement = "word to replace"
document = Document()
style = document.styles['Normal']
font = style.font
font.name = 'Arial'
font.size = Pt(10)
for paragraph in document.paragraphs:
    if word in paragraph.text:
        paragraph.text = paragraph.text.replace(word,replacement)
        paragraph.style = document.styles['Normal']

Hope it serves!

Selfinsurance answered 11/11, 2022 at 11:50 Comment(0)
S
0

There's a new add-on module that accomplishes this exact functionality and it can be used with the docx library. When replacing strings, it keeps the original format of the text in the document.

Simply import docxedit and use the provided functions.

For example:

from docx import Document
import docxedit

document = Document()

# Replace all instances of the word 'Hello' in the document with 'Goodbye'
docxedit.replace_string(document, old_string='Hello', new_string='Goodbye')

# Replace all instances of the word 'Hello' in the document with 'Goodbye' but only
# up to paragraph 10
docxedit.replace_string_up_to_paragraph(document, old_string='Hello', new_string='Goodbye', 
                                        paragraph_number=10)

# Remove any line that contains the word 'Hello' along with the next 5 lines
docxedit.remove_lines(document, first_line='Hello', number_of_lines=5)

See more documentation here: https://github.com/henrihapponen/docxedit

Souffle answered 6/11, 2022 at 15:43 Comment(1)
I tried to replace a string in a table that was in the header. The word was not found.Postman
S
0

Try this link: https://thepythoncode.com/assistant/code-generator/python/

It generated a very similar code to the one you found out.

from docx import Document

def replace_word_in_docx(input_file, output_file, old_word, new_word):
"""
Replace a word in a Microsoft Word document while keeping the same 
styles.

Args:
input_file (str): Path to the input Word document.
output_file (str): Path to the output Word document.
old_word (str): The word to be replaced.
new_word (str): The new word to replace the old word.

Returns:
None
"""
doc = Document(input_file)

for paragraph in doc.paragraphs:
    if old_word in paragraph.text:
        inline = paragraph.runs
        for i in range(len(inline)):
            if old_word in inline[i].text:
                text = inline[i].text.replace(old_word, new_word)
                inline[i].text = text

doc.save(output_file)
Speciosity answered 5/1 at 19:22 Comment(0)
A
0

A lot of good ideas here. To add to what's already been said this post humbly offers A simple function to reduce the numbers of runs. Thereby reducing the chance the text you're searching for is broken up across multiple runs.

Note, it only accounts for the 5 styles specified. I also force colons at the start of a run into the previous run, and force everything between square brackets in a single run(relevant for my use case, but maybe not for yours).

def merge_superflous_runs(cell):
    """
        Where multiple consecutive runs have the same styles, combine them into one run.
        Also forces [square bracket content into a single run.]
        and moves any colons: at the start of a run to the end of the previous run.
    """
    for p in cell.paragraphs:
        # check styles of each run in p
        # If a run has the same styles as the previous run combine them
        previous_run_style = ''
        current_run_style = ''
        new_run_list = []

        for run in p.runs:

            if run.font.bold:
                current_run_style += 'bold'
            if run.font.italic:
                current_run_style += 'italic'
            if run.font.underline:
                current_run_style += 'underline'
            if run.font.superscript:
                current_run_style += 'superscript'
            if run.font.subscript:
                current_run_style += 'subscript'

            previous_run_open_bracket_pos = previous_run_style.rfind('[')
            previous_run_close_bracket_pos = previous_run_style.rfind(']')

            if len(new_run_list) and (current_run_style == previous_run_style or (previous_run_open_bracket_pos > -1 and previous_run_close_bracket_pos < previous_run_open_bracket_pos)):

                new_run_list[-1].text += run.text

            elif len(run.text.strip()) and run.text.strip()[0] == ':':
                new_run_list[-1].text += ':'
                colon_index = run.text.find(':')
                run.text = run.text[colon_index+1:]
                new_run_list.append(run)
                previous_run_style = current_run_style

            else:
                new_run_list.append(run)
                previous_run_style = current_run_style

            current_run_style = ''

            # remove current run
            run._element.getparent().remove(run._element)

        # add new runs
        for new_run in new_run_list:

            p._p.append(new_run._element)
    
    return cell
Alexander answered 17/1 at 15:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.