Text-Replace in docx and save the changed file with python-docx
Asked Answered
T

8

12

I'm trying to use the python-docx module to replace a word in a file and save the new file with the caveat that the new file must have exactly the same formatting as the old file, but with the word replaced. How am I supposed to do this?

The docx module has a savedocx that takes 7 inputs:

  • document
  • coreprops
  • appprops
  • contenttypes
  • websettings
  • wordrelationships
  • output

How do I keep everything in my original file the same except for the replaced word?

Tanyatanzania answered 25/7, 2013 at 6:3 Comment(3)
What have you tried? You could also get more help giving links to the module.Laurence
github.com/mikemaccana/python-docxHatchway
Almost all those args have good default values. You can pretty much ignore enerything but outpout and document.Huntington
T
8

As it seems to be, Docx for Python is not meant to store a full Docx with images, headers, ... , but only contains the inner content of the document. So there's no simple way to do this.

Howewer, here is how you could do it:

First, have a look at the docx tag wiki:

It explains how the docx file can be unzipped: Here's how a typical file looks like:

+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //this folder contains most of the files that control the content of the document
|  +  document.xml //Is the actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //Containst the elements in the footer of the document
|  +  footnotes.xml
|  +--media //This folder contains all images embedded in the word
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //this document tells word where the images are situated
+  [Content_Types].xml
\--_rels
   \  .rels

Docx only gets one part of the document, in the method opendocx

def opendocx(file):
    '''Open a docx file, return a document XML tree'''
    mydoc = zipfile.ZipFile(file)
    xmlcontent = mydoc.read('word/document.xml')
    document = etree.fromstring(xmlcontent)
    return document

It only gets the document.xml file.

What I recommend you to do is:

  1. get the content of the document with **opendocx*
  2. Replace the document.xml with the advReplace method
  3. Open the docx as a zip, and replace the document.xml content's by the new xml content.
  4. Close and output the zipped file (renaming it to output.docx)

If you have node.js installed, be informed that I have worked on DocxGenJS which is templating engine for docx documents, the library is in active development and will be released soon as a node module.

Tylertylosis answered 25/7, 2013 at 15:46 Comment(1)
Thanks, I did another search for how to do this procedure you outlined and found this other question which shows how to do it: #16868094Tanyatanzania
R
12

this worked for me:

rep = {'a': 'b'}  # lookup dictionary, replace `a` with `b`
def docx_replace(old_file,new_file,rep):
    zin = zipfile.ZipFile (old_file, 'r')
    zout = zipfile.ZipFile (new_file, 'w')
    for item in zin.infolist():
        buffer = zin.read(item.filename)
        if (item.filename == 'word/document.xml'):
            res = buffer.decode("utf-8")
            for r in rep:
                res = res.replace(r,rep[r])
            buffer = res.encode("utf-8")
        zout.writestr(item, buffer)
    zout.close()
    zin.close()
Repugn answered 31/5, 2015 at 9:53 Comment(4)
wow, this was brilliant. I've been struggling to find a function like this for a while. Thank you!Shearin
Can you elaborate on this please?Kalpa
Yes, what is rep supposed to be? Can you post an example of function call to see the passed rep value format? ThanksMinuteman
rep is a lookup dictionary, e.g. rep = {'a': 'b'}Repugn
T
8

As it seems to be, Docx for Python is not meant to store a full Docx with images, headers, ... , but only contains the inner content of the document. So there's no simple way to do this.

Howewer, here is how you could do it:

First, have a look at the docx tag wiki:

It explains how the docx file can be unzipped: Here's how a typical file looks like:

+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //this folder contains most of the files that control the content of the document
|  +  document.xml //Is the actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //Containst the elements in the footer of the document
|  +  footnotes.xml
|  +--media //This folder contains all images embedded in the word
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //this document tells word where the images are situated
+  [Content_Types].xml
\--_rels
   \  .rels

Docx only gets one part of the document, in the method opendocx

def opendocx(file):
    '''Open a docx file, return a document XML tree'''
    mydoc = zipfile.ZipFile(file)
    xmlcontent = mydoc.read('word/document.xml')
    document = etree.fromstring(xmlcontent)
    return document

It only gets the document.xml file.

What I recommend you to do is:

  1. get the content of the document with **opendocx*
  2. Replace the document.xml with the advReplace method
  3. Open the docx as a zip, and replace the document.xml content's by the new xml content.
  4. Close and output the zipped file (renaming it to output.docx)

If you have node.js installed, be informed that I have worked on DocxGenJS which is templating engine for docx documents, the library is in active development and will be released soon as a node module.

Tylertylosis answered 25/7, 2013 at 15:46 Comment(1)
Thanks, I did another search for how to do this procedure you outlined and found this other question which shows how to do it: #16868094Tanyatanzania
T
3

Are you using the docx module from here?

If yes, then the docx module already exposes methods like replace, advReplace etc which can help you achieve your task. Refer to the source code for more details of the exposed methods.

Thither answered 25/7, 2013 at 6:37 Comment(1)
Yes, I'm using the module from there. I used replace to make the changes, but the problem I'm having is that I don't know how to save the changes to the file. This is what I have : 1)document=opendocx('sample.docx') 2)replace(document, "todo", "tada") 3)then what am I supposed to do to save the changes to the doc sample.docx while keeping the same original formatting of the document?Tanyatanzania
G
2
from docx import Document
file_path = 'C:/tmp.docx'
document = Document(file_path)

def docx_replace(doc_obj, data: dict):
    """example: data=dict(order_id=123), result: {order_id} -> 123"""
    for paragraph in doc_obj.paragraphs:
        for key, val in data.items():
            key_name = '{{{}}}'.format(key)
            if key_name in paragraph.text:
                paragraph.text = paragraph.text.replace(key_name, str(val))
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace(cell, data)

docx_replace(document, dict(order_id=123, year=2018, payer_fio='payer_fio', payer_fio1='payer_fio1'))
document.save(file_path)
Gigi answered 12/9, 2018 at 4:39 Comment(2)
Looks like no need key_name. Just if key in paragraph.text: is Okay. But the format seems will be missing.Supercool
I agree, I got it working by using key_name = str(key) I'm not sure if key could ever not be a string here but yeahOratorian
I
2

The problem with the methods above is that they lose the existing formatting. Please see my answer which performs the replace and retains formatting.

There is also python-docx-template which allows jinja2 style templating within a docx template. Here's a link to the documentation

Intendance answered 18/4, 2019 at 8:24 Comment(1)
Thanks, this worked flawlessly for me with a fairly complicated document.Blindfold
E
1

I've forked a repo of python-docx here, which preserves all of the preexisting data in a docx file, including formatting. Hopefully this is what you're looking for.

Elgon answered 3/11, 2013 at 6:5 Comment(0)
L
1

In addition to @ramil, you have to escape some characters before placing them as string values into the XML, so this worked for me:

def escape(escapee):
  escapee = escapee.replace("&", "&")
  escapee = escapee.replace("<", "&lt;")
  escapee = escapee.replace(">", "&gt;")
  escapee = escapee.replace("\"", "&quot;")
  escapee = escapee.replace("'", "&apos;")
return escapee
Liris answered 9/9, 2016 at 7:39 Comment(0)
C
0

We can use python-docx to keep an image on docx. docx detect image as a paragraph. But for this paragraph the text is empty. So you can use like this. paragraphs = document.paragraphs for paragraph in paragraphs: if paragraph.text == '': continue

Coelom answered 21/10, 2020 at 18:48 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.