Copying .docx and preserving images

I am trying to copy the elements of one .docx file into another. The text part is easy; the images are where it gets tricky. The source document is simple: just some text and one image.

[Image: the source document, some text and one image]

from docx import Document
import io
doc = Document('/Users/neha/Desktop/testing.docx')


new_doc = Document()

for elem in doc.element.body:
    new_doc.element.body.append(elem)
new_doc.save('/Users/neha/Desktop/out.docx')

This copies the whole structure of the document into new_doc, but the image comes out blank:

[Image: the output document, with a blank image]

The good thing is that the blank image is in the right place, so I thought of getting the byte-level data of the original image and inserting it into the new doc. Here is how I extended the code above:

from docx import Document
import io
doc = Document('/Users/neha/Desktop/testing.docx')


new_doc = Document()

for elem in doc.element.body:
    new_doc.element.body.append(elem)

im = doc.inline_shapes[0]

blip = im._inline.graphic.graphicData.pic.blipFill.blip
rId = blip.embed


doc_part = doc.part
image_part = doc_part.related_parts[rId]
bytes = image_part._blob        #Here I get the byte level data for the image

im2 = new_doc.inline_shapes[0]
blip2 = im2._inline.graphic.graphicData.pic.blipFill.blip
rId2 = blip2.embed       
document_part2 = new_doc.part
document_part2.related_parts[rId2]._blob = bytes
new_doc.save('/Users/neha/Desktop/out.docx')

But the image still shows up empty in new_doc. What should I do from here?

Kornher answered 4/8, 2018 at 15:49 Comment(1)
Possibly relevant: python-docx.readthedocs.io/en/latest/user/shapes.html - Swatow

I figured out a solution a couple of days back. The text loses its formatting this way, but the images are placed correctly.

So the idea is: for each paragraph in the source doc, if there is text, I write it to the dest doc. If an inline image is present, I add a unique identifier at that place in the dest doc (see the docxtpl documentation for how these identifiers and contexts work). These identifiers and docxtpl proved particularly useful here. Using those unique identifiers I then build a 'context' (as shown below), which is basically a map from each unique identifier to its InlineImage, and finally I render this context.

Below is my code:

import xml.etree.ElementTree as ET

from docx import Document  # provided by the python-docx package
from docxtpl import DocxTemplate, InlineImage

# source_path, dest_path and self.img_path come from the surrounding code (not shown).
# DocxTemplate needs a template file to start from; template_path is a
# placeholder name for any (e.g. empty) .docx file.
dest = DocxTemplate(template_path)
# Recent docxtpl releases appear to load the underlying document lazily;
# without this call, dest.add_paragraph() can fail with a 'NoneType' AttributeError.
dest.init_docx()
source = Document(source_path)
context = {}
im_addresses = []

# Dump every inline image of the source document to disk.
for count, im in enumerate(source.inline_shapes):
    blip = im._inline.graphic.graphicData.pic.blipFill.blip
    rId = blip.embed
    image_part = source.part.related_parts[rId]
    byte_data = image_part._blob
    image_name = self.img_path + "img_" + "_" + str(count) + ".jpeg"

    # The with block closes the file, so no explicit close() is needed.
    with open(image_name, "wb") as fh:
        fh.write(byte_data)

    im_addresses.append(image_name)

namespace = {'wp': "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"}
im_idx = 0

for para in source.paragraphs:
    p = dest.add_paragraph()
    r = p.add_run()
    if para.text:
        r.add_text(para.text)

    # Look for an inline image in this paragraph's XML.
    root = ET.fromstring(para._p.xml)
    inlines = root.findall('.//wp:inline', namespace)

    if len(inlines) > 0:
        # Leave a placeholder that docxtpl replaces with the image on render.
        uid = "img_" + str(im_idx)
        r.add_text("{{ " + uid + " }}")
        context[uid] = InlineImage(dest, im_addresses[im_idx])
        im_idx += 1

try:
    dest.render(context)
except Exception as e:
    print(e)
dest.save(dest_path)

PS: If a paragraph contains two or more images, this code falls short, because it inserts only one placeholder per paragraph. The following part would have to change:

if len(inlines) > 0:
    uid = "img_" + str(im_idx)
    r.add_text("{{ " + uid + " }}")
    context[uid] = InlineImage(dest, im_addresses[im_idx])
    im_idx += 1

You would have to add a for loop inside the if statement. I didn't need that, since my images were usually big enough that they always ended up in separate paragraphs. Just a side note for anyone who may need it.
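For anyone who does need several images in one paragraph, a minimal sketch of that inner loop could look like the following. It works on the paragraph XML only, so it runs with the standard library alone. `placeholders_for_paragraph` is a hypothetical helper name, not part of python-docx or docxtpl, and `start_idx` plays the role of `im_idx` above.

```python
import xml.etree.ElementTree as ET

WP_NS = {"wp": "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"}


def placeholders_for_paragraph(paragraph_xml, start_idx):
    """Return one placeholder uid per inline image found in the paragraph XML.

    Hypothetical helper: paragraph_xml is the string that para._p.xml yields
    in the code above, start_idx is the running image counter (im_idx).
    """
    root = ET.fromstring(paragraph_xml)
    inlines = root.findall(".//wp:inline", WP_NS)
    return ["img_" + str(start_idx + offset) for offset in range(len(inlines))]


# Tiny hand-written paragraph with two inline images, for illustration only.
sample = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    'xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">'
    "<w:r><w:drawing><wp:inline/></w:drawing></w:r>"
    "<w:r><w:drawing><wp:inline/></w:drawing></w:r>"
    "</w:p>"
)
uids = placeholders_for_paragraph(sample, 3)
print(uids)  # ['img_3', 'img_4']
```

In the loop over the paragraphs, each returned uid would get its own r.add_text("{{ " + uid + " }}") call and its own context entry, and im_idx would advance by len(uids). This assumes that source.inline_shapes yields the images in the same order in which they appear in the paragraphs.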

Cheers!

Kornher answered 11/8, 2018 at 19:2 Comment(2)
Question: where do you get the import Document from? As far as I know this is not a standard package. - Haggerty
Also: Exception has occurred: AttributeError 'NoneType' object has no attribute 'add_paragraph'. Any idea? - Haggerty

You could try:

  1. Extract the images from the first document by unzipping the .docx file (per How can I search a word in a Word 2007 .docx file?)
  2. Save those images to the file system (as foo.png, for instance)
  3. Generate the new .docx file with Python and add the .png file using document.add_picture('foo.png').
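
A .docx file is a zip archive, so steps 1 and 2 can be done with Python's standard zipfile module on any platform, macOS included. A minimal sketch, assuming the images live under word/media/ inside the archive (the usual location); `extract_docx_images` is a hypothetical helper name:

```python
import os
import zipfile


def extract_docx_images(docx_file, out_dir):
    """Copy every file under word/media/ in the .docx archive into out_dir.

    docx_file may be a path or a binary file-like object.
    Returns the list of written file paths, in archive order.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            # Skip everything that is not a media file (and directory entries).
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = os.path.join(out_dir, os.path.basename(name))
            with open(target, "wb") as fh:
                fh.write(archive.read(name))
            written.append(target)
    return written
```

Each returned path can then be passed to document.add_picture() as in step 3. The sketch does not tell you where in the text each image belonged.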
Musketry answered 8/8, 2018 at 18:22 Comment(3)
Thanks for your answer. You don't show how to figure out the exact location at which foo.png is inserted in the doc. Maybe you have figured it out and didn't add it here... Also, is unzipping on macOS using Python an option? How? I am all ears for that. Anyway, I figured out a way a couple of days back: I used the docxtpl library along with parsing the XML tree of the .docx file. I will post the answer soon... - Kornher
Good point - I guess this approach is sub-optimal unless you were manually creating the second document. Maybe you could diff the two .docx archives after the creation of the second one to determine if the pointer to the .png file is correct? - Musketry
Have posted my answer. Do let me know what you think :) - Kornher

This problem is solved by the docxtpl package: https://docxtpl.readthedocs.io/en/latest/

Stupor answered 18/8, 2022 at 7:3 Comment(0)
