Convert DOCX Bytestream to PDF Bytestream Python
Asked Answered
N

2

7

I currently have a program that generates a .docx document using the python-docx library.

Upon completing the building of the .docx file I save it into a Bytestream as so

file_stream = io.BytesIO()
document.save(file_stream)
file_stream.seek(0)

Now, I need to convert this word document into a PDF. I have looked at a few different libraries for conversion such as docx2pdf or even doing it manually using comtypes as so

import sys
import os
import comtypes.client

wdFormatPDF = 17

in_file = "Input_file_path.docx"
out_file = "output_file_path.pdf"

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

The problem is, I need to do this conversion in memory and cannot physically save the DOCX or the PDF to the machine. Every converter I've seen requires a filepath to the physical document on the machine and I do not have that.

Is there a way I can convert the DOCX filestream into a PDF stream just in memory?

Thanks

Nieberg answered 4/9, 2020 at 22:9 Comment(1)
Have you found an answer for this? i have a similar requirementElatia
W
6

This method is a little convoluted, but it works entirely in memory, and you get the option to add custom CSS to style the final document.
Convert the DOCX bytestream to HTML using mammoth, and the resulting HTML to PDF using pdfkit.

Here's an example

# create a dummy docx file
from docx import Document
document = Document()
document.add_paragraph('Lorem ipsum dolor sit amet.')

# create a bytestream
import io
file_stream = io.BytesIO()
document.save(file_stream)
file_stream.seek(0)

# convert the docx to html
import mammoth
result = mammoth.convert_to_html(file_stream)

# >>> result.value
# >>> '<p>Lorem ipsum dolor sit amet.</p>'

# convert html to pdf
import pdfkit
pdf = pdfkit.from_string(result.value)

If you want to output the stream to a file, just do

with open('test.pdf','wb') as file:
    file.write(pdf)
Wnw answered 16/11, 2021 at 21:33 Comment(1)
Does this even show imagesHypnosis
H
-4

i think you want to convert docx file to pdf file , then follow these steps:

Install the package:

pip install docx2pdf

Code:

from docx2pdf import convert

convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")

Thanks and if I am mistaken please tell me to correct it🙌

Hydromancy answered 4/9, 2020 at 22:15 Comment(2)
The question was if the conversion could be done in memory. This is still with files. OP knows how to convert already, just not without involving files.Cephalic
Thanks for the reply Revisto. Yes I saw that with docx2pdf, but I need to do it fully in memoryNieberg

© 2022 - 2024 — McMap. All rights reserved.