An efficient way to convert documents to PDF format
C

4

22

I have been trying to find an efficient way to convert documents (e.g. doc, docx, ppt, pptx) to PDF. So far I have tried docsplit and oowriter, but both took more than 10 seconds to convert a 1.7 MB pptx file. Can anyone suggest a better tool, or ways to improve my approach?

What I have tried:

from subprocess import Popen, PIPE
import time

def convert(src, dst):
    d = {'src': src, 'dst': dst}
    commands = [
        '/usr/bin/docsplit pdf --output %(dst)s %(src)s' % d,
        'oowriter --headless -convert-to pdf:writer_pdf_Export %(dst)s %(src)s' % d,
    ]

    # Run each converter once and report its wall-clock time.
    for i, command in enumerate(commands, 1):
        st = time.time()
        process = Popen(command, stdout=PIPE, stderr=PIPE, shell=True) # I am aware of consequences of using `shell=True`
        out, err = process.communicate()
        if process.returncode != 0:
            raise Exception(err)
        en = time.time() - st
        print('Command %d: Completed in %.2f seconds' % (i, en))

if __name__ == '__main__':
    src = '/path/to/source/file/'
    dst = '/path/to/destination/folder/'
    convert(src, dst)

Output:

Command 1: Completed in 11.91 seconds
Command 2: Completed in 11.55 seconds

Environment:

  • Linux - Ubuntu 12.04
  • Python 2.7.3

More tools result:

Chrysarobin answered 2/1, 2014 at 21:0 Comment(11)
Note that this is not a real benchmark. A single result is not meaningful. Results should be reported as an average over many trials, along with at least the standard deviation. - Diphase
@Diphase Thanks for the clarification. I chose the wrong word. - Chrysarobin
Well, since you're interested in efficiency, "benchmark" is the right word to use, because that's the tool for measuring efficiency. So your code is wrong, not your words :) - Diphase
Yes, you are correct :P but I was just trying to give a simple scenario to show my problem. - Chrysarobin
I understand :) But you can never be sure that nothing "strange" happened during your single run: you received an e-mail, the OS decided to swap some memory pages to disk, the GC started its work. There are many possibilities :) - Diphase
The Microsoft and PDF formats are both very complex. 11 seconds might not be out of line. - Escalator
Are you trying to minimize the time of a single run or of a batch? - Accrescent
Does it make a difference if you run those commands in the shell instead of in Python? That is, if you run /usr/bin/docsplit pdf --output dst src without Python. - Fair
IMHO you should try running the code several times (e.g. 20) or run it on several similar files and take an average. You might benefit from OS caching (i.e. docsplit and oowriter might remain in memory between runs). - Turnsole
Actually, my aim is to run these commands from Python in a Django application. Whenever a user uploads a document that is not a PDF, I have to convert it to PDF first, so processing starts as soon as the user uploads a file. - Chrysarobin
Also, when a user uploads a file, a scheduled Celery task is created to convert that file to PDF. So it is the single-run time that needs to improve here. - Chrysarobin
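
For anyone who wants to follow the advice in the comments about repeated trials, here is a minimal sketch in Python 3 (not the Python 2.7 from the question). The names `benchmark` and `DEMO_COMMAND` are made up for illustration, and the demo command is only a harmless stand-in so the sketch runs without docsplit or oowriter installed.

```python
import statistics
import subprocess
import sys
import time

# Harmless stand-in command so the sketch runs anywhere; swap in the real
# converter command (e.g. the docsplit call from the question) to measure it.
DEMO_COMMAND = [sys.executable, "-c", "pass"]


def benchmark(command, trials=20):
    """Run `command` `trials` times and summarize the wall-clock timings."""
    if trials < 2:
        raise ValueError("need at least 2 trials to compute a standard deviation")
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(command, stdout=subprocess.PIPE,
                       stderr=subprocess.PIPE, check=True)
        timings.append(time.perf_counter() - start)
    return {
        "trials": trials,
        "mean": statistics.mean(timings),
        "stdev": statistics.stdev(timings),
        "min": min(timings),
        "max": max(timings),
    }
```

Calling `benchmark(DEMO_COMMAND, trials=5)` returns a dict with the mean and standard deviation in seconds. Note that the first run of a real converter may be slower than the rest because of OS caching, so it can be worth looking at `min` and `max` as well as the mean.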
E
18

Try calling unoconv from your Python code. It took about 8.6 seconds on my local machine; I don't know if that's fast enough for you:

time unoconv 15.\ Text-Files.pptx
real    0m8.604s
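
A rough sketch of what calling unoconv from Python could look like, assuming Python 3 and that `unoconv` is on the PATH. The function names and the 60-second timeout are hypothetical choices for illustration; `-f` (output format) and `-o` (output path) are unoconv's own options.

```python
import subprocess


def build_unoconv_command(src, dst):
    """Return the argument list for converting `src` to a PDF at `dst`."""
    # -f selects the output format, -o the output path.
    return ["unoconv", "-f", "pdf", "-o", dst, src]


def convert_with_unoconv(src, dst, timeout=60):
    """Run unoconv; raises CalledProcessError on a non-zero exit status."""
    # Passing a list (no shell=True) means spaces in file names need no escaping.
    subprocess.run(build_unoconv_command(src, dst), check=True, timeout=timeout)
    return dst


# Example (requires unoconv to be installed):
# convert_with_unoconv("15. Text-Files.pptx", "/tmp/15. Text-Files.pdf")
```

unoconv also has a listener mode (`unoconv --listener`) that keeps an office instance running in the background. Whether that actually reduces the per-file time on a given machine is something to measure rather than assume.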
Effusive answered 6/1, 2014 at 12:42 Comment(3)
Python Uno is the most reliable way to get decent PDF output from various MS Office document types. It uses the (Star|Libre|Open)Office backend to convert documents. In principle you can do more than just convert documents; you can incorporate basic routines as well. I would still use Uno very carefully, since office suites are known to be memory hogs. Do look through wiki.openoffice.org/wiki/PyUNO_bridge - Shirt
Thanks for your answer, I'll try it and let you know :) - Chrysarobin
I still want it to be faster :P but I think that is the best time so far. Thanks - Chrysarobin
J
3

Pandoc is a wonderful tool capable of doing what you'd like quickly. Since you're using Popen to effectively shell out the command for the tool, it doesn't matter what language the tool is written in (Pandoc is written in Haskell).
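
A minimal sketch of shelling out to Pandoc, assuming Python 3, a docx input (Pandoc has a docx reader), and a PDF engine such as a LaTeX distribution installed, which Pandoc needs for PDF output. Whether a given Pandoc version can read ppt or pptx is not something to take for granted; `pandoc --list-input-formats` shows what is supported. The function names and timeout are hypothetical.

```python
import subprocess


def build_pandoc_command(src, dst):
    """Return the argument list for `pandoc SRC -o DST`."""
    # Pandoc infers the output format from the extension of the -o target.
    return ["pandoc", src, "-o", dst]


def convert_with_pandoc(src, dst, timeout=60):
    """Run pandoc; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_pandoc_command(src, dst), check=True, timeout=timeout)
    return dst


# Example (requires pandoc and a PDF engine to be installed):
# convert_with_pandoc("report.docx", "report.pdf")
```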

Jorgan answered 9/1, 2014 at 16:26 Comment(2)
Thanks for your answer, I'll try it and let you know :) - Chrysarobin
Adding pypi.org/project/pypandoc for people still looking to do this. It removes the need to use Popen to shell out the command. - Microphyte
S
2

Unfortunately I don't have the time to do a full benchmark, but you may want to check out xtopdf, my Python toolkit for PDF creation. It doesn't do the full range of conversions you want, and some of the conversions have limitations, but it may be of use. xtopdf links:

Online presentation about xtopdf - a good summary of what it is, what it does, platforms, features, users, uses etc.: http://slid.es/vasudevram/xtopdf

xtopdf on Bitbucket: https://bitbucket.org/vasudevram/xtopdf

Many blog posts showing how to use xtopdf for various purposes, including many that show how to use it to convert different input formats to PDF: http://jugad2.blogspot.com/search/label/xtopdf

HTH, Vasudev Ram

Schoolfellow answered 7/1, 2014 at 18:1 Comment(5)
The DOCX conversion on xtopdf appears to extract the text only and strips formatting. Not amazingly useful. - Hua
@fatuhoku: Yes, it does just that. And that is what "some of the conversions have limitations," implies - as should be somewhat obvious if you had read my comment. I rely on libraries for most of the input format conversions, so if they have limitations, so does xtopdf in those cases. Straightforward. Also, not everything has to be "amazingly useful". Just "useful" is good enough for very many use cases - along with some tweaking with custom code or by hand, even. Happens all the time in real life. - Schoolfellow
Hey @Vasudev, I didn't mean to put down your project. It's true that I didn't read your whole answer. Too late to edit my comment. With a name like xtopdf, saying that it "doesn't do the full range of conversions" is actually an understatement, which prompted my comment for posterity. - Hua
No it isn't an understatement, because the x in the name stands for "solve for x" - which implies, like math equations involving x, that there may not be solutions for some values of x, or there may be, but they are not yet found - or not yet worked on :) Also, you admitted you didn't read my whole answer; and now you are changing the topic from one of those quoted phrases to another in midstream. - Schoolfellow
Also, the two phrases you quoted (from my answer) occur in the SECOND sentence of my answer (not somewhere much later). So, not only did you not read my whole answer, you did not even read the second sentence before commenting on it. And I even said "it may be of use" - not "will be of use" or "amazingly useful". So you are being overly critical without doing your homework - which is common on the Internet. - Schoolfellow
G
-1

For doc and docx (but not ppt/pptx), you could try our independent (but commercial) high fidelity rendering engine online at OnlineDemo/docx_to_pdf

By "high fidelity", I mean it is designed from the ground up to have the same line and paragraph breaks, tab stops etc etc as Microsoft Word.

Garganey answered 14/2, 2015 at 20:59 Comment(0)
