Headless LibreOffice very slow to export to PDF on Windows (6 times slower than on Linux)

I often need to export many (> 1000) .docx documents to PDF with LibreOffice. Here is a sample document: test.docx. The following code works but it's quite slow on Windows (3.3 seconds on average for each PDF document):

import subprocess, docx, time   # first do: pip install python-docx 
for i in range(10):
    doc = docx.Document('test.docx')
    for paragraph in doc.paragraphs:
        paragraph.text = paragraph.text.replace('{{num}}', str(i))
    doc.save('test%i.docx' % i)   # these 4 previous lines are super fast - a few ms
    t0 = time.time()
    subprocess.call([r'C:\Program Files\LibreOffice\program\soffice.exe',
        '--headless', '--convert-to', 'pdf', 'test%i.docx' % i, '--outdir', '.',
        '--nocrashreport', '--nodefault', '--nofirststartwizard',
        '--nolockcheck', '--nologo', '--norestore'])
    print('PDF generated in %.1f sec' % (time.time()-t0))

    # for linux:
    # (0.54 seconds on average, so it's 6 times better than on Windows!)
    # subprocess.call(['/usr/bin/soffice', '--headless', '--convert-to', 'pdf', '--outdir', '/home/user', 'test%i.docx' % i])  

How to speed up this PDF export on Windows?

I suspect most of the time is wasted on repeating the cycle "start LibreOffice/Writer, do the job, close LibreOffice" once per document.

Notes:

  • As a comparison: here: https://bugs.documentfoundation.org/show_bug.cgi?id=92274 the export time is said to be either 90ms or 810ms.

  • Replacing soffice.exe with swriter.exe gives the same result, 3.3 seconds on average:

    subprocess.call([r'C:\Program Files\LibreOffice\program\swriter.exe', '--headless', '--convert-to', 'pdf', 'test%i.docx' % i, '--outdir', '.'])
    
Tubate answered 26/4, 2020 at 20:30 Comment(1)
same on macOS and Linux (Questionnaire)

Indeed, all the time is wasted in starting/quitting LibreOffice. We can instead pass many docx documents in one call of soffice.exe:

import subprocess, docx
for i in range(1000):
    doc = docx.Document('test.docx')
    for paragraph in doc.paragraphs:
        paragraph.text = paragraph.text.replace('{{num}}', str(i))
    doc.save('test%i.docx' % i)

# all PDFs in one pass:
subprocess.call([r'C:\Program Files\LibreOffice\program\swriter.exe',  # raw string keeps the backslashes literal
    '--headless', '--convert-to', 'pdf', '--outdir', '.'] + ['test%i.docx' % i for i in range(1000)])

107 seconds total, so it's ~ 107 ms on average per PDF, far better!

Notes:

  • It does not work with 10,000 documents because the length of the command line arguments would exceed 32k characters as explained here

  • I wonder if it's possible to have a more interactive way to work with LibreOffice headless:

    • start Writer headless, keep it started
    • send an action like open test1.docx to this process
    • send action export to pdf, and close docx
    • send open test2.docx, then export, etc.
    • ...
    • quit Writer headless

       

    This works with MS Office through COM (Component Object Model), see ".doc to pdf using python", but I wonder whether something similar exists for LibreOffice. The answer seems to be no: see "Does LibreOffice/OpenOffice Support the COM Model".
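
One possible workaround for the command-line length limit mentioned in the first note is to split the file list into batches, so that each call stays below the limit while LibreOffice is still started only a few times. This is a minimal sketch, not something I timed: the 30,000-character budget is my own safety margin under the ~32k limit, the path is the one from the question, and `batches` / `convert_all` are hypothetical helper names.

```python
import subprocess

# Path taken from the question; adjust to your installation.
SOFFICE = r'C:\Program Files\LibreOffice\program\soffice.exe'
# Assumed safety margin below the ~32k-character Windows limit.
MAX_CMD_CHARS = 30000

def batches(filenames, base_args, max_chars=MAX_CMD_CHARS):
    """Yield lists of filenames so that base_args + batch fits in max_chars."""
    # list2cmdline builds the string that Windows would receive for a list of args.
    base_len = len(subprocess.list2cmdline(base_args))
    batch, length = [], base_len
    for name in filenames:
        extra = len(subprocess.list2cmdline([name])) + 1  # +1 for the separating space
        if batch and length + extra > max_chars:
            yield batch
            batch, length = [], base_len
        batch.append(name)
        length += extra
    if batch:
        yield batch

def convert_all(filenames, outdir='.'):
    """Convert all files to PDF, starting LibreOffice once per batch."""
    base_args = [SOFFICE, '--headless', '--convert-to', 'pdf', '--outdir', outdir]
    for batch in batches(filenames, base_args):
        subprocess.call(base_args + batch)
```

Usage would be something like `convert_all(['test%i.docx' % i for i in range(10000)])`. With short names like these, that should need only a handful of LibreOffice launches instead of 10,000.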

Tubate answered 27/4, 2020 at 9:51 Comment(3)
Posted a solution that handles multiple documents without restarting soffice: https://mcmap.net/q/218998/-how-to-use-libreoffice-api-uno-with-python-windows. (Kozloski)
You can start LibreOffice as a service and use it without the starting/quitting time: github.com/unoconv/unoserver (Guffey)
@Guffey This is a really good idea, can you post it as a new answer? I think it would be interesting for future reference. (Tubate)
