LibreOffice convert .docx to .pdf in parallel not working well

I have a lot of docx files to convert to PDF. Converting them one by one takes a long time, so I wrote a Python script to convert them in parallel:

from subprocess import Popen
import time
import os

os.chdir(os.path.dirname(__file__))

output_dir = './outputs'
source_file_format = './docs/example_{}.docx'

po_list = [Popen(
    f"/Applications/LibreOffice.app/Contents/MacOS/soffice --invisible --convert-to pdf --outdir {output_dir} {source_file_format.format(i)}",
    shell=True)
    for i in range(0, 7, 1)]

while po_list:
    time.sleep(0.01)
    # Iterate over a copy so finished processes can be removed safely.
    for p in list(po_list):
        status = p.poll()
        if status is None:
            continue  # still running
        if status == 0:
            print('Succeeded: [{}] {}'.format(status, p.args))
        else:
            print('Failed: {} : {}'.format(p.args, status))
        po_list.remove(p)

But each time I run this script, only some of the docx files are converted successfully. The remaining conversion processes do not even report an error.

Trickster answered 20/3, 2021 at 8:24 Comment(2)
To see where the failure happens, replace the call to LibreOffice with a script that mimics the conversion (write something into the output directory and take some time), and check the result. If all files are there, it seems to be a problem with LibreOffice. If files are missing, it's your script. If it's LibreOffice, I would confirm it like this: open multiple shells, prepare a command line in each of them, and then start all of them as fast as possible. - Elysia
I have the same issue. I tested a simple function and the parallel part works, but when I try with LibreOffice I run into the problem:

from joblib import Parallel, delayed
import os
import subprocess

def convert_docdocx_to_pdf(file_to_convert: str, output_folder: str):
    """Convert a doc or docx document to pdf using Libre Office"""
    result = subprocess.call(['lowriter', '--convert-to', 'pdf', '--outdir', output_folder, file_to_convert])
    return result

Parallel(n_jobs=2, prefer="threads", timeout=60)(delayed(convert_docdocx_to_pdf)(file, os.path.dirname(file)) for file in files)

- Loreanloredana

We were also stuck on the same issue for some time.

Multiple instances of LibreOffice share the same user profile, stored in the UserInstallation directory, so parallel conversions interfere with each other (the concurrent processes seem to get mixed up).

Using a different directory for each LibreOffice instance solved this issue. You can set it through the UserInstallation variable, which is passed on the command line as: "-env:UserInstallation=file:///d:/tmp/p0/"

You can automate this by appending your loop variable, or any other unique identifier, to the directory name.
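
As a rough sketch of that idea applied to the script from the question: the helper names build_command and convert_all are made up for illustration, the soffice path is the one from the question, and --headless is used here in place of --invisible as an assumption about what suits batch conversion. Each process gets its own throwaway profile directory, and the command is passed as an argument list so no shell quoting is needed.

```python
import subprocess
import tempfile
from pathlib import Path

# Path taken from the question; adjust it for your platform.
SOFFICE = "/Applications/LibreOffice.app/Contents/MacOS/soffice"


def build_command(source, output_dir, profile_dir, soffice=SOFFICE):
    """Return the soffice argument list for one conversion with a private profile."""
    # UserInstallation expects a file:// URL, so convert the absolute path.
    profile_uri = Path(profile_dir).resolve().as_uri()
    return [
        soffice,
        f"-env:UserInstallation={profile_uri}",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(output_dir),
        str(source),
    ]


def convert_all(sources, output_dir):
    """Start one soffice process per file and return the exit codes in order."""
    # The profile directories are temporary and removed once all processes end.
    with tempfile.TemporaryDirectory() as profiles_root:
        procs = []
        for i, source in enumerate(sources):
            # The loop index makes each profile directory unique.
            profile_dir = Path(profiles_root) / f"profile_{i}"
            cmd = build_command(source, output_dir, profile_dir)
            procs.append(subprocess.Popen(cmd))
        return [p.wait() for p in procs]
```

With the layout from the question, this could be called as convert_all([f"./docs/example_{i}.docx" for i in range(7)], "./outputs"). Starting one process per file is only reasonable for small batches; for many files you would probably want to cap the number of concurrent processes.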

Reference: https://ask.libreoffice.org/en/question/42975/how-can-i-run-multiple-instances-of-sofficebin-at-a-time/

Necropsy answered 17/5, 2021 at 17:55 Comment(0)
