Having trouble using Python and LibreOffice to convert pdf to docx and doc to docx

I have spent a good amount of time trying to work out exactly what is going wrong with the code I am using to convert pdf to docx (and doc to docx) with LibreOffice.

I have tried running the relevant commands both from the Windows Run dialog and from Python, and neither works.

I have LibreOffice v6.0.2 installed on Windows.

I have been using variations of this code to try to convert some pdf files to docx (the specific pdf file is not really relevant):

    import subprocess
    lowriter='C://Program Files/LibreOffice/program/swriter.exe'
    subprocess.run('{} --invisible --convert-to docx --outdir "{}" "{}"'
                   .format(lowriter, 'dir', 'filepath.pdf'), shell=True)

I have tried this both from the Windows Run dialog and through Python using the code above, with no luck. I have also tried without --outdir, in case I was writing that part incorrectly, but I always get a return code of 1:

    CompletedProcess(args='C://Program Files/LibreOffice/program/swriter.exe 
    --invisible --convert-to docx --outdir "{dir}" 
    {filepath.pdf}"', returncode=1)

Here dir and filepath.pdf are placeholders I have substituted for the real paths.

I have a similar problem with the doc to docx conversion.

Ideality answered 9/4, 2018 at 18:18 Comment(2)
So it's LibreOffice that doesn't work (either it doesn't support this functionality from the command line, or you don't know how to call it). I'd suggest first getting the task to work (if possible) from cmd, and only then going to the next step: wrapping that call from another language. - Larynx
Hmmm yes, perhaps that should have been clearer in the description, but given the various threads I've seen on the topic, it should be working. Another link - Ideality

There are a number of problems here. You should first get the --convert-to call to work from the command line, as @CristiFati commented, and only then implement it in Python.

Here is the code that works on my system. The path should not contain //, and it needs to be quoted because it contains spaces. Also, the folder is named LibreOffice 5 on my system.

    import subprocess
    lowriter = 'C:/Program Files (x86)/LibreOffice 5/program/swriter.exe'
    subprocess.run(
        '"{}" --convert-to docx --outdir "{}" "{}"'
        .format(lowriter, 'dir', 'filepath.doc'), shell=True)
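
As a side note, the quoting problem can be avoided altogether by passing the arguments as a list instead of one shell string. The following is only a sketch: the helper names build_convert_command and convert are made up for illustration, and the soffice path is an assumption that depends on where LibreOffice is installed on your machine.

```python
import subprocess

# Assumed install location (hypothetical for your machine); adjust as needed.
SOFFICE = r"C:\Program Files\LibreOffice\program\soffice.exe"


def build_convert_command(soffice, target_format, outdir, source):
    """Return the argument list for a LibreOffice command-line conversion.

    Each argument is its own list item, so paths containing spaces
    need no extra quoting and shell=True is not required.
    """
    return [
        str(soffice),
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(outdir),
        str(source),
    ]


def convert(soffice, target_format, outdir, source):
    """Run the conversion and return the CompletedProcess for inspection."""
    cmd = build_convert_command(soffice, target_format, outdir, source)
    # Capture stdout/stderr so any message from LibreOffice can be printed.
    return subprocess.run(cmd, capture_output=True, text=True)


if __name__ == "__main__":
    result = convert(SOFFICE, "docx", "dir", "filepath.doc")
    print(result.returncode, result.stdout, result.stderr)
```

Printing stdout and stderr alongside the return code gives a bit more to go on than a bare returncode=1.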

Finally, it looks like converting from PDF to DOCX is not supported. LibreOffice Draw can open a PDF file and save as ODG format.

EDIT:

Here is working code to convert from PDF. I upgraded to LO 6, so the version number ("LibreOffice 5") is no longer required in the path.

    import subprocess
    loffice = 'C:/Program Files/LibreOffice/program/soffice.exe'
    subprocess.run(
        '"{}" --convert-to odg --outdir "{}" "{}"'
        .format(loffice, 'dir', 'filepath.pdf'), shell=True)

The resulting file is filepath.odg.
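
To tell a real conversion apart from a run that merely did not raise an error, one option is to check for the output file afterwards. This is a rough sketch with hypothetical helper names (expected_output, conversion_succeeded); it assumes LibreOffice writes the result into the --outdir folder under the source file's base name with the new extension.

```python
from pathlib import Path


def expected_output(source, outdir, target_format):
    """Path where the converted file is assumed to appear.

    If target_format carries a filter name (for example "docx:Some Filter"),
    only the part before the colon is used as the extension.
    """
    extension = target_format.split(":")[0]
    return Path(outdir) / (Path(source).stem + "." + extension)


def conversion_succeeded(source, outdir, target_format):
    """True if the expected output file exists and is not empty."""
    out = expected_output(source, outdir, target_format)
    return out.is_file() and out.stat().st_size > 0


if __name__ == "__main__":
    # Placeholders from the answer above, not real paths.
    print(expected_output("filepath.pdf", "dir", "odg"))
    print(conversion_succeeded("filepath.pdf", "dir", "odg"))
```

Calling conversion_succeeded right after subprocess.run covers the case where the return code alone is not informative.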

Hydrazine answered 10/4, 2018 at 21:24 Comment(4)
You mean you were able to convert to ODG format with the code, or it just did not throw any errors? Yeah, I am not having any luck with --convert-to on the command line either, although according to the LibreOffice documentation it should be working. - Ideality
Thank you very much for the help. This worked, although to be honest it does not help me much. It does establish that the problem is the conversion to doc/docx, but that is the format I needed most. Any idea why it is suggested as possible in the links I posted above? And do you know of any way to convert this odg further, namely to docx? This is just wishful thinking on my part, as extracting tables from docx is extremely simple (as I've tested on some manually converted pdfs). - Ideality
The comments by Steeve and Michael Peter on the first link you posted suggest that it no longer works. Presumably, it worked in an earlier version. - Hydrazine
Confirmed: Apache OpenOffice 4.1.3 can convert to word processing formats. However, the output is not good. You're probably better off with ODG using LibreOffice. - Hydrazine

Install the pdf2docx package in Python (pip install pdf2docx) and use its Converter class:

    from pdf2docx import Converter

    source      = r'C:\Users\sd\Desktop\New Project/Document2.pdf'
    destination = r'C:\Users\sd\Desktop\New Project/sample_6.docx'

    def Converter_pdf2docx(source, destination):
        # Convert all pages of the PDF (start=0, end=None) to a DOCX file.
        cv = Converter(source)
        cv.convert(destination, start=0, end=None)
        cv.close()

    Converter_pdf2docx(source, destination)

Beavers answered 28/1, 2021 at 9:16 Comment(1)
The pdf2docx library is convenient for conversion, but the problem is format corruption: the formatting after conversion differs from the PDF version. - Unlade
