Multiple .doc to .docx file conversion using Python

I want to convert all the .doc files in a particular folder to .docx files.

I tried the following code:

import subprocess
import os
for filename in os.listdir(os.getcwd()):
    if filename.endswith('.doc'):
        print filename
        subprocess.call(['soffice', '--headless', '--convert-to', 'docx', filename])

But it gives me an error: OSError: [Errno 2] No such file or directory

Abed answered 19/7, 2016 at 21:0 Comment(0)
L
22

Here is a solution that worked for me. The other solutions proposed did not work on my Windows 10 machine using Python 3.

from glob import glob
import re
import os
import win32com.client as win32
from win32com.client import constants

# Create list of paths to .doc files
paths = glob('C:\\path\\to\\doc\\files\\**\\*.doc', recursive=True)

def save_as_docx(path):
    # Opening MS Word
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(path)
    doc.Activate()

    # Rename path with .docx
    new_file_abs = os.path.abspath(path)
    new_file_abs = re.sub(r'\.\w+$', '.docx', new_file_abs)

    # Save and Close
    word.ActiveDocument.SaveAs(
        new_file_abs, FileFormat=constants.wdFormatXMLDocument
    )
    doc.Close(False)

for path in paths:
    save_as_docx(path)
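
The renaming step in save_as_docx can also be written without a regular expression. This is a minimal sketch using only the standard library; the helper name docx_target is hypothetical and is not part of the answer above.

```python
from pathlib import Path


def docx_target(path):
    """Return the absolute path of the .docx file that sits next to `path`."""
    # resolve() makes the path absolute, which is the safer choice when the
    # result is handed to another application such as Word.
    # with_suffix() replaces only the final extension.
    return str(Path(path).resolve().with_suffix(".docx"))
```

Calling docx_target on a path ending in report.v1.doc gives a path ending in report.v1.docx, since only the last extension is swapped.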
Lagging answered 16/2, 2018 at 18:40 Comment(3)
I am getting this error: com_error: (-2147352567, 'Exception occurred.', (0, 'Microsoft Word', "Sorry, we couldn't find your file. Was it moved, renamed, or deleted?\r (C:\\//Users/shreyajain/Documents/Docum...)", 'wdmain11.chm', 24654, -2146823114), None) Any suggestions? - Mithraism
@Shreyansjain Based on the error message, I'm guessing you typed the file path incorrectly, although it's difficult to tell without seeing your code. - Lagging
1) This also lets you convert PDF files to DOCX, so you can read the content of PDF documents. 2) I would suggest adding a try at the start of the program to check that MS Word is installed: MSWord_OK = True try: word = win32.gencache.EnsureDispatch('Word.Application') - Incumbency
H
5

I prefer to use the glob module for tasks like that. Put this in a file named doc2docx.py. To make it executable, run chmod +x doc2docx.py. Optionally, put the file in a directory on your $PATH to make it available "everywhere".

#!/usr/bin/env python

import glob
import subprocess

for doc in glob.iglob("*.doc"):
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', doc])

Though ideally you'd leave the expansion to the shell itself, and call doc2docx.py with the files as arguments, like doc2docx.py *.doc:

#!/usr/bin/env python

import subprocess
import sys

if len(sys.argv) < 2:
    sys.stderr.write("SYNOPSIS: %s file1 [file2] ...\n" % sys.argv[0])
    sys.exit(1)  # nothing to convert, so stop after printing the usage line

for doc in sys.argv[1:]:
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', doc])

As requested by @pyd, to output to a target directory myoutputdir use:

#!/usr/bin/env python

import subprocess
import sys

if len(sys.argv) < 2:
    sys.stderr.write("SYNOPSIS: %s file1 [file2] ...\n" % sys.argv[0])
    sys.exit(1)  # nothing to convert, so stop after printing the usage line

for doc in sys.argv[1:]:
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', '--outdir', 'myoutputdir', doc])
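
The scripts above ignore whether a conversion actually produced a file. As a hedged sketch, one way to check is to look for the expected output afterwards. The function names here are hypothetical, and the assumption that soffice names the result after the input file's base name should be verified against your LibreOffice version.

```python
import os
import subprocess


def build_command(doc, outdir):
    """Argument list for soffice, using the same flags as the script above."""
    return ["soffice", "--headless", "--convert-to", "docx",
            "--outdir", outdir, doc]


def expected_output(doc, outdir):
    """Path where the converted file is assumed to be written."""
    stem = os.path.splitext(os.path.basename(doc))[0]
    return os.path.join(outdir, stem + ".docx")


def convert(doc, outdir="myoutputdir"):
    """Run the conversion and report whether the .docx file exists afterwards.

    Note: a stale .docx left over from an earlier run also counts as present.
    """
    subprocess.call(build_command(doc, outdir))
    return os.path.exists(expected_output(doc, outdir))
```

A loop such as `failed = [d for d in sys.argv[1:] if not convert(d)]` would then collect the inputs that need a second look.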
Hodden answered 20/7, 2016 at 5:4 Comment(2)
From my tests this only fails when the working/target directory in question is the root of the filesystem, e.g. directly C:\ or D:\. Any other folder works fine. Looks like a bug in soffice. You can specify the output directory by using the option --outdir <directory-name>. - Hodden
Do I need to pass one more argument? Can you edit your answer? - Shiff
D
3

If you would rather not rely on subprocess calls, here is a version that uses a COM client. It is useful if you are targeting Windows users who have Microsoft Word but not LibreOffice installed.

#!/usr/bin/env python

import glob
import os  # needed for os.path.abspath below
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.visible = 0

for i, doc in enumerate(glob.iglob("*.doc")):
    in_file = os.path.abspath(doc)
    wb = word.Documents.Open(in_file)
    out_file = os.path.abspath("out{}.docx".format(i))
    wb.SaveAs2(out_file, FileFormat=16) # file format for docx
    wb.Close()

word.Quit()
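
Because this version needs Windows with Word, while the soffice version needs LibreOffice, a script meant for more than one platform could choose between the two at run time using sys.platform. A minimal sketch; the function name and the returned labels are hypothetical, and macOS behaviour is an untested assumption.

```python
import sys


def pick_backend(platform=None):
    """Return a label for the conversion approach to try on this platform."""
    if platform is None:
        platform = sys.platform
    if platform.startswith("win"):
        return "word-com"  # win32com plus an installed Microsoft Word
    return "soffice"       # LibreOffice command line (Linux; macOS untested)
```

The caller would then dispatch to the COM loop above or to a subprocess call to soffice depending on the label.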
Duce answered 1/2, 2018 at 23:37 Comment(3)
It is clean. However, is there any platform-independent way to convert .doc to .docx? - Whippet
@Whippet There was no Swiss Army knife library for this when I looked last year. One possible method is to detect the OS with sys.platform and use Jan Christoph Terasa's approach on Linux and my approach on Windows. Not sure what works on Mac. - Duce
Just tried unoconv with this Docker image, doc -> docx, but the resulting docx was damaged (files contained comments that I needed to preserve). - Nicholnichola
R
2

Based on dshefman's code, this version takes the folder from the command line and walks its subfolders:

import re
import os
import sys
import win32com.client as win32
from win32com.client import constants

# Get the path from the command line argument; make it absolute so Word can locate the files
ABS_PATH = os.path.abspath(sys.argv[1])

def save_as_docx(path):
    # Opening MS Word
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(path)
    doc.Activate()

    # Rename path with .docx
    new_file_abs = os.path.abspath(path)
    new_file_abs = re.sub(r'\.\w+$', '.docx', new_file_abs)

    # Save and Close
    word.ActiveDocument.SaveAs(new_file_abs, FileFormat=constants.wdFormatXMLDocument)
    doc.Close(False)

def main():
    source = ABS_PATH

    for root, dirs, filenames in os.walk(source):
        for f in filenames:
            filename, file_extension = os.path.splitext(f)

            if file_extension.lower() == ".doc":
                file_conv = os.path.join(root, f)
                save_as_docx(file_conv)
                print("%s ==> %sx" %(file_conv,f))

if __name__ == "__main__":
    main()
Ramberg answered 12/5, 2020 at 2:6 Comment(0)
E
1

Use os.path.join to specify the correct directory.

import os, subprocess

main_dir = os.path.join('/', 'Users', 'username', 'Desktop', 'foldername')

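for filename in os.listdir(main_dir):
    if filename.endswith('.doc'):
        # os.listdir returns bare names, so join them with the directory
        doc_path = os.path.join(main_dir, filename)
        print(doc_path)
        subprocess.call(['soffice', '--headless', '--convert-to', 'docx',
                         '--outdir', main_dir, doc_path])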
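
When subprocess.call itself raises OSError: [Errno 2], the program it was asked to run could not be found, so it is also worth checking that soffice is on PATH. A minimal sketch using shutil.which; the function name is hypothetical, and the alternative command name libreoffice is an assumption that may not apply to every installation.

```python
import shutil


def find_soffice(candidates=("soffice", "libreoffice")):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)  # available since Python 3.3
        if path:
            return path
    return None
```

If find_soffice() returns None, pass the full path of the executable as the first element of the argument list instead of the bare name 'soffice'.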
Explanatory answered 19/7, 2016 at 21:4 Comment(0)
B
0

This version uses doc2docx, which I believe only works on Windows or macOS. I believe it is the cleanest version so far, if you can use Windows. To use it, you must install doc2docx first, which can be done from Anaconda (or pip).

import doc2docx
from glob import glob
import os
def convert_doc_to_docx(folder):
    # Stores all doc files to be removed later
    doc_files = glob('{}/*.doc'.format(folder))
    
    # Now do the conversion. Note that doc2docx converts all files in a given folder
    doc2docx.convert(folder)

    # Remove all old doc_files
    for doc_file in doc_files:
        os.remove(doc_file)

convert_doc_to_docx('C:/Users/user/folder_containing_doc_files/')
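
The function above deletes every original even if a conversion failed. A more cautious sketch, assuming each converted file keeps the original name with an added x, removes a .doc only when its .docx counterpart exists. The function name is hypothetical.

```python
import os
from glob import glob


def remove_converted_docs(folder):
    """Delete each .doc in `folder` only if a matching .docx exists.

    Returns the list of .doc paths that were kept because no .docx was found.
    """
    kept = []
    for doc_file in glob(os.path.join(folder, "*.doc")):
        if os.path.exists(doc_file + "x"):  # report.doc -> report.docx
            os.remove(doc_file)
        else:
            kept.append(doc_file)
    return kept
```

Anything in the returned list can be inspected or retried before it is deleted by hand.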
Browning answered 22/1 at 14:23 Comment(0)
