How to convert PDF to Word using Acrobat SDK? [closed]

3

7

My .Net application needs to convert a PDF document to Word format programmatically.

I evaluated several products and found Acrobat X Pro, which offers a Save As option that can save the document in Word or Excel format. I tried to use the Acrobat SDK but couldn't find documentation showing where to start.

I looked into their IAC sample but couldn't work out how to call the menu item and make it execute the Save As option.

Suckle answered 5/7, 2012 at 9:3 Comment(0)
15

You can do this with Acrobat X Pro, but you need to call the Acrobat JavaScript API from C#.

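 // Requires: using System; using System.Globalization; using System.Reflection; using Acrobat;
 AcroPDDoc pdfd = new AcroPDDoc();
 pdfd.Open(sourceDoc.FileFullPath);
 Object jsObj = pdfd.GetJSObject();
 Type jsType = jsObj.GetType(); // reflect on the JSObject itself, not on the AcroPDDoc
 // have to use the Acrobat JavaScript API (saveAs) for the export
 object[] saveAsParam = { "newFile.doc", "com.adobe.acrobat.doc", "", false, false };
 jsType.InvokeMember("saveAs", BindingFlags.InvokeMethod | BindingFlags.Public | BindingFlags.Instance, null, jsObj, saveAsParam, CultureInfo.InvariantCulture);
 pdfd.Close(); // release the lock on the source file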

Hope that helps.

Emmie answered 12/12, 2012 at 20:18 Comment(3)
Hi, I have done the same thing, thank you for your answer. But the process takes quite a long time to finish: if I have to convert 1000 files, it will take more than 5 or 6 hours. Is there a faster way to do this? - Ungainly
I added a pdfd.Close() at the end to unlock the file. - Forsooth
Thanks for this! So useful. For those who want to export to Excel, simply change newFile.doc to newFile.xlsx and "com.adobe.acrobat.doc" to "com.adobe.acrobat.xlsx". - Zeller
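
The comment above points out that the file extension and the conversion ID have to change together. A minimal Python sketch of keeping that pairing in one place, assuming only the conversion IDs mentioned in this thread; `CONVERSION_IDS` and `conversion_id_for` are hypothetical helper names, not part of the Acrobat SDK:

```python
import os

# Conversion IDs mentioned in this thread, keyed by output file extension.
# (Assumption: other formats would need their own IDs from the Acrobat docs.)
CONVERSION_IDS = {
    ".doc": "com.adobe.acrobat.doc",
    ".docx": "com.adobe.acrobat.docx",
    ".xlsx": "com.adobe.acrobat.xlsx",
    ".xml": "com.adobe.acrobat.xml",
}


def conversion_id_for(output_path):
    """Return the saveAs conversion ID that matches output_path's extension."""
    ext = os.path.splitext(output_path)[1].lower()
    try:
        return CONVERSION_IDS[ext]
    except KeyError:
        raise ValueError("no known conversion ID for extension %r" % ext)
```

The result could then be passed as the second saveAs argument, so that renaming the output file is the only change needed to switch formats.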
3

I did something very similar using WinPython x64 2.7.6.3 and Acrobat X Pro and used the JSObject interface to convert PDFs to DOCX. Essentially the same solution as jle's.

The following should be a complete piece of code that converts a set of PDFs to DOCX:

# gets all files under ROOT_INPUT_PATH with FILE_EXTENSION and tries to extract text from them into ROOT_OUTPUT_PATH with same filename as the input file but with INPUT_FILE_EXTENSION replaced by OUTPUT_FILE_EXTENSION
from win32com.client import Dispatch
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

import winerror

# try importing scandir and, if found, use it, as its walk can be considerably faster than the stock os.walk
try:
    from scandir import walk
except ImportError:
    from os import walk

import fnmatch

import sys
import os

ROOT_INPUT_PATH = None
ROOT_OUTPUT_PATH = None
INPUT_FILE_EXTENSION = "*.pdf"
OUTPUT_FILE_EXTENSION = ".docx"

def acrobat_extract_text(f_path, f_path_out, f_basename, f_ext):
    avDoc = Dispatch("AcroExch.AVDoc") # Connect to Adobe Acrobat

    # Open the input file (as a pdf)
    ret = avDoc.Open(f_path, f_path)
    assert(ret) # FIXME: Documentation says "-1 if the file was opened successfully, 0 otherwise", but this is a bool in practice?

    pdDoc = avDoc.GetPDDoc()

    dst = os.path.join(f_path_out, ''.join((f_basename, f_ext)))

    # Adobe documentation says "For that reason, you must rely on the documentation to know what functionality is available through the JSObject interface. For details, see the JavaScript for Acrobat API Reference"
    jsObject = pdDoc.GetJSObject()

    # Here you can save as many other types by using, for instance: "com.adobe.acrobat.xml"
    jsObject.SaveAs(dst, "com.adobe.acrobat.docx") # NOTE: If you want to save the file as a .doc, use "com.adobe.acrobat.doc"

    pdDoc.Close()
    avDoc.Close(True) # We want this to close Acrobat, as otherwise Acrobat is going to refuse to process any further files once a certain threshold of open files is reached (for example 50 PDFs)
    del pdDoc

if __name__ == "__main__":
    assert(5 == len(sys.argv)), sys.argv # <script name>, <script_file_input_path>, <script_file_input_extension>, <script_file_output_path>, <script_file_output_extension>

    #$ python get.docx.from.multiple.pdf.py 'C:\input' '*.pdf' 'C:\output' '.docx' # NOTE: If you want to save the file as a .doc, use '.doc' instead of '.docx' here and ensure you use "com.adobe.acrobat.doc" in the jsObject.SaveAs call

    ROOT_INPUT_PATH = sys.argv[1]
    INPUT_FILE_EXTENSION = sys.argv[2]
    ROOT_OUTPUT_PATH = sys.argv[3]
    OUTPUT_FILE_EXTENSION = sys.argv[4]

    # tuples are of schema (path_to_file, filename)
    matching_files = ((os.path.join(_root, filename), os.path.splitext(filename)[0]) for _root, _dirs, _files in walk(ROOT_INPUT_PATH) for filename in fnmatch.filter(_files, INPUT_FILE_EXTENSION))

    # patch ERRORS_BAD_CONTEXT as per https://mail.python.org/pipermail/python-win32/2002-March/000265.html
    # (no "global" statement is needed at module level; the imported list is mutated in place)
    ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)

    for filename_with_path, filename_without_extension in matching_files:
        print("Processing '{}'".format(filename_without_extension))
        acrobat_extract_text(filename_with_path, ROOT_OUTPUT_PATH, filename_without_extension, OUTPUT_FILE_EXTENSION)
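
The file-discovery one-liner in the script above is fairly dense, so here is a sketch of the same step written out as a generator that can be run on its own, without Acrobat, to preview which output file each PDF would map to. `plan_conversions` is a hypothetical name introduced for illustration, and the stock os.walk is assumed rather than scandir:

```python
import fnmatch
import os


def plan_conversions(root_in, pattern, root_out, out_ext):
    """Yield (input_path, output_path) pairs without opening Acrobat."""
    for folder, _dirs, files in os.walk(root_in):
        for name in sorted(fnmatch.filter(files, pattern)):
            stem = os.path.splitext(name)[0]
            # Same flat layout as the script above: every output file
            # is placed directly in root_out, whatever its source folder.
            yield os.path.join(folder, name), os.path.join(root_out, stem + out_ext)
```

Because the output layout is flat, two PDFs with the same name in different subfolders map to the same output path; listing the pairs first makes such collisions visible before a long batch run.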
Naughty answered 28/10, 2014 at 4:4 Comment(2)
What would be the alternative to the dispatch module on a Mac? - Dulosis
Getting a (-2147221005, 'Invalid class string', None, None) error while using AvDoc = Dispatch("AcroExch.AVDoc") in Python. Any help? - Ruwenzori
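
On the last comment: -2147221005 is the COM error 0x800401F3 (CO_E_CLASSSTRING), which means the ProgID passed to Dispatch was not found in the registry. With "AcroExch.AVDoc" that usually suggests the full Acrobat product is not installed or not registered on that machine. A small sketch for checking this before calling Dispatch; `progid_registered` is a hypothetical helper, and it only looks for the key under HKEY_CLASSES_ROOT, which is an assumption about what a sufficient check is:

```python
import sys


def progid_registered(progid):
    """Return True if progid has a key under HKEY_CLASSES_ROOT (Windows only)."""
    if not sys.platform.startswith("win"):
        return False  # COM ProgIDs only exist on Windows
    try:
        import winreg             # Python 3
    except ImportError:
        import _winreg as winreg  # Python 2
    try:
        winreg.CloseKey(winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, progid))
        return True
    except OSError:  # raised when the key does not exist
        return False
```

Calling `progid_registered("AcroExch.AVDoc")` before the conversion loop gives a clearer failure than the raw COM error. It also returns False on a Mac, since win32com's Dispatch is a Windows COM mechanism.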
-2

Adobe doesn't support PDF to Word conversion unless you're using their Acrobat PDF client. That means you can't do it with their SDK or from the command line. You can only do it manually.

Impugn answered 7/11, 2012 at 9:50 Comment(1)
The solutions posted by jle and by me both show ways to achieve this programmatically. If you have Acrobat X Pro, you can try my script, as it should work out of the box once you have installed WinPython x64 2.7.6.3 (which is free). - Naughty
