Python: Access embedded OLE from Office/Excel document without clipboard
Asked Answered
F

4

14

I want to add and extract files from an Office/Excel document using Python. So far adding things is easy but for extracting I haven't found a clean solution.

To make clear what I've got and what not I've written the small example test.py below and explain further.

test.py

import win32com.client as win32
import os 
from tkinter import messagebox
import win32clipboard

# (0) Setup
dir_path = os.path.dirname(os.path.realpath(__file__))
print(dir_path)
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(dir_path + "\\" + "test_excel.xlsx")
ws = wb.Worksheets.Item(1)
objs = ws.OLEObjects()

# (1) Embed file
f = dir_path + "\\" + "test_txt.txt"
name = "test_txt_ole.txt"
objs.Add( Filename=f, IconLabel=name )

# (2) Access embedded file
obj = objs.Item(1) # Get single OLE from OLE list
obj.Copy()
win32clipboard.OpenClipboard()
data = win32clipboard.GetClipboardData(0xC004) # Binary access
win32clipboard.EmptyClipboard()
win32clipboard.CloseClipboard()
messagebox.showinfo(title="test_txt_ole.txt", message=str(data))

# (3) Press don't save here to keep 
# wb.Close() # Will close excel document and leave excel opened.
excel.Application.Quit() # Will close excel with all opened documents

For preparation (step 0) it opens a given excel document with one worksheet that was create before by using new document button in excel.

In step (1) it uses API to embed a given text file to the excel document. The text file was created before with content "TEST123" using a text editor.

Afterwards in step (2) it tries to read back content from embedded OLE using clipboard and opens a message box that shows the content from OLE in clipboard.

Finally (3) the program closes the opened document. To keep an unchanged setup press no here.

The big disadvantage of this solution is the use of clipboard which smashes any user content in clipboard which is bad style in a productive environment. Further it uses an undocumented option for clipboard.

A better solution would be to safe OLE or OLE embedded file to a python data container or to a file of my choice. In my example I've used a TXT file to easily identify file data. Finally I'll use ZIP for an all-in-one solution but a TXT file solution would be sufficient for base64 data.

Source of 0xC004 = 49156: https://danny.fyi/embedding-and-accessing-a-file-in-excel-with-vba-and-ole-objects-4d4e7863cfff

This VBA example look interesting but I have no clue about VBA: Saving embedded OLE Object (Excel workbook) to file in Excel 2010

Forgiven answered 1/4, 2017 at 8:14 Comment(8)
So, I am finding the question confusing. It looks like you want to extract something from an excel file, but I am finding what the something is to be a bit vague.Siegler
I'm not familiar with python, but in fact, you're already doing a lot of "VBA". Workbooks.Open, Worksheets.Item, etc. are kindof VBA commands (technically they are IDispatch calls). Why don't you just try to call something like oEmbFile.Object.SaveAs fName in the VBA example directly from your python codeJackson
@Simon: Pyhton like VBA use the COM interface. I can use the following code line to look at the object API but there is no SaveAs only things like Activate and Copy. messagebox.showinfo(title="packBootstrap", message="Item obj:\n"+ str(type(obj)) + str(dir(obj)) ) messagebox.showinfo(title="packBootstrap", message="oleobj:\n"+ str(type(obj._oleobj_)) + str(dir(obj._oleobj_)) ) But there is nothing that brought me further also "obj._oleobj_" is a "PylDispatch" Object.Forgiven
@Stephen: My question is intentionally vague since I want to add binary data to an Excel without a deep protocol. The requirements are: Use Python. Use a COM like stragety (win32com). Be professional (=No "hacks"). I've also thought of base64 data in a Excel cell, but size is limited to 32k here. A chunked multi-cell solution is not professional since it requires a self made protocol.Forgiven
And what is the objects CLSID? print( obj.CLSID )Parke
print( str(obj.CLSID) ) Gives me: 000208A2-0000-0000-C000-000000000046Forgiven
Are you sure that your OLEObjects are Added via the method in your code (txt file -> OLEObjects.Add)? Do you insist on that (e.g. is it possible to open the txt files and insert them as Word Documents or new Worksheets Or new Shapes or as the Value of different Cells)?Parke
I asked the above because I found the following: If you insert txt files this way, then they will be included in a legacy format with a ProgID of 'Packager'. Retrieving the content involves very hard low-level COM programming (structured storage) and I guess it takes 100× times the effort required to write your current code. I would advise to redesign your solution, store your data in a different way. If you are not afraid of the low-level methods: social.msdn.microsoft.com/Forums/vstudio/en-US/…Parke
P
6

Well, I find Parfait's solution a bit hackish (in the bad sense) because

  • it assumes that Excel will save the embedding as a temporary file,
  • it assumes that the path of this temporary file is always the user's default temp path,
  • it assumes that you will have privileges to open files there,
  • it assumes that you use a naming convention to identify your objects (e.g. 'test_txt' is always found in the name, you can't insert an object 'account_data'),
  • it assumes that this convention is not disturbed by the operating system (e.g. it will not change it to '~test_tx(1)' to save character length),
  • it assumes that this convention is known and accepted by all other programs on the computer (no one else will uses names that contain 'test_txt').

So, I wrote an alternative solution. The essence of this is thef following:

  1. unzip the .xlsx file (or any other Office file in the new XML-based format, which is not password protected) to a temporary path.

  2. iterate through all .bin files inside the '/xxx/embeddings' ('xxx' = 'xl' or 'word' or 'ppt'), and create a dictionary that contains the .bin files' temporary paths as keys and the dictionaries returned from step 3 as values.

  3. extract information from the .bin file according to the (not very well documented) Ole Packager format, and return the information as a dictionary. (Retrieves the raw binary data as 'contents', not only from .txt but any file type, e.g. .png)

I'm still learning Python, so this is not perfect (no error checking, no performance optimization) but you can get the idea from it. I tested it on a few examples. Here is my code:

import tempfile
import os
import shutil
import zipfile
import glob
import pythoncom
import win32com.storagecon


def read_zipped_xml_bin_embeddings( path_zipped_xml ):
    temp_dir = tempfile.mkdtemp()

    zip_file = zipfile.ZipFile( path_zipped_xml )
    zip_file.extractall( temp_dir )
    zip_file.close()

    subdir = {
            '.xlsx': 'xl',
            '.xlsm': 'xl',
            '.xltx': 'xl',
            '.xltm': 'xl',
            '.docx': 'word',
            '.dotx': 'word',
            '.docm': 'word',
            '.dotm': 'word',
            '.pptx': 'ppt',
            '.pptm': 'ppt',
            '.potx': 'ppt',
            '.potm': 'ppt',
        }[ os.path.splitext( path_zipped_xml )[ 1 ] ]
    embeddings_dir = temp_dir + '\\' + subdir + '\\embeddings\\*.bin'

    result = {}
    for bin_file in list( glob.glob( embeddings_dir ) ):
        result[ bin_file ] = bin_embedding_to_dictionary( bin_file )

    shutil.rmtree( temp_dir )

    return result


def bin_embedding_to_dictionary( bin_file ):
    storage = pythoncom.StgOpenStorage( bin_file, None, win32com.storagecon.STGM_READ | win32com.storagecon.STGM_SHARE_EXCLUSIVE )
    for stastg in storage.EnumElements():
        if stastg[ 0 ] == '\1Ole10Native':
            stream = storage.OpenStream( stastg[ 0 ], None, win32com.storagecon.STGM_READ | win32com.storagecon.STGM_SHARE_EXCLUSIVE )

            result = {}
            result[ 'original_filename' ] = '' # original filename in ANSI starts at byte 7 and is null terminated
            stream.Seek( 6, 0 )
            while True:
                ch = stream.Read( 1 )
                if ch == '\0':
                    break
                result[ 'original_filename' ] += ch

            result[ 'original_filepath' ] = '' # original filepath in ANSI is next and is null terminated
            while True:
                ch = stream.Read( 1 )
                if ch == '\0':
                    break
                result[ 'original_filepath' ] += ch

            stream.Seek( 4, 1 ) # next 4 bytes is unused

            temporary_filepath_size = 0 # size of the temporary file path in ANSI in little endian
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 0
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 8
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 16
            temporary_filepath_size |= ord( stream.Read( 1 ) ) << 24

            result[ 'temporary_filepath' ] = stream.Read( temporary_filepath_size ) # temporary file path in ANSI

            result[ 'size' ] = 0 # size of the contents in little endian
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 0
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 8
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 16
            result[ 'size' ] |= ord( stream.Read( 1 ) ) << 24

            result[ 'contents' ] = stream.Read( result[ 'size' ] ) # contents

            return result

You can use it like this:

objects = read_zipped_xml_bin_embeddings( dir_path + '\\test_excel.xlsx' )
obj = objects.values()[ 0 ] # Get first element, or iterate somehow, the keys are the temporary paths
print( 'Original filename: ' + obj[ 'original_filename' ] )
print( 'Original filepath: ' + obj[ 'original_filepath' ] )
print( 'Original filepath: ' + obj[ 'temporary_filepath' ] )
print( 'Contents: ' + obj[ 'contents' ] )
Parke answered 14/4, 2017 at 10:19 Comment(0)
R
3

Consider using the Windows temp directory that will temporarily store the OLE Object's file source when embedded in workbook. No clipboard is used in this solution but physical files.

With this approach, you will need to retrieve the current user's name and iterate through all files of the temp directory: C:\Documents and Settings\{username}\Local Settings\Temp (standard Excel dump folder for Windows Vista/7/8/10). Also, a conditional like-name search with in is used that contains original file's basename as multiple versions with number suffixes (1), (2), (3),... may exist depending on how many times script runs. Try even a regex search here.

Finally, the below routine uses try...except...finally block to cleanly exist the Excel objects regardless of error but will output any exception message. Do note this is only a Windows solution using a text file.

import win32com.client as win32
import os, shutil
from tkinter import messagebox

# (0) Setup
dir_path = cd = os.path.dirname(os.path.abspath(__file__))
print(dir_path)

try:
    excel = win32.gencache.EnsureDispatch('Excel.Application')    
    wb = excel.Workbooks.Open(os.path.join(dir_path, "test_excel.xlsx"))
    ws = wb.Worksheets(1)
    objs = ws.OLEObjects()

    # (1) Embed file
    f = os.path.join(dir_path, "test_txt.txt")    
    name = "test_txt_ole.txt"
    objs.Add(Filename=f, IconLabel=name).Name = 'Test'

    # (2) Open file from temporary folder
    ole = ws.OLEObjects(1)        
    ole.Activate()

    # (3) Grab the recent like-named file
    user = os.environ.get('USERNAME')
    outfile = os.path.join(dir_path, "test_txt_out.txt")

    tempfolder = r"C:\Documents and Settings\{}\Local Settings\Temp".format(user)

    for subdir, dirs, files in os.walk(tempfolder):
        for file in sorted(files, reverse=True):
            if 'test_txt' in file:                
                tempfile = os.path.join(tempfolder, file)
                break

    shutil.copyfile(tempfile, outfile)

    # (4) Read text content
    with open(outfile, 'r') as f:        
        content = f.readlines()

    # (5) Output message with content
    messagebox.showinfo(title="test_txt_ole.txt", message="".join(content))

except Exception as e:
    print(e)

finally:
    wb.Close(True)      # CLOSES AND SAVES WORKBOOK
    excel.Quit          # QUITS EXCEL APP

    # RELEASES COM RESOURCES
    ws = None; wb = None; objs = None; ole = None; excel = None

Tkinter Messagebox

Message Output

Russi answered 8/4, 2017 at 15:5 Comment(4)
I'm short of time for today. So one short dumb question. Line "ole = ws.OLEObjects(1)" creates a temp file of the OLE in %TEMP%?Forgiven
Sorry, but as far as I understand, BREMI wants to retrieve the contents of the OLE Objects inside his Excel file. Your code gets the content of the original files, however, the OLE Objects in Excel may be different if the user modified them.Parke
No, ole=ws.OLEObject(1) assigns the first OLEObject in worksheet to that variable.Russi
@Parke - good point. I added an ole.Activate() which launches the embedded OLE object and in this act creates the temp file copied over to current directory to be read in python. To test, I modify the saved object with a Hamlet quote prior and commented out the Add in script to not add any new OLE object to workbook. My new text showed in Tkinter message: Test123 To be or not to be.Russi
F
2

I built a python module to do exactly this check it out over here. https://pypi.org/project/AttachmentsExtractor/ also the module can be run on any os.

after installing the library use the following code snippet Code:

 from AttachmentsExtractor import extractor
            
 abs_path_to_file='Please provide absolute path here '
 path_to_destination_directory = 'Please provide path of the directory where the extracted attachments should be stored'
 extractor.extract(abs_path_to_file,path_to_destination_directory) # returns true if one or more attachments are found else returns false.
Forde answered 15/10, 2020 at 11:2 Comment(0)
S
0

I've recently attempted to answer a similar question : can I extract embedded word documents from an excel file and save them to disk?

Adapting the answers on this page (and making use of the knowledge that an excel file is a zipped collection of, mainly XML, files) this can be easily performed:

  1. Create temporary file
  2. Extract all contents of excel file into the temporary folder.
  3. Find all embedded files
  4. Move the embedded files to a permanent folder of your choice.

Here is a snippet which does the above:

import zipfile
import tempfile
import os
import glob
import shutil
import sys

def extract_embedded_files(file_path,
                           save_path,
                           sub_dir='xl'):
    """
    Extracts embedded files from Excel documents, it takes advantage of
    excel being a zipped collection of files. It creates a temporary folder,
    extracts all the contents of the excel folder there and then moves the
    embedded files to the requested save_path.

    Parameters:
    ----------
    file_path : str, 
        The path to the excel file to extract embedded files from.
    
    save_path : str,
        Path to save the extracted files to.

    sub_dir : str,
        one of 'xl' (for excel), 'word' , or 'ppt'. 
    """

    # make a temporary directory 
    temp_dir = tempfile.mkdtemp()

    # extract contents excel file to temporary dir
    zip_file = zipfile.ZipFile(file_path)
    zip_file.extractall(temp_dir)
    zip_file.close()

    # find all embedded files and copy to save_path
    embeddings_dir = f'{temp_dir}/{sub_dir}/embeddings/'
    embedded_files = list(glob.glob(embeddings_dir+'*'))
    for file in embedded_files:
        shutil.copy(file, save_path)
Sidekick answered 9/4, 2021 at 13:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.