I am using pandoc (via pypandoc) to convert docx files into markdown, on a non-Windows machine. Those files can contain images, but also other embedded objects.
pandoc is actually able to translate embedded PowerPoint presentations (into EMF files), but it cannot process Excel objects (it ignores them). The aim is to use Python to convert those embedded Excel objects into images, so that they can be displayed as part of e.g. an HTML output.
It would be OK to use components written in another language (e.g. bash scripts) as long as they can be wrapped with a Python API.
I realize this may be a tall order on a non-Windows platform (i.e. without the Microsoft libraries, e.g. win32com). Has anyone had any success with this, or any educated guess on how to proceed?
What is the cell area to be displayed?
The core issue with any embedded object is determining which part of it should be displayed.
There must be a way to determine which cells are to be displayed, since that information is available to Word when it reads the contents of the docx file.
This is the crux of the question. If the practical algorithm cannot take this into account, the answer will still be accepted, as long as it provides a way to extract that information.
Some clues might be found on this page.
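One place worth checking is the workbook part of the embedded file itself. The sketch below assumes the embedded workbook is available as an .xlsx file, and that it records the displayed range in an oleSize element and the visible sheet in the activeTab attribute of workbookView, both inside xl/workbook.xml. Either may be absent, in which case the function falls back to sheet 0 and no range. The helper name displayed_area is hypothetical, and the namespace is the one used by transitional OOXML files.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of transitional SpreadsheetML (assumption: the file is not "strict" OOXML).
MAIN_NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def displayed_area(xlsx_path):
    """Return (active_sheet_index, cell_range) read from xl/workbook.xml.

    cell_range is a string such as "A1:C4", or None if no oleSize element exists.
    """
    with zipfile.ZipFile(xlsx_path) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))

    ole_size = root.find("m:oleSize", MAIN_NS)
    view = root.find("m:bookViews/m:workbookView", MAIN_NS)

    cell_range = ole_size.get("ref") if ole_size is not None else None
    active_tab = int(view.get("activeTab", "0")) if view is not None else 0
    return active_tab, cell_range
```

Whether this range always matches what Word displays is not something the sketch verifies; it only shows where the information might be read from.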
Notes
Following a suggestion to explore the structure of the file itself, here is what I have observed: if you create a simple docx document (Mydoc.docx) with an embedded Excel file, you can examine its content by making a copy of the docx file (renaming it with a .zip extension) and unzipping it.
- the text itself is contained in Mydoc/word/document.xml
- the Excel file is contained in Mydoc/word/embeddings/Excel_Sheet_1.xlsx (or something of the sort).
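The manual rename-and-unzip step can also be done directly from Python, since a docx file is a zip archive. A minimal sketch, assuming the embedded files sit under word/embeddings/ as observed above; extract_embeddings is a hypothetical helper name, and only the standard-library zipfile module is used:

```python
import zipfile
from pathlib import Path


def extract_embeddings(docx_path, out_dir):
    """Copy every file stored under word/embeddings/ into out_dir.

    Returns the list of extracted paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries; keep only actual embedded files.
            if name.startswith("word/embeddings/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                extracted.append(target)
    return extracted
```

For example, extract_embeddings("Mydoc.docx", "embedded") would write Excel_Sheet_1.xlsx (or whatever the embedded file is called) into the embedded directory.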
If that is the route to go, then the problem is split in two:
- Convert Excel_Sheet_1.xlsx into an image (and how do you know which sheet and cell area are to be part of the image?).
- Tweak document.xml so that it says "point to the image" instead of pointing to an embedded file.
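For the first step, one candidate on a Unix platform is LibreOffice in headless mode, wrapped with subprocess. This is only a sketch under assumptions: it presumes LibreOffice is installed and its soffice binary is on the PATH, and the helper names build_convert_command and xlsx_to_image are hypothetical. It does not restrict the output to the cell area that Word displays.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(xlsx_path, out_dir, fmt="png", soffice="soffice"):
    """Build the LibreOffice command line for a headless conversion."""
    return [
        soffice, "--headless",
        "--convert-to", fmt,
        "--outdir", str(out_dir),
        str(xlsx_path),
    ]


def xlsx_to_image(xlsx_path, out_dir, fmt="png"):
    """Convert a workbook to an image file and return the expected output path."""
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    subprocess.run(build_convert_command(xlsx_path, out_dir, fmt), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(xlsx_path).stem + "." + fmt)
```

The rendered image may cover a whole page of the first sheet rather than the embedded range, so cropping to the right cells would still have to be solved separately.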
OOXML is rather complicated, especially when you try to do something as "elementary" as what I am trying to do... Has anyone gone there from a Unix platform and come back with something sensible?