Converting embedded Excel objects from a docx file into images

I am using pandoc (via pypandoc) to convert docx files into markdown, on a non-Windows machine. Those files can contain images, but also other embedded objects.

pandoc is actually able to translate embedded Powerpoint presentations (into EMF files), but it is not able to process Excel objects (it ignores them). The aim would be to use python to convert those embedded Excel objects into images, so that they can be displayed as part of e.g. an HTML output.

It would be OK to use components written in another language (e.g. bash scripts) as long as they can be wrapped with a python API.

I realize this may be a tall order on a non-Windows platform (i.e. without the Microsoft libraries e.g. win32com). Has anyone had any success with this, or any educated guess on how to proceed?

What is the cell area to be displayed?

The core issue with any embedded object is determining which part of it should be displayed.

There must be a way to determine which cells are to be displayed, since that information is available to Word when it reads the contents of the docx file.

This is the crux of the question. If the practical algorithm cannot take this into account, the answer will still be accepted, as long as it provides a way to extract that information.

Some clues might be found on this page.
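
One place that may be worth checking is the workbook part of the embedded xlsx itself. SpreadsheetML defines an optional oleSize element in xl/workbook.xml whose ref attribute holds the cell range shown when the workbook is embedded as an OLE object. Whether a given producer actually writes it is not something the question confirms, so the following is only a minimal sketch, assuming the transitional namespace; the function name read_ole_size is made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace (transitional flavour; an assumption).
SSML_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def read_ole_size(xlsx_bytes):
    """Return the 'ref' of <oleSize> (e.g. 'A1:C5'), or None if it is absent.

    Hypothetical helper: it only reads xl/workbook.xml from the xlsx archive.
    """
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as xlsx:
        root = ET.fromstring(xlsx.read("xl/workbook.xml"))
    ole_size = root.find(f"{{{SSML_NS}}}oleSize")
    return None if ole_size is None else ole_size.get("ref")
```

If the element is missing, the function returns None and the displayed area has to be inferred some other way.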

Notes

Following a suggestion to explore the structure of the file itself, here is what I have observed: if you create a simple docx document (Mydoc.docx) with an embedded Excel file, you can examine its content by making a copy of the docx file (renaming it with a .zip extension) and unzipping it.

  • the text itself is contained in Mydoc/word/document.xml
  • the Excel file is contained in Mydoc/word/embeddings/Excel_Sheet_1.xlsx (or something of the sort).

If that is the route to go, then the problem is split in two:

  1. Convert Excel_Sheet_1.xlsx into an image (and how do you know which sheet and cell area should be part of the image?).
  2. Tweak document.xml so that it says "point to the image" instead of pointing to an embedded file.
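
The copy, rename and unzip steps described above do not need to happen on disk: Python's zipfile module can read the docx directly. A minimal sketch, assuming the embedded workbooks sit under word/embeddings/ with an .xlsx extension as observed above (the function name extract_embedded_workbooks is illustrative):

```python
import zipfile
from pathlib import PurePosixPath

# Folder inside the docx archive where the embedded files were observed.
EMBEDDINGS_PREFIX = "word/embeddings/"


def extract_embedded_workbooks(docx_file):
    """Return {file name: raw bytes} for each .xlsx under word/embeddings/.

    docx_file may be a path or a binary file-like object.
    """
    workbooks = {}
    with zipfile.ZipFile(docx_file) as docx:
        for name in docx.namelist():
            if name.startswith(EMBEDDINGS_PREFIX) and name.lower().endswith(".xlsx"):
                workbooks[PurePosixPath(name).name] = docx.read(name)
    return workbooks
```

Calling extract_embedded_workbooks("Mydoc.docx") would then give the bytes of each embedded workbook, ready to be written to a temporary file for conversion.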

OOXML is rather complicated, especially when you try to do something as "elementary" as what I am trying to do... Has anyone gone there from a Unix platform and come back with something sensible?

Ddene answered 15/12, 2017 at 9:52 Comment(6)
This might get you started: #43154939. Maybe use it in a pandoc Python filter: see pandoc.org/scripting.html - Marbles
Alas, that might work on Windows, but unfortunately the win32com library is not available on other platforms. - Ddene
I'm pretty sure you could use the general Python file API instead. The docx is simply a zip file which you unzip and then work with the xml files inside... - Marbles
I guess that would be the starting point. - Ddene
@Ddene - Do you have any feedback on the help you obtained? - Chlorinate
It certainly helped to clarify the problem, to lay out what we already knew and to circumscribe it. But we haven't found the "breakthrough" yet. - Ddene

As you mention in the OP, I would go the "disassembly-assembly" way of mydoc.docx, i.e.:

  1. Extract the Excel worksheet from mydoc.docx. I will assume it is an embedded worksheet; the approach can surely be adapted to the case where the worksheet is a linked external xlsx. In my case the worksheet is located at word\embeddings\Microsoft_Excel_Worksheet1.xlsx inside the docx structure. As you said, one way is to copy mydoc.docx to mydoc.zip and extract Microsoft_Excel_Worksheet1.xlsx from the mydoc.zip structure.

  2. Convert Microsoft_Excel_Worksheet1.xlsx into an image. This does not seem to be a simple task under Linux, due to the lack of the Windows APIs. For instance, excel2img requires pywin32. A workaround is to use unoconv to convert the xlsx into a suitable format. The options here are numerous. Note that:

    1. You may need to run it as an external command, from within python. This is not an issue, but your python script should determine the host OS, and then decide whether to use unoconv (for Linux) or a more "standard" solution (for Windows, outside the scope of the OP). Note that unoconv is written in python, so perhaps you can integrate it somehow in your script.

    2. Bugs have been reported for unoconv when exporting to png. You might need to carry out the export to the target format in two steps: first to pdf, then from pdf to png/jpg, e.g. with convert. This may vary across versions. In my version, pdf is the only graphic format to which spreadsheets can be exported, so the two-step conversion becomes mandatory. Note that you will probably have to use the -crop option of convert, since the pdf export generates whole pages.

    3. You will have to install unoconv in your system.

    4. You can choose the page range to be exported, as in
      unoconv -f pdf -d spreadsheet -e PageRange=1-1 Microsoft_Excel_Worksheet1.xlsx
      As far as I have tried, the whole range of non-empty cells is exported, and there is no way to export only part of it with unoconv. A possible workaround is to use openpyxl to hide (fold) the cell ranges that you do not want to show up, and then export.

This is the essence of the question ("The aim would be to use python to convert those embedded Excel objects into images.")
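
The two-step conversion described in step 2 could be wrapped from Python roughly as follows. This is a sketch under assumptions, not a tested pipeline: it reuses the unoconv call quoted above, assumes that unoconv writes the pdf next to the input file, and uses ImageMagick's -trim (cut away the uniform page border) in place of the -crop mentioned above, because -crop needs an explicit geometry. The function names and the density value are illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def unoconv_command(xlsx_path, page_range="1-1"):
    """Build the unoconv call quoted in the answer (xlsx -> pdf)."""
    return ["unoconv", "-f", "pdf", "-d", "spreadsheet",
            "-e", f"PageRange={page_range}", str(xlsx_path)]


def convert_command(pdf_path, png_path, density=150):
    """Build an ImageMagick call (pdf -> png); -trim removes the blank margins."""
    return ["convert", "-density", str(density), str(pdf_path),
            "-trim", str(png_path)]


def xlsx_to_png(xlsx_path):
    """Run both steps and return the PNG path; raise if a tool is missing."""
    for tool in ("unoconv", "convert"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")
    xlsx_path = Path(xlsx_path)
    # Assumption: unoconv's default output is the input name with a .pdf suffix.
    pdf_path = xlsx_path.with_suffix(".pdf")
    png_path = xlsx_path.with_suffix(".png")
    subprocess.run(unoconv_command(xlsx_path), check=True)
    subprocess.run(convert_command(pdf_path, png_path), check=True)
    return png_path
```

Checking for the tools with shutil.which, rather than testing the OS name, keeps the script usable on any platform where both commands happen to be installed.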

  3. Replace Microsoft_Excel_Worksheet1.xlsx with the created image.
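
Before anything can be replaced, one has to know which relationship in the docx points at the embedded workbook, since document.xml refers to it by relationship id rather than by file name. A minimal sketch, assuming the standard package layout where those ids are listed in word/_rels/document.xml.rels (the function name embedded_relationships is illustrative, and the actual rewriting of document.xml is not shown):

```python
import zipfile
import xml.etree.ElementTree as ET

RELS_PATH = "word/_rels/document.xml.rels"
# Namespace of the Open Packaging Conventions relationship parts.
PKG_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def embedded_relationships(docx_file):
    """Return {relationship id: target} for targets under embeddings/."""
    with zipfile.ZipFile(docx_file) as docx:
        root = ET.fromstring(docx.read(RELS_PATH))
    found = {}
    for rel in root.findall(f"{{{PKG_RELS_NS}}}Relationship"):
        target = rel.get("Target", "")
        if target.startswith("embeddings/"):
            found[rel.get("Id")] = target
    return found
```

The returned ids are the values to look for in document.xml when deciding which object element should be pointed at the generated image instead.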

Note: This is a list of Python modules that can perform various operations on Excel worksheets.

pyExcelerator (apparently not maintained anymore)

xlwt (a fork of pyExcelerator)

openpyxl

Chlorinate answered 16/1, 2018 at 9:3 Comment(10)
Thanks, that's indeed a possible path (unoconv is a possibility). One crucial part of the answer is however missing: there must be a way to determine what cells are to be displayed, since that information is available to Word when it reads the contents of the docx file. I would be happy to hand out the bounty (and kudos) to whoever can solve this. - Ddene
@Ddene Could you add a DOCX example of your comment above? - Multivibrator
@SpenceWetjen Thanks a lot. I will reduce the requirement: I will accept your answer even if it cannot exploit this cell area information, as long as it provides a way to extract it. - Ddene
@Ddene - You may now have a complete solution for the exporting issue (complex, but I guess it could be tamed). See updated answer. - Chlorinate
@SpenceWetjen Answering that there is no answer could be an answer in some cases (typically if you proved that the answer is mechanically impossible, or that the question is not decidable or the question was not useful in the first place). Do you really feel we are in those cases? - Ddene
@Ddene - No, on the contrary. As said, there is now a solution that accomplishes what you mean, in several steps. PS: You are addressing comments to SpenceWetjen instead of me (perhaps using "s"+tab completion). - Chlorinate
Sorry: attributing a bounty is not mandatory. Out of fairness, I do not attribute the bounty (but your commendable efforts have at least been rewarded by 2 approvals, one of which is mine). - Ddene
@Ddene - Of course attributing a bounty is not mandatory, if the answer does not address the question in point. Above and beyond rewards, I simply wonder, for this case and future readers... I meant to rescue the essence ("The aim would be to use python to convert those embedded Excel objects into images.", further reinforced by "as long as it provides a way to extract it") and provide a solution, which it seems was not accomplished. As feedback from your side, did the solution not work for the request? What was the failure point, the error message, etc.? - Chlorinate
Let us continue this discussion in chat. - Ddene
@Ddene - unoconv automatically determines the whole range to be exported in a given worksheet. If that is not what you are aiming for, I guess it would be good to have a specific example of a docx (as mentioned by Spence Wetjen) with its corresponding embedded workbook, what you see on your screen, and what you mean to get in your image. So one can see what is missing. - Chlorinate
