People who send content to my website use Word, so I get a lot of Word documents to convert to HTML. I want to preserve only the basic formatting (headings, lists and emphasis), with no images.
When I convert them with LibreOffice's "Save as HTML", the resulting files are huge. For example, a 112K doc file becomes 450K of HTML, most of it useless FONT and SPAN tags (for some reason, every single punctuation mark is enclosed in its own span!).
I tried a script based on tidy and sed, and it reduced the size to about 150K, but there are still many useless SPANs.
I tried to copy and paste into KompoZer, an HTML editor, and then save as HTML, but it converted all my non-Latin (Hebrew) letters to entities such as "&#1456;", which increased the size to 750K!
I tried docvert, but found out that it requires a Python library that requires other libraries, and so on, which seems like an endless chain of dependencies...
Is there a simple way to create clean HTML from Office documents?
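To make the goal concrete, here is a minimal sketch of the kind of filtering described above, using only Python's standard `html.parser` module. It is not the tidy/sed script mentioned earlier. The whitelist `KEEP`, the class `Cleaner` and the function `clean_html` are hypothetical names chosen for illustration, and the sketch assumes the input is reasonably well-formed HTML such as a "Save as HTML" export.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: headings, lists, emphasis and paragraphs only.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6",
        "p", "ul", "ol", "li", "em", "strong", "b", "i", "br"}
# Elements whose text content is dropped along with the tags.
SKIP_CONTENT = {"style", "script", "title"}


class Cleaner(HTMLParser):
    """Re-emit whitelisted tags without attributes; unwrap everything else."""

    def __init__(self):
        # convert_charrefs=True turns numeric entities back into characters,
        # so Hebrew text is written as letters rather than as entities.
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and tag != "br" and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Escape only &, < and > so the output stays valid HTML.
            self.out.append(escape(data, quote=False))


def clean_html(source):
    parser = Cleaner()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Tags such as FONT and SPAN are not in the whitelist, so they are removed while their text is kept. Images disappear as well, because `img` is not whitelisted.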