From MS Word or Libre Office to clean HTML

People who send content to my website use Word, so I get a lot of Word documents to convert to HTML. I want to keep only the basic formatting - headings, lists and emphasis - no images.

When I convert them with LibreOffice's "Save as HTML", the resulting files are huge: for example, a 112K .doc file becomes 450K of HTML, most of it useless FONT and SPAN tags (for some reason, every single punctuation mark is enclosed in its own span!).

I tried this script based on tidy and sed - http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708 - and it reduced the size to about 150K, but there are still many useless SPANs.

I tried copying and pasting into KompoZer (an HTML editor) and then saving as HTML, but it converted all my non-Latin (Hebrew) letters to entities such as "ְ", which increased the size to 750K!

I tried docvert (https://github.com/holloway/docvert/issues/6), but found that it requires a Python library that requires other libraries, and so on, which seems like an endless chain of dependencies...

Is there a simple way to create clean HTML from Office documents?

Joellenjoelly asked 24/1, 2013 at 7:26 Comment(1)
This is probably a duplicate: #68464 – Joellenjoelly

I was using http://word2cleanhtml.com/ until I realised that MS Word itself gives you the option to save a document as HTML.

On selecting this, the .docx file becomes .html, and it is the best HTML version of a Word doc that I've seen. It's certainly better than all these online tools.
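
If new documents arrive regularly, the same menu option can also be scripted. Here is a minimal sketch, assuming a Windows machine with Word installed and PHP's COM extension enabled; the paths are placeholders, and 10 is Word's wdFormatFilteredHTML constant ("Web Page, Filtered"), which produces noticeably leaner markup than the plain "Web Page" format:

    <?php
    // Sketch only: assumes Windows, an installed copy of Word, and PHP's
    // COM extension (com_dotnet). The file paths are placeholders.
    $src = "C:\\docs\\input.docx";
    $dst = "C:\\docs\\output.html";

    $word = new COM("Word.Application");
    $word->Visible = false;

    $doc = $word->Documents->Open($src);
    // 10 = wdFormatFilteredHTML ("Web Page, Filtered"), which drops most
    // Office-specific markup; 8 = wdFormatHTML keeps all of it.
    $doc->SaveAs($dst, 10);
    $doc->Close(false);
    $word->Quit();
    ?>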

Fulbert answered 28/9, 2013 at 7:17 Comment(0)

I realize this question is old, but the other answers never really answered it. If you are not averse to writing some PHP code, the CubicleSoft Ultimate Web Scraper Toolkit has a class called TagFilter:

https://github.com/cubiclesoft/ultimate-web-scraper/blob/master/support/tag_filter.php

You pass in two things: an array of options and the data to parse as HTML.

For cleaning up broken HTML, the default options from TagFilter::GetHTMLOptions() will act as a good starting point. Those options form the basis of valid HTML content and, doing nothing else, will clean up any input data into something that another tool like Simple HTML DOM can correctly parse in a DOM model.

However, the other way to use the class is to modify the default options and add a 'callback' option to the options array. For every tag in the HTML, the specified callback function will be called. The callback is expected to return what to do with each tag, which is where the real power of TagFilter comes into play. You can keep any given tag and some or all of its attributes (or modify them), get rid of the tag but keep the interior content, keep the tag but get rid of the content, modify the content (for closing tags), or get rid of both the tag and the interior content. This approach allows extremely fine-grained control over the most convoluted HTML out there and processes the input in a single pass. See the same repository's test suite for example usage of TagFilter.
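
A rough sketch of what that could look like for the clean-up in this question - keep headings, lists and emphasis, drop every other tag but keep its text. The 'callback' option and TagFilter::GetHTMLOptions() come from the description above; the callback signature, the return keys and the TagFilter::Run() call are assumptions here, so verify them against the repository's test suite before relying on this:

    <?php
    require_once "support/tag_filter.php";

    // Start from the default HTML parsing options described above.
    $htmloptions = TagFilter::GetHTMLOptions();

    // Assumed callback signature and return keys - check the repository's
    // test suite for the authoritative usage.
    $htmloptions["callback"] = function ($stack, &$content, $open, $tagname, &$attrs, $options) {
        $keep = array("h1", "h2", "h3", "h4", "p", "ul", "ol", "li", "em", "strong", "b", "i");

        // Normalize in case closing tags arrive with a leading slash.
        $name = ltrim($tagname, "/");

        if (in_array($name, $keep))
        {
            // Keep the tag but drop Word's style/class/lang attributes.
            $attrs = array();

            return array("keep_tag" => true, "keep_interior" => true);
        }

        // Drop the tag itself (e.g. span, font) but keep the text inside it.
        return array("keep_tag" => false, "keep_interior" => true);
    };

    $cleanhtml = TagFilter::Run(file_get_contents("word_export.html"), $htmloptions);
    file_put_contents("clean.html", $cleanhtml);
    ?>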

The only downside is that the callback has to keep track of where it is between calls, whereas something like Simple HTML DOM selects things based on a DOM-like model. But that's only a drawback if the document being processed has things like ids and classes... most Word/LibreOffice HTML content does not, which makes it a giant blob of unrecognizable/unparseable HTML as far as DOM processing tools go.

Rozalie answered 11/4, 2015 at 21:38 Comment(0)

In your situation, you may need to go line by line to convert the major parts of your Word doc, then go back and clean up any additional tags. If you don't mind this approach, then consider this solution...

  1. After saving your Word doc as a web page, open that same web page in Notepad++.
  2. Then use the Replace feature for that document.
  3. In the "Find what" box, type in <[^>]+>
  4. For "Search Mode" in the same window, select "Regular expression".

From that point, all you have to do is click Find Next until you reach a tag you want to remove, then click Replace for each tag that needs to go. Make sure the "Replace with:" box is empty.

I don't know if there is a more convenient way, but this one is 100% free and keeps HTML tag clean-up simple via Notepad++. A scripted version of the same replacement is sketched below.
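
If the same clean-up has to be repeated for many documents, the regular-expression replacement above can be scripted instead of run by hand in Notepad++. A minimal sketch in PHP; the tag and attribute lists are assumptions, so adjust them to whatever you find yourself deleting manually:

    <?php
    // Minimal sketch: strip the tags usually deleted by hand in Notepad++
    // while keeping their inner text, then drop the attributes Word adds.
    // The tag/attribute lists below are guesses - tune them per document set.
    $html = file_get_contents("word_export.html");

    // Remove opening and closing span/font/o:p tags, keeping their contents.
    $html = preg_replace('#</?(span|font|o:p)\b[^>]*>#i', '', $html);

    // Strip style/class/lang/dir attributes from the remaining tags.
    $html = preg_replace('#\s+(style|class|lang|dir)="[^"]*"#i', '', $html);

    file_put_contents("clean.html", $html);
    ?>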

As far as converting inline styles to external CSS goes (which I recommend as the second step, after removing the unnecessary tags), try this app: http://inlinecssextractor.com/home.html

Good luck

Zela answered 24/1, 2013 at 20:57 Comment(1)
Using Notepad++ could be a solution for a single document; however, since I have new documents coming in each week, I don't want to repeat the same replacements again and again for each document... – Joellenjoelly

I found these two cleaners quite effective. First, I ran the Word-filtered HTML through

http://textism.com/wordcleaner/

Then I used some regular expressions to convert the bulleted paragraph items to list items (li). Then I ran the result through

http://infohound.net/tidy/

to wrap the list items in unordered list (ul) tags and clean up other errors. I was very pleased with the result, which went from 1.5M to 225K.
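
The bullet-to-list-item regex can look roughly like the sketch below. It assumes the cleaned Word HTML still marks bullet items as paragraphs whose text starts with the "·" bullet character (common, but not guaranteed), and it leaves wrapping the resulting li elements in ul to tidy, as described above:

    <?php
    // Sketch: turn Word's bullet paragraphs into list items; tidy then wraps
    // consecutive <li> elements in <ul>. Assumes bullets survive as
    // <p ...>· text</p> - adjust the pattern to your actual input.
    $html = file_get_contents("wordcleaner_output.html");

    $html = preg_replace_callback(
        '#<p\b[^>]*>\s*(?:·|&middot;|&#183;)(?:&nbsp;|\s)*(.*?)</p>#is',
        function ($m) { return "<li>" . trim($m[1]) . "</li>"; },
        $html
    );

    file_put_contents("with_list_items.html", $html);
    ?>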

Focus answered 19/4, 2013 at 22:2 Comment(0)

Here is a set of PowerShell scripts that will clean Word-filtered HTML and correctly tag superscripts/subscripts about 95% of the time. (No, you can't get better than that; Word is made for print.)

https://github.com/suzumakes/replaceit

Instructions are in the README, and if you happen to encounter any additional characters that need to be caught, or come up with any tweaks or improvements, I'd be happy to see your pull request.

Stalinsk answered 10/7, 2015 at 16:15 Comment(0)

ophir.php does a pretty nice job of producing clean HTML from .odt files. You need a PHP hosting environment to run it.

Surfactant answered 6/8, 2015 at 19:46 Comment(0)
