Convert Word doc to HTML programmatically in Java
R

11

24

I need to convert a Word document into HTML file(s) in Java. The function will take a Word document as input and produce one HTML file per page, i.e. if the Word document has 3 pages, then 3 HTML files will be generated, split at the page breaks.

I searched for open source/non-commercial APIs that can convert doc to HTML, but found nothing. If anybody has done this kind of job before, please help.

Thanks

Renaldo answered 22/10, 2008 at 19:36 Comment(2)
Here are some starting points for you. Good luck. On Microsoft's website, you can find documentation for the .doc format, and on the ECMA website, the .docx format. Microsoft has a category for Java on their OpenXML developer blog, including a post specifically about converting OpenXML to XHTML in Java.Bricebriceno
theserverside.com/news/thread.tss?thread_id=41942#216880 -- this has worked quite well for me earlierChauchaucer
G
3

We use tm-extractors (http://mvnrepository.com/artifact/org.textmining/tm-extractors), and fall back to the commercial Aspose (http://www.aspose.com/). Both have native Java APIs.

Gautious answered 22/10, 2008 at 20:43 Comment(0)
I
7

I recommend JODConverter. It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today.

JODConverter has a lot of documentation, scripts, and tutorials to help you out.

Illume answered 23/6, 2011 at 9:21 Comment(0)
J
4

I've used the following approach successfully in production systems where the new MS Word XML format isn't available:

Spawn a process that does something similar to:

http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html

You'd probably want to start OpenOffice once at startup of your program, and call the Python script as many times as you need to while your program runs (with some sort of checking to ensure the ooffice process is always there).
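The "checking" part can be as simple as probing the port the office process listens on. The following is a minimal sketch, assuming OpenOffice was started with a socket listener on a known port; the class name, method name, and the port 8100 used in `main` are illustrative assumptions, not something the approach above prescribes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether something accepts TCP connections on host:port. */
final class OfficeListenerCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing usable is listening.
            return false;
        }
    }

    public static void main(String[] args) {
        // Port 8100 is an assumption; use whatever port your office process was started with.
        System.out.println("office listener up: " + isListening("127.0.0.1", 8100, 500));
    }
}
```

A successful connect only shows that some process is listening on that port, not that it is a healthy office instance, so a failed conversion should still trigger a restart.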

The other option is to spawn the following sort of command every time you need to do the conversion:

ooffice -headless "macro://<path to ooffice vb macro to convert, with parameter pointing to file>"

I've used the macro approach multiple times and it works well (sorry, I don't have the macro code available).
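Spawning that command from Java could look roughly like the sketch below. It only wraps the command line shown above in a `ProcessBuilder`; the class and method names are hypothetical, and the macro URL is passed in by the caller because the macro itself is not available here.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical wrapper around the "ooffice -headless <macro url>" command. */
final class OfficeMacroRunner {

    /** Builds the argument list; macroUrl is supplied by the caller and is not invented here. */
    static List<String> buildCommand(String officeBinary, String macroUrl) {
        return List.of(officeBinary, "-headless", macroUrl);
    }

    /** Starts the process, waits for it to finish, and returns its exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the child's stdout/stderr to this process
                .start();
        return process.waitFor();
    }
}
```

Because `ProcessBuilder` passes each list element as a separate argument, the macro URL does not need the shell quoting used on the command line.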

While there are mechanisms for doing it via MS Word, they're not easy from Java, and do require other support programs to drive MS Word via OLE.

I've used abiword before too, which works well for many documents, but does get confused with more complex documents (ooffice seems to handle everything I've thrown at it). Abiword has a slightly easier command line interface for conversion than ooffice.

Jumbala answered 22/10, 2008 at 20:31 Comment(0)
S
2

If it's a docx, you could use docx4j (ASL v2). This uses XSLT to create the HTML.

However, it will give you a single HTML for the whole document.

If you wanted an HTML per page, you could do something with the lastRenderedPageBreak tag that Word puts into the docx (assuming you used Word to create it).
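As a rough illustration of the lastRenderedPageBreak idea, the sketch below uses only the JDK's DOM parser (not docx4j) on the contents of word/document.xml. The class name is hypothetical, and it makes a simplifying assumption: a paragraph that contains a `w:lastRenderedPageBreak` is moved to the next page as a whole, even if the break actually falls in the middle of the paragraph.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: groups paragraph texts by Word's last rendered page breaks. */
final class RenderedPageSplitter {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one list of paragraph texts per page. */
    static List<List<String>> splitByPage(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document dom = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));

        List<List<String>> pages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        pages.add(current);

        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            boolean startsNewPage =
                    p.getElementsByTagNameNS(W_NS, "lastRenderedPageBreak").getLength() > 0;
            if (startsNewPage && !current.isEmpty()) {
                current = new ArrayList<>();
                pages.add(current);
            }
            current.add(p.getTextContent());
        }
        return pages;
    }
}
```

Each resulting group could then be written to its own HTML file. Keep in mind that the markers only reflect the pagination from the last time Word rendered the document.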

Shoshana answered 27/3, 2009 at 1:56 Comment(0)
C
2

It is easier to do this with the newer MS Word docx format, as that format is XML-based. You can use an XSL stylesheet to transform the Word document's XML into HTML.
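To give an idea of what the XSL route involves, here is a minimal sketch using the JDK's built-in `javax.xml.transform` API. The stylesheet is a toy written for this example (it only turns each `w:p` into a `<p>` containing the paragraph's text) and the class name is hypothetical; a real stylesheet would need many more rules for styles, tables, images and so on.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical example: applies a toy XSLT to the XML from word/document.xml. */
final class WordXmlToHtml {

    // Toy stylesheet: every w:p becomes <p>, with the text taken from its w:t elements.
    static final String XSL =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
        + " exclude-result-prefixes='w'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/'>"
        + "<html><body><xsl:apply-templates select='//w:p'/></body></html>"
        + "</xsl:template>"
        + "<xsl:template match='w:p'>"
        + "<p><xsl:for-each select='.//w:t'><xsl:value-of select='.'/></xsl:for-each></p>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms the given document XML into an HTML string. */
    static String transform(String documentXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(documentXml)),
                new StreamResult(out));
        return out.toString();
    }
}
```

The input here is the XML part inside the docx package, so the docx (a zip archive) has to be opened first, for example with `java.util.zip.ZipFile`.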

If, however, your Word doc is in the older binary format, you can use the POI library (http://poi.apache.org/) to read it into Java objects, and from that point on you can convert it to HTML using an HTML library for Java such as:

http://www.dom4j.org/dom4j-1.4/apidocs/org/dom4j/io/HTMLWriter.html

Compromise answered 10/4, 2009 at 21:2 Comment(1)
As of version 3.5, Apache POI supports newer versions of Word.Entail
M
1

I see this thread turns up in external links and has the occasional post so I thought I'd post an update (hope no one minds). OpenOffice continues to evolve and release 3.2 improves the word import export filters again. OpenOffice and Java can run on many platforms so Java systems can make use of the OpenOffice UNO API directly to import/manipulate/export documents in many formats (including word and pdf) or use a library like JODReports or Docmosis to facilitate. Both have free/open options.

Marmot answered 11/6, 2010 at 14:9 Comment(0)
V
1

I tried this approach and it worked for me. It comes from http://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML

It only works with docx, and converts it to HTML including the images embedded in the Word document.

    // 1) Load DOCX into XWPFDocument
    InputStream in = new FileInputStream(new File("c:/document.docx"));
    XWPFDocument document = new XWPFDocument(in);

    // 2) Prepare XHTML options
    XHTMLOptions options = XHTMLOptions.create();

    // 3) Extract images into a folder named after the document
    String root = "target";
    File imageFolder = new File(root + "/images/document");
    options.setExtractor(new FileImageExtractor(imageFolder));

    // 4) URI resolver, so that image links in the HTML point at the extracted files
    options.URIResolver(new FileURIResolver(imageFolder));

    // 5) Convert, write the HTML file, and close the streams
    OutputStream out = new FileOutputStream(new File("c:/document.html"));
    XHTMLConverter.getInstance().convert(document, out, options);
    out.close();
    in.close();

I hope this solves your issue.

Virgule answered 15/6, 2015 at 7:15 Comment(1)
Please share the solution itself instead of posting a link to it.Subscript
C
0

You'd have to find the MS Word doc specification (since the format is basically a binary dump of whatever is in Word at that point in time) and slowly go through it element by element, converting MS Word "objects/states" to their HTML equivalents. You might be able to find a script to do it for you, since this really isn't fun work and I'd advise against it (converting file formats, or even reading commercial file formats on your own, is always hard and often incomplete). PS: just google doc2html.

Cox answered 22/10, 2008 at 19:48 Comment(2)
Have you ever looked at the specification? (Scratch that, have you ever investigated the inconsistencies between the .rtf file containing the spec, to the specified format?) -- This is unfeasible, way, way too much work while there are other solutions available.Antibody
i did say it was hard and specifications were often incomplete, and advised against it.Cox
O
0

If you are targeting Word 2007 files using the OOXML format, then this article might help. There is also the Ooxml4j project, which is implementing an OOXML library for Java.

If you are targeting the binary files though... that's another problem.

Obbligato answered 22/10, 2008 at 20:20 Comment(0)
P
0

    import officetools.OfficeFile; // package available at www.dancrintea.ro/doc-to-pdf/
    ...
    FileInputStream fis = new FileInputStream(new File("test.doc"));
    FileOutputStream fos = new FileOutputStream(new File("test.html"));
    OfficeFile f = new OfficeFile(fis,"localhost","8100", true);
    f.convert(fos,"html");

All possible conversions:

doc --> pdf, html, txt, rtf

xls --> pdf, html, csv

ppt --> pdf, swf

html --> pdf

Prodrome answered 5/2, 2009 at 11:44 Comment(0)
A
0

You can use Microsoft Office Online.

First, on the server side, request https://view.officeapps.live.com/op/view.aspx?src='your doc file online url'

Then use jsoup to parse the resulting HTML.

When accessed from a mobile device, the HTML will be wrapped in a frame.
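Building the request URL could look like the sketch below. It assumes the `src` parameter should be URL-encoded, which is the usual practice for a URL passed as a query parameter but is not stated above; the class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: builds the viewer URL for a publicly reachable document. */
final class OfficeViewerUrl {

    static final String VIEWER = "https://view.officeapps.live.com/op/view.aspx?src=";

    /** publicDocUrl must be reachable from the internet, since the viewer fetches it itself. */
    static String forDocument(String publicDocUrl) {
        return VIEWER + URLEncoder.encode(publicDocUrl, StandardCharsets.UTF_8);
    }
}
```

The returned URL is what would then be fetched on the server side and handed to jsoup for parsing.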

Alloy answered 27/1, 2019 at 3:8 Comment(0)
