I tried converting .doc to HTML using WordToHtmlConverter,
and it worked perfectly.
But when I tried to convert .docx to HTML, I got stuck.
What I tried:
I used the code below, taken from "How to use Tika's XWPFWordExtractorDecorator class?",
to convert .docx to HTML:
InputStream input = TikaInputStream.get(new File("C:\\Users\\Downloads\\filename.docx"));
Parser parser = new AutoDetectParser();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
handler.setResult(new StreamResult(sw));
try {
    Metadata metadata = new Metadata();
    parser.parse(input, handler, metadata, new ParseContext());
    String xml = sw.toString();
    System.out.print("tika : " + xml);
} finally {
    input.close();
}
The output I got is:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body/>
</html>
- Where did I go wrong?
- Is there a better way to convert .docx to an HTML string?
I appreciate your help. Thanks.
.docx files are archives (you can open them with something like 7zip and view the contents) containing a bunch of XML files. With that in mind, you'd want to use something that can transform the XML into HTML. – Forequarter
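To illustrate the comment above, here is a minimal sketch that lists the entries inside a .docx using only the JDK's ZIP support. The class name DocxEntries and the method listEntries are made up for this example, and the file path is assumed to be passed as the first command-line argument. It only shows that the file is a ZIP of XML parts; it does not convert anything to HTML.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika or POI.
class DocxEntries {

    // Returns the entry names found in a ZIP-based stream such as a .docx file.
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the path to a .docx file.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : listEntries(in)) {
                System.out.println(name);
            }
        }
    }
}
```

For a typical Word document, the main body text should show up as an entry named word/document.xml, which is the part an XML-to-HTML transformation would need to read.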