I'm doing simple processing of variety of documents (ODS, MS office, pdf) using Apache Tika. I have to get at least :
word count, author, title, timestamps, language etc.
which is not so easy. My strategy is using Template method pattern for 6 types of document, where I find the type of document first, and based on that I process it individually.
I know that apache tika should remove the need for this, but the document formats are quite different right ?
For instance
InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc);
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new OfficeParser();
parser.parse(input, textHandler, metadata, new ParseContext());
input.close();
for(String s : metadata.names()) {
System.out.println("Metadata name : " + s);
}
I tried to do this for ODS, MS office, pdf documents, and the metadada differs a lot. There is MSOffice interface that lists metadata keys for MS documents and some Dublic Core metadata list. But how should one implement an application like this ?
Could please anybody who has experience with it share his experience ? Thank you