Apache Tika and document metadata

I'm doing simple processing of a variety of documents (ODS, MS Office, PDF) using Apache Tika. I have to get at least:

word count, author, title, timestamps, language etc.

which is not so easy. My strategy is to use the Template Method pattern for 6 document types: I detect the type of the document first, and then process it with type-specific code.

I know that Apache Tika should remove the need for this, but the document formats are quite different, right?

For instance

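// Parse one document and collect its text and metadata.
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new OfficeParser();
// try-with-resources closes the stream even if parse() throws
try (InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc)) {
    parser.parse(input, textHandler, metadata, new ParseContext());
}

// List every metadata key the parser reported for this document.
for (String name : metadata.names()) {
    System.out.println("Metadata name : " + name);
}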

I tried this for ODS, MS Office, and PDF documents, and the metadata differs a lot. There is an MSOffice interface that lists metadata keys for MS documents, and a Dublin Core list of metadata keys. But how should one implement an application like this?

Could anybody who has experience with this please share it? Thank you.
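Since the key names differ from format to format, one possible approach is a small alias table that maps whatever keys a parser reports onto a fixed set of canonical fields. The following is only a sketch using the standard library: the class `MetadataNormalizer`, the canonical field names, and the candidate key lists are assumptions for illustration, and the keys should be checked against what `metadata.names()` actually prints for your documents and Tika version.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps format-specific metadata keys onto canonical fields. */
final class MetadataNormalizer {

    // Canonical field -> candidate keys, tried in order.
    // These key names are illustrative assumptions, not a complete or verified list.
    private static final Map<String, List<String>> ALIASES = new LinkedHashMap<>();
    static {
        ALIASES.put("author", Arrays.asList("dc:creator", "meta:author", "Author"));
        ALIASES.put("title", Arrays.asList("dc:title", "title"));
        ALIASES.put("created", Arrays.asList("dcterms:created", "meta:creation-date", "Creation-Date"));
        ALIASES.put("wordCount", Arrays.asList("meta:word-count", "Word-Count"));
    }

    /** Returns canonical field -> first non-blank value found; missing fields are omitted. */
    static Map<String, String> normalize(Map<String, String> raw) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : ALIASES.entrySet()) {
            for (String key : entry.getValue()) {
                String value = raw.get(key);
                if (value != null && !value.trim().isEmpty()) {
                    result.put(entry.getKey(), value.trim());
                    break; // the first matching alias wins
                }
            }
        }
        return result;
    }
}
```

To feed it from Tika, the `Metadata` object can be copied into a plain map by looping over `metadata.names()` and calling `metadata.get(name)` for each key. Fields that no alias matches are simply absent from the result, so the caller can tell "not reported" apart from an empty value.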

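Metadata metadata = new Metadata();
// The file name is a hint that helps type detection
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();
// AutoDetectParser detects the type and delegates to the matching parser
Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, context);
// Constant-first equals() avoids a NullPointerException if no type was set
if ("application/pdf".equals(metadata.get(Metadata.CONTENT_TYPE))) {
    // Do something special with the PDF metadata here
}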
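If several formats need special treatment, the single content-type check above could grow into a lookup table of per-type handlers instead of a chain of `if` statements. This is a rough sketch with the standard library only: `ContentTypeDispatcher`, its methods, and the handler signature (metadata map in, string out) are hypothetical names chosen for illustration, not part of Tika.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: picks a per-format handler from the detected content type. */
final class ContentTypeDispatcher {

    private final Map<String, Function<Map<String, String>, String>> handlers = new HashMap<>();
    private final Function<Map<String, String>, String> fallback;

    /** The fallback runs for any content type without a registered handler. */
    ContentTypeDispatcher(Function<Map<String, String>, String> fallback) {
        this.fallback = fallback;
    }

    void register(String mimeType, Function<Map<String, String>, String> handler) {
        handlers.put(baseType(mimeType), handler);
    }

    /** Strips parameters such as "; charset=UTF-8" and lower-cases the type. */
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Applies the handler registered for the content type, or the fallback. */
    String dispatch(String contentType, Map<String, String> metadata) {
        return handlers.getOrDefault(baseType(contentType), fallback).apply(metadata);
    }
}
```

A caller would register one handler per type of interest, for example `application/pdf`, and pass in the value of `metadata.get(Metadata.CONTENT_TYPE)` after parsing. Stripping parameters first matters because a reported content type may carry a suffix such as a charset.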
