I'm trying to parse a few PDF files that contain engineering drawings to obtain text data in the files. I tried using TIKA as a jar with python and using it with the jnius package (using this tutorial here: http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html) but the code throws an error.
Using the TIKA package however I was able to pass files and parse them but Python is only able to extract metadata and when asked to parse content, Python returns output "none". It is able to perfectly parse .txt files but fails for content extraction for PDFs. Here's the code
import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('/path/to/file')
print parsed["metadata"]
print parsed["content"]
Do I require additional packages/codelines to be able to extract the data?