I'm attempting to use Tika's AutoDetectParser to extract a file's content. I originally thought this was a dependency issue, but I cannot see how that could still be the case now that I'm including all of tika-app in my jar.
AutoDetectParser returns an empty string here:
BodyContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
FileInputStream mypdfstream = new FileInputStream(new File("mypdf.pdf"));
parser.parse(mypdfstream,handler,metadata,context);
System.out.println(handler.toString());
Further confusing me is the fact that using PDFParser directly works fine:
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
FileInputStream mypdfstream = new FileInputStream(new File("mypdf.pdf"));
PDFParser pdfparser = new PDFParser();
pdfparser.parse(mypdfstream,handler,metadata,context);
System.out.println(handler.toString());
I have included both the tika-app and tika-parsers jars on my classpath and bundled them into the jar created by Ant.
Relevant portions of build.xml:
<javac srcdir="${src}" destdir="${build}">
    <classpath>
        <pathelement path="lib/tika-app-1.11.jar"/>
        <pathelement path="lib/tika-parsers-1.11.jar"/>
    </classpath>
</javac>
<jar jarfile="${dist}/lib/MyProject-${DSTAMP}.jar" basedir="${build}">
    <zipgroupfileset dir="lib" includes="tika-app-1.11.jar"/>
    <zipgroupfileset dir="lib" includes="tika-parsers-1.11.jar"/>
</jar>
Edit: I looked at my list of supported types with parser.getSupportedTypes(context) and it was empty, as is the list of parsers returned from parser.getParsers().
So perhaps this is yet another dependency issue? This truly surprises me given tika-app is included.
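One way to narrow this down, offered as a rough sketch rather than a confirmed diagnosis: Tika's parser jars register their parsers through service descriptor files named META-INF/services/org.apache.tika.parser.Parser, and AutoDetectParser relies on those to build its parser list. If merging several jars into one leaves that file missing or incomplete, an empty parser list would be one plausible result. The class below uses only the JDK to list every copy of that descriptor visible on the runtime classpath. The class name ServiceFileCheck and the method findServiceFiles are made up for this illustration.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of Tika): lists service descriptor files on the classpath.
class ServiceFileCheck {

    // Descriptor path used by Tika parser jars to advertise Parser implementations.
    static final String TIKA_PARSER_SERVICE =
            "META-INF/services/org.apache.tika.parser.Parser";

    // Returns the URL of every resource with the given name that the class loader can see.
    static List<URL> findServiceFiles(String resourceName) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return Collections.list(loader.getResources(resourceName));
    }

    public static void main(String[] args) throws IOException {
        List<URL> found = findServiceFiles(TIKA_PARSER_SERVICE);
        if (found.isEmpty()) {
            // No descriptor visible: service-based parser discovery has nothing to load.
            System.out.println("No parser service descriptor found on the classpath.");
        } else {
            for (URL url : found) {
                System.out.println("Found descriptor: " + url);
            }
        }
    }
}
```

Running this from the merged jar and again with the original Tika jars on the classpath should show whether the descriptor survives the packaging step. If it is present in both cases, the cause probably lies elsewhere.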
<java task which has a classpath entry that pulls in the required jars. If not, look at Apache Ivy, which'll let you suck down dependencies from within Ant – Dutybound
tika.getParser() rather than parser.getParser() on the 4th line in this code snippet: wiki.apache.org/tika/Troubleshooting%20Tika#Tika_Facade-2? – Landward