Tika AutoDetectParser returning empty string?

I'm attempting to use Tika's AutoDetectParser to extract a file's content. I originally thought this was a dependency issue, but I can't see how that could still be true now that I'm including all of tika-app in my jar.

AutoDetectParser returns an empty string here:

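import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

BodyContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
FileInputStream mypdfstream = new FileInputStream(new File("mypdf.pdf"));
parser.parse(mypdfstream, handler, metadata, context);
System.out.println(handler.toString()); // prints an empty string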

What confuses me further is that using PDFParser directly works fine:

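import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
FileInputStream mypdfstream = new FileInputStream(new File("mypdf.pdf"));
PDFParser pdfparser = new PDFParser();
pdfparser.parse(mypdfstream, handler, metadata, context);
System.out.println(handler.toString()); // prints the PDF text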

I have put both the tika-app and tika-parsers jars on my compile classpath and merged them into the jar created by Ant.

Relevant portions of build.xml:

<javac srcdir="${src}" destdir="${build}">
    <classpath>
        <pathelement path="lib/tika-app-1.11.jar"/>
        <pathelement path="lib/tika-parsers-1.11.jar"/>
    </classpath>
</javac>

<jar jarfile="${dist}/lib/MyProject-${DSTAMP}.jar" basedir="${build}">
        <zipgroupfileset dir="lib" includes="tika-app-1.11.jar"/>
        <zipgroupfileset dir="lib" includes="tika-parsers-1.11.jar"/>
</jar>

Edit: I looked at the list of supported types with parser.getSupportedTypes(context) and it was empty, as is the list of parsers returned from parser.getParsers().

So perhaps this is yet another dependency issue? This truly surprises me given tika-app is included.
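
An empty parser list usually means Tika's parser discovery found nothing at runtime. Tika locates its parsers through a service file named META-INF/services/org.apache.tika.parser.Parser, so one way to narrow this down is to check whether that file is visible to the class loader in the environment where the code actually runs. The following is a minimal diagnostic sketch that uses only the JDK; the class name ServiceFileCheck and the method findResources are hypothetical names chosen for illustration, not part of Tika.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of Tika): lists every copy of a classpath
// resource that the current class loader can see.
class ServiceFileCheck {

    // Service file consulted when Tika discovers parsers.
    static final String TIKA_PARSER_SERVICE =
            "META-INF/services/org.apache.tika.parser.Parser";

    // Returns the URLs of all resources with the given name, or an empty list.
    static List<URL> findResources(String name) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return new ArrayList<>(Collections.list(loader.getResources(name)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> found = findResources(TIKA_PARSER_SERVICE);
        if (found.isEmpty()) {
            // Nothing to discover, so AutoDetectParser would have no parsers.
            System.out.println("No parser service file is visible at runtime.");
        } else {
            for (URL url : found) {
                System.out.println("Found service file: " + url);
            }
        }
    }
}
```

If this prints no service file when run in the same way as the failing job, the Tika jars (or their META-INF/services entries) are probably not reaching the runtime classpath, even though compilation succeeds. That would also be consistent with PDFParser working when instantiated directly, since a direct constructor call does not go through service discovery.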

Panhandle answered 21/12, 2015 at 20:4 Comment(6)
Did you try following the Tika Troubleshooting - No Content Extracted guide from Apache Tika? Also, are you aware that bundling a jar in a jar doesn't automatically put it on the classpath? - Dutybound
So my issue is that it appears not to be aware of the parsers at runtime. I was under the impression that the lines in the Ant file above put the jar on the classpath. I'm able to get this to work in my environment if I do an export CLASSPATH=/location/of/tika/app. What is the "proper" way of doing this with Ant? I'm generally confused by compile classpath vs. runtime classpath. - Panhandle
If you're happy to let Ant launch the program for you, just use a <java> task with a classpath entry that pulls in the required jars. If not, look at Apache Ivy, which will let you pull down dependencies from within Ant. - Dutybound
Well, I'm submitting the jar to Spark and I'm on a machine that has no internet access. I have previously seen some manifest classpath tags, but they did not function in the way I expected them to. - Panhandle
I would suggest you ask a fresh question on how to get your code and all the dependency jars correctly bundled and deployed to Spark. Once you know how to get that right, that may solve this issue, or may get you closer. - Dutybound
@Dutybound Should the troubleshooting guide read tika.getParser() rather than parser.getParser() on the 4th line in this code snippet: wiki.apache.org/tika/Troubleshooting%20Tika#Tika_Facade-2? - Landward

I had the same issue. I fixed it by adding the Tika core and parsers dependencies to my pom.xml again, like this, and then running Update Maven in Eclipse.

    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.18</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.18</version>
    </dependency>
Respirator answered 24/9, 2018 at 22:5 Comment(0)

For me, downgrading the Tika libraries to 1.18 worked. It's a shame that newer versions did not work for me.

Cori answered 11/9, 2023 at 8:47 Comment(0)
