Solr for Arabic PDFs

I am trying to search Arabic PDFs in Apache Solr. The problem appears to be that Tika extracts the text of each PDF in reverse order (left-to-right) instead of right-to-left.

I have found references about this problem here:

However, I don't know how to include the latest version of PDFBox or ICU4J in my Apache Solr installation. My Apache Solr contrib/extraction/lib folder contains pdfbox-1.6.0.jar and icu4j-4.8.1.1.jar. Would removing these files and replacing them with the latest libraries from their project pages be enough to make Tika use them?

Please explain, as I have no previous experience with Java servlets. Thanks!

Herein answered 27/11, 2012 at 17:27

From the tags on your question, I assume that you are using Drupal to interface with Apache Solr. Tika can run inside Solr when you send it binary documents, or you can run it yourself before sending the documents to Solr. The Drupal Solr Attachments module has a setting for the latter: "Tika (local java application)". In the second link you provided, they patched the Solr Attachments module to use PDFBox instead of Tika to parse the binary files before sending them to Solr. If you are not using Drupal, you should try a similar approach.
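Outside Drupal, the same idea (extract the text yourself, then send plain text to Solr) could look roughly like the sketch below. It is only an illustration under several assumptions that are not from the links above: the jar path, the Solr update URL and the `content` field name are hypothetical, and the `ExtractText` command line is the one shipped with the PDFBox 1.x/2.x app jar. Whether the Arabic text comes out in the right order still depends on the PDFBox version, so check the extracted file for a sample PDF before indexing.

```python
import json
import subprocess
import urllib.request

# Hypothetical values: adjust to your own installation.
PDFBOX_JAR = "pdfbox-app.jar"  # path to the PDFBox command-line app jar
SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json?commit=true"


def extract_command(pdf_path, txt_path, jar=PDFBOX_JAR):
    """Build the PDFBox ExtractText command line (1.x/2.x syntax)."""
    return ["java", "-jar", jar, "ExtractText",
            "-encoding", "UTF-8", pdf_path, txt_path]


def solr_payload(doc_id, text):
    """Serialize one document as a JSON update body.

    The "content" field name is an assumption; it must exist in your schema.
    ensure_ascii=False keeps the Arabic characters as UTF-8 instead of escapes.
    """
    docs = [{"id": doc_id, "content": text}]
    return json.dumps(docs, ensure_ascii=False).encode("utf-8")


def index_pdf(pdf_path, doc_id):
    """Extract text with PDFBox, then post it to Solr as plain text."""
    txt_path = pdf_path + ".txt"
    subprocess.run(extract_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as fh:
        text = fh.read()
    request = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=solr_payload(doc_id, text),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With this layout Solr never sees the PDF itself, so the PDFBox version used for extraction can be changed without touching the jars in Solr's contrib/extraction/lib folder.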

Philipson answered 28/2, 2013 at 18:57
