I am trying to search Arabic PDFs in Apache Solr. The problem appears to be that Tika indexes the PDF text in reverse order (left-to-right instead of right-to-left).
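To illustrate what I mean by reverse order, here is a minimal sketch. It is not taken from my Solr setup: the sample word, the variable names, and the `matches` helper are all hypothetical, and a plain substring test stands in for the real index lookup. It only assumes that the extracted text ends up with its characters in the opposite order from how they were typed.

```python
# Hypothetical illustration only; nothing here comes from Solr or Tika.
logical = "مرحبا"        # the word as typed, and as a user would search for it
visual = logical[::-1]   # the same characters in reversed order (assumed extraction result)


def matches(query: str, indexed_text: str) -> bool:
    """Naive substring search standing in for an index lookup."""
    return query in indexed_text


print(matches(logical, logical))  # text stored in the correct order is found
print(matches(logical, visual))   # text stored in reversed order is not found
```

If this is what is happening, it would explain why a query for a word that is clearly in the PDF returns nothing.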
I have found references about this problem here:
- Solr for Arabic
- How to parse arabic pdf with Tika
- http://www.linnovate.net/blog/apache-solr-search-hebrew-and-probably-arabic-documents-drupal-pdf-problem-solution
However, I don't know how to include the latest version of PDFBox or ICU4J in my Apache Solr installation. My Apache Solr Contrib/extraction/lib folder contains pdfbox-1.6.0.jar and icu4j-4.8.1.1.jar. Will removing these files and replacing them with the latest libraries from their project pages be enough to force Tika to use them?
Please explain, as I have no previous experience with Java servlets. Thanks!