Solr for Arabic

I'm using Solr to index documents in three languages (Arabic, French, and English), and I have used this fieldType:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

Everything works fine except for Arabic. When I search for a word like حقل, Solr doesn't find it, but when I enter the word with its letters in the opposite order (لقح), Solr finds it and returns results.

How can I get results for Arabic words?

Dardanus answered 20/10, 2011 at 10:13 Comment(5)
I don't know of any mechanism that could reverse the order of RTL text in Solr. Generally, folks find that they want some sort of lemmatization in Arabic to deal with all the inflected forms. What are you using to build the UI that you are typing the search terms into?Leesen
I'm using a web page; in my tests I also use the SolrJ API directly from Eclipse.Dardanus
Are you by any chance extracting your text from PDF files? If so, there seems to be a known problem with Tika: issues.apache.org/jira/browse/…Godfree
Thank you Daniel and bmargulies. Yes, I'm using Tika to extract text from PDF files, and the extracted text came out in reverse order. Is there another way to extract data from PDF files?Dardanus
We submitted patches to PDFBox that cause it to extract Arabic text correctly. I wonder if Tika has a current copy of PDFBox? In any case, please submit a JIRA at Apache Tika.Leesen
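
On the lemmatization point raised in the comments above: Solr ships Arabic normalization and light-stemming filters (a stemmer, not a full lemmatizer), and a dedicated Arabic field type might look roughly like the sketch below. The field type name `text_ar` and the stopword file path `lang/stopwords_ar.txt` are assumptions modeled on Solr's example schema; adjust them to your own configuration. This does not address the reversed-text problem, which comes from extraction rather than analysis.

```xml
<!-- Sketch only: the name "text_ar" and the stopwords path are assumptions -->
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" enablePositionIncrements="true"/>
        <!-- Normalizes Arabic orthographic variants (for example, alef forms and diacritics) -->
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <!-- Light stemmer that strips common Arabic prefixes and suffixes -->
        <filter class="solr.ArabicStemFilterFactory"/>
    </analyzer>
</fieldType>
```

Arabic documents would then go into a field of this type, while French and English content stays in fields with their own analyzers.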

I'm going to turn Daniel's clever analysis here into an answer for the record. Don't vote for this, just go find something of his to vote for :-)

There are two ways to get a directionality mismatch with RTL text. You can be indexing it backwards, or you can be querying it backwards. A simple HTML form querying Solr will never mess up directionality. In this case, khaled was extracting text from a PDF using a library that falls victim to the tendency of PDFs to contain 'visual-order' text rather than 'logical-order' text. So the index was full of backwards Arabic. To fix this, he will have to come up with a working library that extracts text from PDFs.

Forcing Apache Tika to use the latest Apache PDFBox might help, or his PDF may be so quirky that even the latest PDFBox can't handle it, in which case he has a hard problem.

Leesen answered 20/10, 2011 at 12:54 Comment(3)
Thank you bmargulies. I have included ICU4J.jar in my project, and now Tika can extract Arabic text without any problem.Dardanus
Khaled Mabrouk, I have the same issue. Could you please give the solution in the following question: #10077459Robbi
Hi Khaled, what do you mean by "include ICU4J" in the project? I have no idea how this can be done. Can anyone shed some light on this?Feudal
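
On what "including ICU4J" probably means: most likely putting the ICU4J jar on the classpath next to Tika and PDFBox, since PDFBox releases of that period treated ICU4J as an optional dependency for handling bidirectional text. The thread does not say which build tool was used, so the fragment below is only a sketch that assumes a Maven build. The `icu4j.version` property is a placeholder, not a version taken from this thread.

```xml
<!-- Sketch, assuming a Maven project; icu4j.version is a placeholder property you define yourself -->
<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>${icu4j.version}</version>
</dependency>
```

Without Maven, the equivalent would be to download the ICU4J jar and add it to the project's build path (for example, in Eclipse) or to the web application's `WEB-INF/lib` directory, then re-extract and re-index the PDFs.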
