How can I detect Farsi web pages with Tika?

I need sample code that detects Farsi-language web pages with the Apache Tika toolkit.

    import org.apache.tika.language.LanguageIdentifier;

    LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
    String language = identifier.getLanguage();

I have downloaded the Apache Tika JAR files and added them to the classpath. This code works for English but gives an error for Farsi. How can I add Farsi to Tika's LanguageIdentifier?

Dissimilar answered 28/1, 2012 at 11:30 Comment(3)
What kind of error? Please post the stack trace. - Chartier
It is surprising, but I don't get an error now. The problem is that the detection is wrong: it returns "lt", which means Lithuanian, instead of Persian (Farsi). - Dissimilar
My question is how Tika detects languages. Which files does it use? For example, if it uses stop words for each language, where can I add the stop words for Farsi? - Dissimilar

Tika doesn't ship with a language profile for Farsi yet. As of version 1.0, 27 languages are supported out of the box:

languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk

In your example the input is misdetected as lt (Lithuanian) with a distance of 0.41, which is above the certainty threshold of 0.022, so the result should not be treated as reliable. See the source code for more information on the inner workings of LanguageIdentifier.
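As a rough illustration of what a distance between n-gram profiles can look like, here is a simplified sketch that counts character trigrams and takes the Euclidean distance between their relative frequencies. This is a toy under stated assumptions, not Tika's implementation: the class NgramDistanceSketch, its method names, and the sample strings are made up for illustration, and the real LanguageIdentifier works from its pre-built .ngp profiles.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; it is not part of Tika.
class NgramDistanceSketch {

    // Count every overlapping character trigram in the text.
    static Map<String, Integer> trigramCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Euclidean distance between the relative trigram frequencies of two profiles.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        // Guard against division by zero for texts shorter than three characters.
        double totalA = Math.max(1, a.values().stream().mapToInt(Integer::intValue).sum());
        double totalB = Math.max(1, b.values().stream().mapToInt(Integer::intValue).sum());

        Set<String> allTrigrams = new HashSet<>(a.keySet());
        allTrigrams.addAll(b.keySet());

        double sumOfSquares = 0.0;
        for (String trigram : allTrigrams) {
            double diff = a.getOrDefault(trigram, 0) / totalA
                        - b.getOrDefault(trigram, 0) / totalB;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        // Tiny made-up samples; a usable profile needs a large corpus.
        Map<String, Integer> english = trigramCounts("the quick brown fox jumps over the lazy dog");
        Map<String, Integer> persian = trigramCounts("این یک متن فارسی است");
        Map<String, Integer> input = trigramCounts("زبان فارسی");

        System.out.println("distance to English sample: " + distance(input, english));
        System.out.println("distance to Persian sample: " + distance(input, persian));
    }
}
```

The numbers this sketch prints are not comparable to the 0.41 and 0.022 figures above. It only shows the general idea: when no profile resembles the input, even the closest one is still far away.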

The Farsi language (Persian, ISO 639-1 2-letter code fa) is not recognized by default. If you want Tika to recognize another language, you have to create a language profile first.

For this the following steps are necessary:

  1. Find a text corpus for your language. I found the Hamshahri Collection. This should be sufficient. Download the corpus or parts of it and create a plain text file out of the XML.

  2. Create an ngram file for the language identifier. This can be done using TikaCLI:

    java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt

     This will create a file called fa.ngp, which contains the n-grams.

  3. Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles() or put a property file with the name tika.language.override.properties into the classpath. Make sure the ngram file is in the classpath as well.
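For step 1, one possible way to turn an XML corpus into plain text is to stream through it and keep only the character data. The sketch below uses the JDK's StAX parser. The class name XmlToPlainTextSketch and the element names in the sample document are hypothetical, since the actual layout of the corpus files is not described here.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; it is not part of Tika.
class XmlToPlainTextSketch {

    // Return the text content of all elements, one non-empty chunk per line.
    static String extractText(Reader xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent text events so each text node arrives as one chunk.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(xml);

            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    String chunk = reader.getText().trim();
                    if (!chunk.isEmpty()) {
                        text.append(chunk).append('\n');
                    }
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
    }

    public static void main(String[] args) {
        // Made-up document; the element names are placeholders.
        String sample = "<DOC><TITLE>نمونه</TITLE><TEXT>این یک متن فارسی است</TEXT></DOC>";
        System.out.print(extractText(new StringReader(sample)));
    }
}
```

To feed step 2, the returned string could be written out as UTF-8 to fa-corpus.txt, for example with java.nio.file.Files.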

If you now run Tika, it should correctly detect your language.

Update: Detailed the steps necessary to create a language profile.

Eade answered 28/1, 2012 at 12:56 Comment(6)
I followed the link, but I don't understand how to create the language profile. Can you help me? - Dissimilar
In fact, I asked my question about creating a language profile at the link below: #6228065 - Dissimilar
After finding a large enough text corpus, what should I do? - Dissimilar
Thanks a lot. Can you give some sample code showing how to use TikaCLI? - Dissimilar
You missed out the final step of contributing the new profile back to Tika! :) - Petunia
stackoverflow.com/users/283200/kai-sternad Looks like '--create-profile' has been dropped in newer versions of Tika. How does one go about contributing new profiles? - Electrobiology
