Arabic lemmatization and Stanford NLP

2

5

I am trying to perform lemmatization, i.e., identifying the lemma and possibly the Arabic root of a verb. For example: يتصل ==> lemma (infinitive of the verb) ==> اتصل ==> root (triliteral root / Jidr thoulathi) ==> و ص ل

Do you think Stanford NLP can do that?

Best Regards,

Seldun answered 19/3, 2015 at 17:33 Comment(2)
First Google result: nlp.stanford.edu/projects/arabic.shtml - Affettuoso
Thank you. I know about that, but what I am trying to do is lemmatize Arabic words using the Stanford NLP tools. - Seldun
12

The Stanford Arabic segmenter can't do true lemmatization. However, it is possible to train a new model to do something like stemming:

  • تكتبون ← ت+ كتب +ون
  • يتصل ← ي+ تصل

If it is very important that the output is real Arabic lemmas ("تصل" is not a true lemma), you might be better off with a tool like MADAMIRA (http://nlp.ldeo.columbia.edu/madamira/).

Elaboration: The Stanford Arabic segmenter produces its output character-by-character using only these operations (implemented in edu.stanford.nlp.international.arabic.process.IOBUtils):

  • Split a word between two characters
  • Transform lil- (للـ) into li+ al- (ل+ الـ)
  • Transform ta (ت) or ha (ه) into ta marbuta (ة)
  • Transform ya (ي) or alif (ا) into alif maqsura (ى)
  • Transform alif maqsura (ى) into ya (ي)
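
To make the list above concrete, here is a minimal Python sketch of how one operation per character could rebuild the segments. The operation names (keep, split, restore_al, to_ta_marbuta, to_alif_maqsura, to_ya) and the function apply_ops are hypothetical and chosen only for illustration; they are not the labels or the API used by IOBUtils.

```python
# Toy illustration only: these operation names are made up for this sketch
# and are NOT the labels used by Stanford's IOBUtils.
TA_MARBUTA = "\u0629"    # ة
ALIF_MAQSURA = "\u0649"  # ى
YA = "\u064A"            # ي
ALIF_LAM = "\u0627\u0644"  # ال


def apply_ops(word, ops):
    """Rebuild a list of segments from one operation per character."""
    if len(word) != len(ops):
        raise ValueError("need exactly one operation per character")
    segments = []
    current = ""
    for ch, op in zip(word, ops):
        if op == "keep":
            # Character stays in the current segment unchanged.
            current += ch
        elif op == "split":
            # A segment boundary falls just before this character.
            if current:
                segments.append(current)
            current = ch
        elif op == "restore_al":
            # Second lam of lil-: close the li- segment and start al-.
            if current:
                segments.append(current)
            current = ALIF_LAM
        elif op == "to_ta_marbuta":
            current += TA_MARBUTA
        elif op == "to_alif_maqsura":
            current += ALIF_MAQSURA
        elif op == "to_ya":
            current += YA
        else:
            raise ValueError("unknown operation: " + op)
    if current:
        segments.append(current)
    return segments


# lil- rewrite: the six letters of "to the book" become li + al-kitab.
lil_word = "\u0644\u0644\u0643\u062A\u0627\u0628"  # للكتاب
print(apply_ops(lil_word, ["keep", "restore_al", "keep", "keep", "keep", "keep"]))
```

Because every operation looks at a single character, anything that needs a letter to be inserted or a whole word to be swapped falls outside what this scheme can express.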

So lemmatizing يتصل to ي+ اتصل would require implementing an extra rule, i.e., to insert an alif after ya or ta. Lemmatization of certain irregular forms would be completely impossible (for example, نساء ← امرأة).
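
As a rough illustration of what such an extra rule and an irregular-form table might look like outside the segmenter, here is a deliberately naive Python sketch. The function naive_lemma and the table IRREGULAR_LEMMAS are hypothetical names; the rule is applied blindly to every word that starts with ya or ta, so it gives wrong results for verbs whose lemma does not begin with alif (for example, it would turn يكتب into اكتب, which is not the lemma).

```python
# Hypothetical post-processing sketch, not part of Stanford NLP.
# Irregular forms need a lookup table; the single entry below is the
# example from the text, and a real table would be far larger.
IRREGULAR_LEMMAS = {"\u0646\u0633\u0627\u0621": "\u0627\u0645\u0631\u0623\u0629"}  # نساء -> امرأة

IMPERFECT_PREFIXES = ("\u064A", "\u062A")  # ya (ي), ta (ت)
ALIF = "\u0627"  # ا


def naive_lemma(word):
    """Guess a lemma with one lookup table and one blind prefix rule."""
    # Irregular forms are resolved first, since no character rule covers them.
    if word in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[word]
    # Naive rule: drop a leading ya/ta and insert an alif in its place.
    # ASSUMPTION: this only suits verbs whose lemma begins with alif.
    if len(word) > 1 and word.startswith(IMPERFECT_PREFIXES):
        return ALIF + word[1:]
    # Anything else is returned unchanged.
    return word


print(naive_lemma("\u064A\u062A\u0635\u0644"))  # يتصل
print(naive_lemma("\u0646\u0633\u0627\u0621"))  # نساء
```

The overgeneration is the main reason a dedicated morphological analyzer is usually the better choice when real lemmas are required.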

The version of the Stanford segmenter available for download also only breaks off pronouns and particles:

وسيكتشفونه ← و+ س+ يكتشفون +ه

However, if you have access to the LDC Arabic Treebank or a similarly rich source of Arabic text with morphological segmentation annotated, it is possible to train your own model to remove all morphological affixes, which is closer to lemmatization:

وسيكتشفونه ← و+ س+ ي+ كتشف +ون +ه

Note that "كتشف" is not a real Arabic word, but the segmenter should at least consistently produce "كتشف" for تكتشفين ,أكتشف ,يكتشف, etc. If this is acceptable, you would need to change the ATB preprocessing script to instead use the morphological segmentation annotations. You could do this by replacing the script called parse_integrated with a modified version like this: https://gist.github.com/futurulus/38307d98992e7fdeec0d

Then follow the instructions for "TRAINING THE SEGMENTER" in the README.
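
Once a model produces full segmentations in the format shown above (prefixes written with a trailing "+", suffixes with a leading "+"), pulling out the stem and grouping word forms by it takes only a few lines. The following Python sketch assumes exactly that output format and exactly one stem per word; extract_stem and group_by_stem are hypothetical helper names, not part of the Stanford tools.

```python
from collections import defaultdict


def extract_stem(segmented):
    """Return the one token that is neither a prefix nor a suffix.

    ASSUMPTION: tokens are space-separated, prefixes end with '+',
    suffixes start with '+', and each word has exactly one stem.
    """
    stems = [
        token
        for token in segmented.split()
        if not token.endswith("+") and not token.startswith("+")
    ]
    if len(stems) != 1:
        raise ValueError("expected exactly one stem in: " + segmented)
    return stems[0]


def group_by_stem(segmented_words):
    """Map each stem to the segmented word forms that share it."""
    groups = defaultdict(list)
    for segmented in segmented_words:
        groups[extract_stem(segmented)].append(segmented)
    return dict(groups)


# Build the segmenter-style strings explicitly to keep the logical
# character order unambiguous despite right-to-left rendering.
stem = "\u0643\u062A\u0634\u0641"  # كتشف
forms = [
    " ".join(["\u064A" + "+", stem]),                      # ya prefix + stem
    " ".join(["\u062A" + "+", stem, "+" + "\u064A\u0646"]),  # ta prefix + stem + suffix
]
print(group_by_stem(forms))
```

Grouping by the shared stem gives a stemming-style index even though the stem itself is not a real Arabic word.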

Fluctuation answered 23/3, 2015 at 20:32 Comment(0)
3

I am not sure whether the Stanford NLP toolkit has a lemmatizer, but you can try Farasa.

The Farasa lemmatizer outperforms the MADAMIRA lemmatizer in accuracy. With an accuracy of about 97.23%, it gives a relative gain of about 7% over MADAMIRA on the lemmatization task.

You can read more about the Farasa lemmatizer at the following link: https://arxiv.org/pdf/1710.06700.pdf

Dexter answered 11/12, 2017 at 10:50 Comment(0)
