The Stanford Arabic segmenter can't do true lemmatization. However, it is possible to train a new model to do something like stemming:
- تكتبون ← ت+ كتب +ون
- يتصل ← ي+ تصل
If it is essential that the output consist of real Arabic lemmas ("تصل" is not a true lemma), you might be better off with a tool like MADAMIRA (http://nlp.ldeo.columbia.edu/madamira/).
Elaboration: The Stanford Arabic segmenter produces its output character-by-character using only these operations (implemented in edu.stanford.nlp.international.arabic.process.IOBUtils):
- Split a word between two characters
- Transform lil- (للـ) into li+ al- (ل+ الـ)
- Transform ta (ت) or ha (ه) into ta marbuta (ة)
- Transform ya (ي) or alif (ا) into alif maqsura (ى)
- Transform alif maqsura (ى) into ya (ي)
So lemmatizing يتصل to ي+ اتصل would require implementing an extra rule, namely inserting an alif after ya or ta. Lemmatization of certain irregular forms would be completely impossible (for example, نساء ← امرأة).
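To make the limitation concrete, here is a rough Python sketch of the operation set listed above. It is a toy model, not the IOBUtils implementation: the names `split_word`, `reachable`, and `ALLOWED_REWRITES` are made up for illustration, and the lil- rule (the only one that changes a word's length) is left out.

```python
# Character rewrites from the list above, as (surface, restored) pairs.
ALLOWED_REWRITES = {
    ("\u062A", "\u0629"),  # ta -> ta marbuta
    ("\u0647", "\u0629"),  # ha -> ta marbuta
    ("\u064A", "\u0649"),  # ya -> alif maqsura
    ("\u0627", "\u0649"),  # alif -> alif maqsura
    ("\u0649", "\u064A"),  # alif maqsura -> ya
}


def split_word(word, boundaries):
    """Cut `word` before each index in `boundaries`; nothing is added or removed."""
    cuts = [0] + sorted(boundaries) + [len(word)]
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]


def reachable(surface, target):
    """True if `target` can be produced from `surface` by one-for-one rewrites only."""
    if len(surface) != len(target):
        return False  # these rewrites never insert or delete a character
    return all(s == t or (s, t) in ALLOWED_REWRITES
               for s, t in zip(surface, target))


segments = split_word("يتصل", [1])         # prefix ya, then the remainder
print(segments)
print(reachable(segments[1], "اتصل"))      # needs an inserted alif
print(reachable("مدرست", "مدرسة"))         # ta -> ta marbuta is allowed
```

In this model the remainder after the split can only be rewritten character for character, so a form that needs an extra alif is out of reach without a new rule.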
The version of the Stanford segmenter available for download also breaks off only pronouns and particles:
وسيكتشفونه ← و+ س+ يكتشفون +ه
However, if you have access to the LDC Arabic Treebank or a similarly rich source of Arabic text with morphological segmentation annotated, it is possible to train your own model to remove all morphological affixes, which is closer to lemmatization:
وسيكتشفونه ← و+ س+ ي+ كتشف +ون +ه
Note that "كتشف" is not a real Arabic word, but the segmenter should at least consistently produce "كتشف" for تكتشفين ,أكتشف ,يكتشف, etc. If this is acceptable, you would need to change the ATB preprocessing script to use the morphological segmentation annotations instead. You could do this by replacing the script called parse_integrated with a modified version like this: https://gist.github.com/futurulus/38307d98992e7fdeec0d
Then follow the instructions for "TRAINING THE SEGMENTER" in the README.
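Once a model like that is trained, the stem still has to be pulled out of the segmented output. A minimal Python sketch, assuming the output is marked as in the examples in this answer (prefixes end in "+", suffixes start with "+", segments are separated by spaces); `stem_of` is a hypothetical helper, not part of the Stanford tools:

```python
def stem_of(segmented):
    """Return the segments that carry no '+' marker, i.e. the stem(s)."""
    return [seg for seg in segmented.split()
            if not seg.startswith("+") and not seg.endswith("+")]


# Built from a list so the position of each '+' marker is unambiguous.
full_model = " ".join(["و+", "س+", "ي+", "كتشف", "+ون", "+ه"])
default_model = " ".join(["و+", "س+", "يكتشفون", "+ه"])

print(stem_of(full_model))     # stem with all affixes removed
print(stem_of(default_model))  # only pronouns and particles removed
```

If your segmenter is configured to mark affixes differently, the two checks in `stem_of` would need to change accordingly.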