Python ISRIStemmer for Arabic text
D

7

6

I am running the following code in IDLE (Python). I want to enter an Arabic string and get its stem, but it doesn't work:

>>> from nltk.stem.isri import ISRIStemmer
>>> st = ISRIStemmer()
>>> w= 'حركات'
>>> join = w.decode('Windows-1256')
>>> print st.stem(join).encode('Windows-1256').decode('utf-8')

The result is the same text as w, 'حركات', which is not the stem.

But when I do the following:

>>> print st.stem(u'اعلاميون')

it succeeds and returns the stem, 'علم'.

Why does stem() fail to return the stem for some words?

Demarcate answered 1/2, 2014 at 0:17 Comment(0)
D
5

Ok, I solved the problem myself using the following:

w = 'حركات' 
st.stem(w.decode('utf-8'))

and it gives the root correctly which is "حرك"
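
To see why the codec matters, here is a small Python 3 sketch. It is an illustration, not the asker's exact environment: it assumes the bytes are UTF-8, which is what the fix above implies, and shows that decoding them as Windows-1256 produces different characters, so the stemmer never receives the real word. The variable names are made up for the example.

```python
word = "حركات"
raw = word.encode("utf-8")  # UTF-8 uses two bytes per Arabic letter here

# Wrong codec: Windows-1256 is single-byte, so every byte becomes its own character.
garbled = raw.decode("windows-1256", errors="replace")

# Right codec: the original letters come back unchanged.
restored = raw.decode("utf-8")

print(len(raw), len(garbled), len(restored))
print(restored == word, garbled == word)
```

Only the correctly decoded string is worth passing to st.stem().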

Demarcate answered 2/2, 2014 at 2:33 Comment(0)
M
8

The code above won't work in Python 3: str objects are already Unicode text and have no decode() method, so there is no need to decode from UTF-8 anymore.

Here is the new code that should work just fine in Python 3.

import nltk
from nltk.stem.isri import ISRIStemmer
st = ISRIStemmer()
w= 'حركات'
print(st.stem(w))
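
If the input may arrive either as text or as raw bytes (for example, read from a file opened in binary mode), a small guard can normalize it first. This is a minimal sketch; to_text is a hypothetical helper name, not part of NLTK, and UTF-8 is an assumed default.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode str, decoding only when it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

word = "حركات"
raw = word.encode("utf-8")  # stands in for bytes read from a binary source

print(to_text(word) == to_text(raw))  # both paths yield the same str
print(hasattr(word, "decode"))        # Python 3 str has no decode method
```

The result of to_text(...) can then be passed to st.stem() as in the snippet above.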
Malchy answered 16/3, 2017 at 16:53 Comment(0)
A
3

There is a new light Arabic stemmer here, developed with the Snowball framework.

Ahasuerus answered 6/12, 2016 at 19:45 Comment(0)
G
2

You can use this snippet to directly stem text:

from nltk import word_tokenize
from nltk.stem.isri import ISRIStemmer

st = ISRIStemmer()

w= " البحث العلمي أو البحث أو التجربة التنموية هو أسلوب منظم في جمع المعلومات الموثوقة وتدوين الملاحظات والتحليل الموضوعي لتلك المعلومات باتباع أساليب ومناهج علمية محددة بقصد التأكد من صحتها أو تعديلها أو إضافة الجديد لها، ومن ثم التوصل إلى بعض القوانين والنظريات والتنبؤ بحدوث مثل هذه الظواهر والتحكم في أسبابها"

for a in word_tokenize(w):
    print(st.stem(a))
Geller answered 1/3, 2019 at 14:45 Comment(0)
U
1

Here is another example of how to use the stemmer (of course, you can remove stop words first!):

from nltk import word_tokenize
from nltk.stem.isri import ISRIStemmer

st = ISRIStemmer()
word_list = "عرض يستخدم الى التفاعل مع المستخدمين في هاذا المجال !وعلمآ تكون الخدمه للستطلاع على الخدمات والعروض المقدمة"

# Define a function (named stem_words so it does not shadow the built-in filter)
def stem_words(word_list):
    wordsfilter=[]
    for a in word_tokenize(word_list):
        stem = st.stem(a)
        wordsfilter.append(stem)
    print(wordsfilter)

# Call the function
stem_words(word_list)

Here is the result:

['عرض', 'خدم', 'الى', 'فعل', 'مع', 'خدم', 'في', 'هذا', 'جال', '!', 'علمآ', 'تكون', 'خدم', 'طلع', 'على', 'خدم', 'عرض', 'قدم']
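
The stop-word removal mentioned above could look roughly like the sketch below. The stop-word set is a tiny hypothetical list chosen for illustration, stem_content_words is a made-up name, and the stemmer is passed in as a callable so the snippet runs without NLTK; in practice you would pass st.stem and tokens from word_tokenize.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"في", "على", "مع", "الى", "هذا"}

def stem_content_words(tokens, stem, stop_words=STOP_WORDS):
    """Stem every token that is neither a stop word nor pure punctuation."""
    punctuation = set(string.punctuation) | {"،", "؛", "؟"}
    stems = []
    for token in tokens:
        if token in stop_words:
            continue  # skip stop words entirely
        if all(ch in punctuation for ch in token):
            continue  # skip tokens such as "!" or "،"
        stems.append(stem(token))
    return stems

tokens = "عرض يستخدم الى التفاعل مع المستخدمين !".split()
# The identity function stands in for st.stem so the filtering is easy to see.
kept = stem_content_words(tokens, stem=lambda t: t)
print(kept)
```

With a real stemmer the call would be stem_content_words(word_tokenize(text), st.stem).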
Unmannered answered 4/10, 2020 at 8:13 Comment(0)
P
0

Well, just notice that your two strings differ only by a "u" at the beginning of the second one:

w = 'حركات'
w2 = u'اعلاميون'

But that tiny "u" makes all the difference in Python 2: w is a byte string (str) holding encoded bytes, UTF-8 in this case, while w2 is a Unicode string.

Hence all you really need to do is define your string as a Unicode string; then you can use the stem function normally, without any extra decoding step:

w = u'حركات'
print st.stem(w)
Predator answered 23/5, 2016 at 18:35 Comment(0)
E
-1

from nltk import word_tokenize
from nltk.stem.isri import ISRIStemmer

st = ISRIStemmer()
word_list = "من طلب العلا سهر الليالي"

# Define a function
def filter(word_list):
    wordsfilter=[]
    for a in word_tokenize(word_list):
        stem = st.stem(a)
        wordsfilter.append(stem)
    print(wordsfilter)

# Call the function
filter(word_list)

Emogene answered 18/12, 2023 at 3:8 Comment(1)
