Removing Arabic Diacritics using Python

I want to filter my text by removing Arabic diacritics using Python.

For example:

Before filtering: اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا
After filtering: اللهم اغفر لنا ولوالدينا

I have found that this can be done using CAMeL Tools but I am not sure how.

Ackerley answered 7/4, 2021 at 14:25 Comment(4)
Looks like this answer has what you are looking for? https://mcmap.net/q/40948/-what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string - Capable
Unfortunately, it does not work for Arabic. - Ackerley
The trick is to ensure the text is normalised to NFC, rather than NFD, then strip non-spacing marks (diacritics). - Rhythmandblues
''.join(c for c in unicodedata.normalize('NFC', text) if unicodedata.category(c) != 'Mn') - Rhythmandblues
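The one-liner from the comment above can be wrapped in a small helper. This is only a minimal sketch using the standard library; the name strip_marks is hypothetical, and the sample strings are written as escapes so the code points are unambiguous.

```python
import unicodedata


def strip_marks(text):
    """Compose to NFC, then drop every non-spacing mark (category 'Mn')."""
    composed = unicodedata.normalize('NFC', text)
    return ''.join(c for c in composed if unicodedata.category(c) != 'Mn')


# KAF + FATHA + TEH + FATHA + BEH + FATHA
vocalized = '\u0643\u064e\u062a\u064e\u0628\u064e'
bare = strip_marks(vocalized)

# ALEF WITH MADDA ABOVE (U+0622) is one code point under NFC, so it survives.
# Under NFD it would decompose into ALEF + MADDAH ABOVE and lose the madda.
alef_madda = strip_marks('\u0622')
```

Note that this removes non-spacing marks from any script, not only Arabic, so it may be too broad for mixed-language text.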

You can use the library pyArabic like this:

import pyarabic.araby as araby

before_filter="اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا"
after_filter = araby.strip_diacritics(before_filter)

print(after_filter)
# will print : اللهم اغفر لنا ولوالدينا

You can try different strip filters:

araby.strip_harakat(before_filter)  # 'اللّهمّ اغفر لنا ولوالدينا'
araby.strip_lastharaka(before_filter)  # 'اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا'
araby.strip_shadda(before_filter)  # 'اللَهمَ اغْفِرْ لنَا ولوالدِينَا'
araby.strip_small(before_filter)  # 'اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا'
araby.strip_tashkeel(before_filter)  # 'اللهم اغفر لنا ولوالدينا'
araby.strip_tatweel(before_filter)  # 'اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا'
Tricycle answered 7/4, 2021 at 14:44 Comment(0)

You don't need a third-party library for this; a plain regex with the standard re module is enough:

import re
text = 'اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا '    
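# Character class: fatha, damma, kasra, shadda, sukun, dammatan, fathatan, kasratan,
# tatweel (U+0640) and the presentation-form ligature U+FC62.
output = re.sub(u'[\u064e\u064f\u0650\u0651\u0652\u064c\u064b\u064d\u0640\ufc62]', '', text)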
print(output)
#اللهم اغفر لنا ولوالدينا 
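If the same substitution runs many times, the pattern could be compiled once and written as a range. The sketch below is one possible variant, not the answer's exact pattern: it covers U+064B to U+0652 plus tatweel and leaves out U+FC62. The names ARABIC_DIACRITICS and remove_diacritics are hypothetical.

```python
import re

# U+064B..U+0652: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
# U+0640: tatweel (kashida), an elongation character rather than a diacritic.
ARABIC_DIACRITICS = re.compile('[\u064b-\u0652\u0640]')


def remove_diacritics(text):
    """Return text with the characters matched by ARABIC_DIACRITICS removed."""
    return ARABIC_DIACRITICS.sub('', text)


# MEEM DAMMA HAH FATHA MEEM SHADDA FATHA DAL
sample = '\u0645\u064f\u062d\u064e\u0645\u0651\u064e\u062f'
result = remove_diacritics(sample)
print(result)
```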
Cathexis answered 3/12, 2023 at 13:58 Comment(0)

One-liner:

text = 'text with Arabic Diacritics to be removed'    
text = ''.join([t for t in text if t not in ['ِ', 'ُ', 'ٓ', 'ٰ', 'ْ', 'ٌ', 'ٍ', 'ً', 'ّ', 'َ']])
print(text)

If you want the full list of Arabic diacritics, you can also get it from pyArabic. Here is a standalone example that builds the list with unicodedata instead:

import unicodedata
try:
    unichr
except NameError:
    unichr = chr

text = 'اللَّهمَّ اغْفِرْ لنَا ولوالدِينَا '    
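# Collect the non-spacing marks (category "Mn") of the Arabic block once,
# rather than rebuilding the list for every character of the text.
arabic_marks = {unichr(x) for x in range(0x0600, 0x0700) if unicodedata.category(unichr(x)) == "Mn"}
text = ''.join(t for t in text if t not in arabic_marks)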
print(text)
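The same idea can also be expressed with str.translate, which avoids a Python-level loop over the text. This is a sketch for Python 3 only; ARABIC_MARKS and strip_arabic_marks are hypothetical names.

```python
import unicodedata

# Map every non-spacing mark in the Arabic block (U+0600..U+06FF) to None,
# which tells str.translate to delete that character.
ARABIC_MARKS = {
    cp: None
    for cp in range(0x0600, 0x0700)
    if unicodedata.category(chr(cp)) == 'Mn'
}


def strip_arabic_marks(text):
    """Delete Arabic-block non-spacing marks; leave all other characters alone."""
    return text.translate(ARABIC_MARKS)


# SEEN FATHA LAM FATHA ALEF MEEM DAMMATAN
sample = '\u0633\u064e\u0644\u064e\u0627\u0645\u064c'
result = strip_arabic_marks(sample)
print(result)
```

Unlike stripping every "Mn" character, this leaves combining marks from other scripts in place.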
Endogen answered 20/7, 2022 at 5:11 Comment(0)
