Regex to check that an input string contains only Persian characters
P

7

42

I work with MVC and I am new to it. I want to check that input values contain only Persian characters, using [RegularExpression] validation. I think I need a regex that checks a range of Unicode code points, but I don't know how to find the Unicode range of Persian characters. Is a regex the right approach? What do you suggest, and how can I find the Unicode range for Persian?

Primordial answered 12/5, 2012 at 6:30 Comment(2)
I don't see why you would need a regex to check whether a character is within a given range.Tnt
Characters != language. For example, 'hdafhladf' is not English. And I'm sure there are some characters that are not officially classified as "Persian" but are allowed in Persian language (maybe whitespace characters?)Buckeye
P
26

Check against the range from the first letter to the last letter of the Persian alphabet. I think something like this (the + lets it match more than one character):

"^[آ-ی]+$"
Primordial answered 13/5, 2012 at 5:31 Comment(4)
This does not work with Persian characters like "خ", "پ", ... because they are not in the Arabic language! I think it is better to use: [\u0600-\u06FF]Imprest
@NabiK.A.Z. The codepoint of آ is 0622, the codepoint of ی is 06CC, and Arabic Letter Khah خ is 062E, so it is included in that range. The same is true of پ. BTW, why haven't you updated your blog for years?Chincapin
@revo, You are correct, but on the other hand [آ-ی] does not allow ء, ،, ؛, ۰-۹, ... (regex101.com/r/rM1TnT/1), whereas [\u0600-\u06FF] allows more of the required characters (regex101.com/r/rM1TnT/2). Of course, this depends on the user's needs. As for my blog: thanks, I don't have an answer for that! :-D Maybe it was a talisman!!! ;-)Imprest
@NabiK.A.Z. [آ-ی] shouldn't contain numbers if someone uses it expecting a range like [a-z] (that is, letters only). But [آ-ی] contains Arabic numbers too and has many more characters than a Persian user needs. The second range, [\u0600-\u06FF], also includes superfluous characters and symbols that we can't call Farsi. I posted an answer on this page, and more details on this topic here, which you may want to see.Chincapin
E
29

Persian characters are within the range: [\u0600-\u06FF]

Try:

Regex.IsMatch(value, @"^[\u0600-\u06FF]+$")
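
As a rough sketch of how this pattern could be wired into MVC validation, the model below applies it through [RegularExpression] and also shows a direct check. The `PersonModel` class, its `FirstName` property, the `ArabicBlockCheck` helper, and the error message are hypothetical names chosen for illustration. Keep in mind that the whole \u0600-\u06FF block also admits Arabic-only letters, digits, and punctuation.

```csharp
using System.ComponentModel.DataAnnotations;
using System.Text.RegularExpressions;

// Hypothetical model; the class and property names are illustrative only.
public class PersonModel
{
    // Same pattern as above, attached declaratively for MVC model validation.
    [RegularExpression(@"^[\u0600-\u06FF]+$",
        ErrorMessage = "Only characters from the Arabic block are allowed.")]
    public string FirstName { get; set; }
}

public static class ArabicBlockCheck
{
    // Direct check with Regex.IsMatch; this helper rejects null and empty input.
    public static bool IsArabicBlockOnly(string value)
    {
        return !string.IsNullOrEmpty(value)
            && Regex.IsMatch(value, @"^[\u0600-\u06FF]+$");
    }
}
```

The attribute treats null and empty strings as valid, so it is usually paired with [Required] when the field must be filled in.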
Essie answered 12/5, 2012 at 8:59 Comment(1)
[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF] is better. See https://mcmap.net/q/241857/-regex-for-accepting-only-persian-characters Convulsion
C
16

TL;DR

All answers that say to use \u0600-\u06FF or [آ-ی] are simply WRONG.

That is, \u0600-\u06FF contains 209 more characters than you need, and it includes numbers too!

The character sets that Farsi actually requires are as follows:

  • Use ^[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی]+$ for letters.

  • Use ^[۰۱۲۳۴۵۶۷۸۹]+$ for numbers.

  • Use [ ‬ٌ ‬ًّ ‬َ ‬ِ ‬ُ ‬ْ ‬] for vowels.

Or a union of those. You may want to add other Arabic letters like Hamza ء to your character set additionally.
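
A minimal sketch of such a union, assuming the letter and digit sets are taken exactly as listed above. The vowel marks are written as \u escapes, and the seven code points chosen (U+064B, U+064C, U+064E to U+0652) are my reading of the vowel set above, so treat them as an assumption. The `FarsiText` class and its method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class FarsiText
{
    // Letters and digits copied from the sets listed above.
    private const string Letters = "آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی";
    private const string Digits = "۰۱۲۳۴۵۶۷۸۹";
    // Assumed code points for the vowel marks (tanvin, short vowels, tashdid, sukun).
    private const string Vowels = "\u064B\u064C\u064E\u064F\u0650\u0651\u0652";

    private static readonly Regex LettersOnly =
        new Regex("^[" + Letters + "]+$");

    // Union of the three sets in a single character class.
    private static readonly Regex LettersDigitsVowels =
        new Regex("^[" + Letters + Digits + Vowels + "]+$");

    public static bool IsLettersOnly(string s) =>
        s != null && LettersOnly.IsMatch(s);

    public static bool IsFarsi(string s) =>
        s != null && LettersDigitsVowels.IsMatch(s);
}
```

With these sets, a name typed with the Arabic Yeh ي (U+064A) instead of the Farsi Yeh ی (U+06CC) is rejected, which may or may not be what you want for user input.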


This answer exists to fix a common misconception. Codepoints 0600 through 06FF do not denote the Persian / Farsi alphabet (and neither does [آ-ی]):

[\u0600-\u0605 ؐ-ؚ\u061Cـ ۖ-\u06DD ۟-ۤ ۧ ۨ ۪-ۭ ً-ٕ ٟ ٖ-ٞ ٰ ، ؍ ٫ ٬ ؛ ؞ ؟ ۔ ٭ ٪ ؉ ؊ ؈ ؎ ؏
۞ ۩ ؆ ؇ ؋ ٠۰ ١۱ ٢۲ ٣۳ ٤۴ ٥۵ ٦۶ ٧۷ ٨۸ ٩۹ ءٴ۽ آ أ ٲ ٱ ؤ إ ٳ ئ ا ٵ ٮ ب ٻ پ ڀ
ة-ث ٹ ٺ ټ ٽ ٿ ج ڃ ڄ چ ڿ ڇ ح خ ځ ڂ څ د ذ ڈ-ڐ ۮ ر ز ڑ-ڙ ۯ س ش ښ-ڜ ۺ ص ض ڝ ڞ
ۻ ط ظ ڟ ع غ ڠ ۼ ف ڡ-ڦ ٯ ق ڧ ڨ ك ک-ڴ ػ ؼ ل ڵ-ڸ م۾ ن ں-ڽ ڹ ه ھ ہ-ۃ ۿ ەۀ وۥ ٶ
ۄ-ۇ ٷ ۈ-ۋ ۏ ى يۦ ٸ ی-ێ ې ۑ ؽ-ؿ ؠ ے ۓ \u061D]

255 characters fall in this range. The Farsi alphabet has 32 letters; adding the Farsi forms of the digits brings the total to 42. If we add the vowels (originally Arabic vowels, which are rarely used in Farsi), Tanvin (ً, ٍِ ‬, ٌ ‬) and Tashdid (ّ ‬), which are both subsets of Arabic diacritics rather than Farsi, we end up with 46 characters. This means:

\u0600-\u06FF contains 209 more characters than you need!

۷ with codepoint 06F7 is the Farsi representation of the number 7, and ٧ with codepoint 0667 is the Arabic representation of the same number. ۶ is the Farsi representation of the number 6, and ٦ is the Arabic representation of the same number. All of them reside in codepoints 0600 through 06FF.

The shapes of the Persian digits four (۴), five (۵), and six (۶) are different from the shapes used in Arabic and the other numbers have different codepoints.
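
To make the code point difference concrete, here is a small sketch that tells the two digit families apart and maps Arabic digits onto their Farsi counterparts. The `PersianDigits` class and its method names are hypothetical. The only facts relied on are that the Farsi digits occupy U+06F0 to U+06F9 and the Arabic ones U+0660 to U+0669.

```csharp
using System.Text;

public static class PersianDigits
{
    // Farsi (extended Arabic-Indic) digits: U+06F0 .. U+06F9.
    public static bool IsPersianDigit(char c) => c >= '\u06F0' && c <= '\u06F9';

    // Arabic (Arabic-Indic) digits: U+0660 .. U+0669.
    public static bool IsArabicIndicDigit(char c) => c >= '\u0660' && c <= '\u0669';

    // Replaces each Arabic digit with the Farsi digit of the same value;
    // every other character is copied unchanged.
    public static string ToPersianDigits(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            sb.Append(IsArabicIndicDigit(c) ? (char)(c - '\u0660' + '\u06F0') : c);
        }
        return sb.ToString();
    }
}
```

Normalizing digits like this before validation is one possible design choice. It lets a strict Farsi-only pattern accept input typed on an Arabic keyboard layout.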

The range also contains a number of other characters that do not exist in Farsi / Persian, and nobody wants to accept them when validating a first name or surname.

[آ-ی] likewise includes 117 characters, which is far more than anyone needs for validation. You can see them all using Unicode CLDR.

Chincapin answered 25/4, 2018 at 9:1 Comment(0)
S
15
Regex.IsMatch(Text, @"^([\u0600-\u06FF]+\s?)+$")    

This covers only the standard Arabic range, but Persian also includes 4 more characters:

ژ \uFB8A
پ \u067E
چ \u0686
گ \u06AF

So you should use:

^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF]+$

If you want to match the zero-width non-joiner, you should add this too:

\u200C
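
A hedged sketch of that suggestion: the class below adds \u200C to the character class so that words written with a zero-width non-joiner pass validation. The `PersianWithZwnj` name is an assumption for illustration, and the broad \u0600-\u06FF range is kept only because this answer uses it.

```csharp
using System.Text.RegularExpressions;

public static class PersianWithZwnj
{
    // Arabic block plus U+200C (zero-width non-joiner).
    private static readonly Regex Pattern =
        new Regex(@"^[\u0600-\u06FF\u200C]+$");

    public static bool IsValid(string value) =>
        value != null && Pattern.IsMatch(value);
}
```

Note that this sketch would also accept a string made up of nothing but zero-width non-joiners. If that matters, a separate check for at least one letter is needed.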
Syllogistic answered 20/10, 2014 at 11:45 Comment(2)
Consider accepting the space and the zero-width non-joiner among the allowed characters! https://mcmap.net/q/241857/-regex-for-accepting-only-persian-characters Freeland
Codepoints 0600 through 06FF include 067E, 0686 and 06AF obviously. You don't need to repeat them.Chincapin
B
9

I use this RegExp in my program, and it works correctly. Hope it helps (the leading ^ makes it check the whole string rather than only its end):

 ^[پچجحخهعغفقثصضشسیبلاتنمکگوئدذرزطظژؤآإأءًٌٍَُِّ\s]+$
Beaconsfield answered 8/7, 2013 at 13:19 Comment(1)
The better way is to do the following: Regex.IsMatch(Text, @"^([\u0600-\u06FF]+\s?)+$")Enounce
J
0

Persian characters are within the range [\u0600-\u06FF]; add [\s] to allow whitespace.

Try:

Regex.IsMatch(Text, @"^([\u0600-\u06FF]+\s?)+$")

This pattern matches letters and whitespace characters.
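
One caveat worth sketching: the group ([\u0600-\u06FF]+\s?)+ nests one quantifier inside another, so on a long run of letters followed by a disallowed character a backtracking regex engine can try a very large number of ways to split the run before it fails. The variant below is an assumed rewrite that accepts the same strings (words separated by single whitespace characters, with an optional trailing one) without the nesting, and it adds a match timeout as a safety net. `PersianWords` and the 250 ms limit are illustrative choices, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PersianWords
{
    // Letters, then zero or more (whitespace + letters), then optional whitespace.
    private static readonly Regex Pattern = new Regex(
        @"^[\u0600-\u06FF]+(?:\s[\u0600-\u06FF]+)*\s?$",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250));

    public static bool IsValid(string text)
    {
        if (string.IsNullOrEmpty(text)) return false;
        try
        {
            return Pattern.IsMatch(text);
        }
        catch (RegexMatchTimeoutException)
        {
            // Treat input that takes too long to match as invalid.
            return false;
        }
    }
}
```

Because each whitespace character must now be followed by at least one letter, or end the string, there is only one way to split any given input into words.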

Jacquard answered 23/5, 2014 at 16:28 Comment(0)
S
0

I use these two RegExps in my program, as some letters might actually be Arabic:

^[!@#$%^&*(). ۱۲۳۴۵۶۷۸۹۰+\-پچجحخهعغفقثصضشسیبلاﺐتنمکگوئدذرزطﺐظژؤآإأءًٌٍَُِّﻢﺷﺠﺪﮑﺬﻋﻮﻂﺶﺰﺣﻣﮕﻒﺤﻻﻄﻟﭼﻫﻼﻗﺒﺗﺨﻪﻬﻓﯾﺼﺟﮔﻇﺑﭽﺌﻞﺖﺿ]+$
^[ﻢﺷﺠﺪﮑﺬﻋﻮﻂﺶﺰﺣﻣﮕﻒﺤﻻﻄﻟﭼﻫﻼﻗﺒﺗﺨﻪﻬﻓﯾﺼﺟﮔﻇﺑﭽﺌﻞﺖﺿﺎﺄﭙﻈﻏﻦﯿﻔﻤﻨﻐﻌﮏﺻﺧﻃﭘﺳﻘﻧﯽﻖﺸﮐﻠﺴﺮﺘ]+$

It might not look very good, but it works fine in my code.

Salpingectomy answered 3/9, 2023 at 6:4 Comment(0)
