Regex for accepting only Persian characters

Asked 21/3, 2014 at 17:7 Answered 6/6 at 17:13

I'm working on a form where one of its custom validators should only accept Persian characters. I used the following code:

var myregex = new Regex(@"^[\u0600-\u06FF]+$");
if (myregex.IsMatch(mytextBox.Text))
{
    args.IsValid = true;
}
else
{
    args.IsValid = false;
}

However, it seems that it can only detect Arabic characters, as it doesn't cover all Persian characters (it lacks these four: گ,چ,پ,ژ ).

Is there a way to solve this problem?

Trump answered 21/3, 2014 at 17:7 Comment(7)

I can't comment on the persian characters, but if your custom validator is simply doing a regex check, then there is an <asp:RegularExpressionValidator> that will save you a bit of time – Heckman 21/3, 2014 at 17:11

but <asp:RegularExpressionValidator> doesn't check for persian character – Trump 21/3, 2014 at 17:14

You've just complained that the regex isn't working... and I've said I cannot help with that (it's way out of my experience). If you get the regex working, then <asp:RegularExpressionValidator> will work with it – Heckman 21/3, 2014 at 17:18

really?...now I got the answer in regex for using <asp:RegularExpressionValidator> shall I just copy it on the validation expression part to work? – Trump 21/3, 2014 at 17:28

Unless I'm missing something obvious, yes, convert your custom validator to a <asp:RegularExpressionValidator> and set ValidationExpression="^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF]+$" (as taken from Andrei's marked answer) – Heckman 21/3, 2014 at 17:30

this is a good range for persian characters but not clean one. take a look at this: utf8-chartable.de/unicode-utf8-table.pl?start=1536 – Transponder 31/8, 2014 at 11:20

Please do this instead: args.IsValid = myregex.IsMatch(mytextBox.Text); – Triggerfish 2/5, 2018 at 10:30

145

TL;DR

The character sets that MUST be used for Farsi are as follows:

Use ^[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی]+$ for letters, or use code points, depending on your regex flavor (not all engines support \uXXXX notation):
```
^[\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC]+$
```
Use ^[۰۱۲۳۴۵۶۷۸۹]+$ for numbers or, depending on your regex flavor:
```
^[\u06F0-\u06F9]+$
```
Use [ ‬ٌ ‬ًّ ‬َ ‬ِ ‬ُ ‬ْ ‬] for vowels or, depending on your regex flavor:
```
[\u202C\u064B\u064C\u064E-\u0652]
```

Or use a combination of these sets. You may also want to add other Arabic letters, such as Hamza (ء), to your character set.
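As a minimal sketch of combining these sets in JavaScript: the names `LETTERS`, `DIGITS`, and `isPersianName` are made up for illustration, and allowing a space and the zero-width non-joiner (U+200C) in a name is an assumption, not something this answer prescribes.

```javascript
// Character-class bodies copied from the code-point lists above.
const LETTERS =
  '\\u0622\\u0627\\u0628\\u067E\\u062A-\\u062C\\u0686\\u062D-\\u0632\\u0698' +
  '\\u0633-\\u063A\\u0641\\u0642\\u06A9\\u06AF\\u0644-\\u0648\\u06CC';
const DIGITS = '\\u06F0-\\u06F9';

// Assumption: a name may also contain spaces and the zero-width non-joiner (U+200C).
const persianName = new RegExp('^[' + LETTERS + ' \\u200C]+$');
const persianLettersAndDigits = new RegExp('^[' + LETTERS + DIGITS + ']+$');

// Hypothetical helper, not part of any library.
function isPersianName(text) {
  return persianName.test(text);
}

console.log(isPersianName('\u0633\u0627\u0631\u0627'));          // "سارا" -> true
console.log(isPersianName('Sara'));                              // false
console.log(persianLettersAndDigits.test('\u06F1\u06F2\u06F3')); // "۱۲۳" -> true
```

Building the pattern from strings keeps each set reusable, so digits or diacritics can be added only where a field really needs them.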

Why are `[\u0600-\u06FF]` and `[آ-ی]` both wrong?

Although `\u0600-\u06FF` includes:

گ with codepoint 06AF
چ with codepoint 0686
پ with codepoint 067E
ژ with codepoint 0698

as well, all answers that suggest `[\u0600-\u06FF]` or `[آ-ی]` are simply WRONG.

That is, \u0600-\u06FF contains 209 more characters than you need, and it includes digits too!

Whole story

This answer exists to fix a common misconception. Code points 0600 through 06FF do not denote the Persian / Farsi alphabet (and neither does [آ-ی]):

[\u0600-\u0605 ؐ-ؚ\u061Cـ ۖ-\u06DD ۟-ۤ ۧ ۨ ۪-ۭ ً-ٕ ٟ ٖ-ٞ ٰ ، ؍ ٫ ٬ ؛ ؞ ؟ ۔ ٭ ٪ ؉ ؊ ؈ ؎ ؏
۞ ۩ ؆ ؇ ؋ ٠۰ ١۱ ٢۲ ٣۳ ٤۴ ٥۵ ٦۶ ٧۷ ٨۸ ٩۹ ءٴ۽ آ أ ٲ ٱ ؤ إ ٳ ئ ا ٵ ٮ ب ٻ پ ڀ
ة-ث ٹ ٺ ټ ٽ ٿ ج ڃ ڄ چ ڿ ڇ ح خ ځ ڂ څ د ذ ڈ-ڐ ۮ ر ز ڑ-ڙ ۯ س ش ښ-ڜ ۺ ص ض ڝ ڞ
ۻ ط ظ ڟ ع غ ڠ ۼ ف ڡ-ڦ ٯ ق ڧ ڨ ك ک-ڴ ػ ؼ ل ڵ-ڸ م۾ ن ں-ڽ ڹ ه ھ ہ-ۃ ۿ ەۀ وۥ ٶ
ۄ-ۇ ٷ ۈ-ۋ ۏ ى يۦ ٸ ی-ێ ې ۑ ؽ-ؿ ؠ ے ۓ \u061D]

255 characters fall under the Arabic block (0600–06FF). The Farsi alphabet has 32 letters; adding the Farsi forms of the ten digits brings the total to 42. If we add the vowels (originally Arabic vowels, rarely used in Farsi), excluding Tanvin (ً, ٍِ ‬, ٌ ‬) and Tashdid (ّ ‬), which both belong to Arabic diacritics rather than Farsi, we end up with 46 characters. This means \u0600-\u06FF contains 209 more characters than you need!

۷ (code point 06F7) is the Farsi representation of the number 7, while ٧ (code point 0667) is the Arabic representation of the same number. Likewise, ۶ is the Farsi representation of the number 6 and ٦ is the Arabic one. All of them reside in the 0600 through 06FF range.

The shapes of the Persian digits four (۴), five (۵), and six (۶) differ from the shapes used in Arabic, and the remaining digits, although they look alike, have different code points.
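A short JavaScript sketch of the point about digits (the variable names are illustrative only): the Persian and Arabic forms of 7 are distinct code points, so a digit class limited to 06F0 through 06F9 accepts one and rejects the other, while the whole-block range accepts both.

```javascript
// Persian (Extended Arabic-Indic) and Arabic-Indic digits are separate code points.
const persianSeven = '\u06F7'; // ۷
const arabicSeven = '\u0667';  // ٧

const persianDigitsOnly = /^[\u06F0-\u06F9]+$/;
const wholeArabicBlock = /^[\u0600-\u06FF]+$/;

console.log(persianSeven.codePointAt(0).toString(16)); // "6f7"
console.log(arabicSeven.codePointAt(0).toString(16));  // "667"
console.log(persianDigitsOnly.test(persianSeven));     // true
console.log(persianDigitsOnly.test(arabicSeven));      // false
console.log(wholeArabicBlock.test(arabicSeven));       // true: the block range cannot tell them apart
```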

The block also contains a number of other characters that do not exist in Farsi / Persian, and nobody wants to accept them when validating a first name or surname.

[آ-ی] also includes 117 characters, which is far more than anyone needs for validation. You can see them all using Unicode CLDR.
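To illustrate, here is a small JavaScript sketch. It assumes an engine that treats [آ-ی] as the plain code-point range U+0622 to U+06CC, as JavaScript does; the names `loose`, `strict`, and `unwanted` are made up for the example.

```javascript
// In JavaScript, [آ-ی] is simply the code-point range U+0622..U+06CC.
const loose = /^[\u0622-\u06CC]+$/;
// The explicit letter list from the TL;DR section.
const strict = /^[\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC]+$/;

// Characters that are not Persian letters but sit inside the loose range.
const unwanted = {
  'Arabic kaf': '\u0643',
  'Arabic yeh': '\u064A',
  'teh marbuta': '\u0629',
  'Arabic-Indic digit zero': '\u0660',
};

// Each line prints: <name> true false
for (const [name, ch] of Object.entries(unwanted)) {
  console.log(name, loose.test(ch), strict.test(ch));
}
```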

Zounds answered 25/4, 2018 at 9:30 Comment(7)

Hi revo. Hey, I was reading your Wikipedia page on Persian. I just want to note that this [\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC] contains all ARABIC LETTER Unicode names. I guess it's hard to distinguish, and maybe that's why there is no Unicode script for Persian/Farsi. – Sot 27/4, 2018 at 20:9

@sln Yes, we don't have a separate Farsi block or Persian / Farsi named letters in the Unicode table (actually we do, but it is named Old Persian and holds cuneiform characters of historical interest). That's why so many wrong answers exist. By the way, I'm a native Persian. – Zounds 27/4, 2018 at 20:15

Yeah, then you should know. I did a query on Old Persian. It worked out to this : [\p{Block=Old_Persian}\p{Script=Old_Persian}\p{Script_Extensions=Old_Persian}](?<!\p{General_Category=Unassigned}) which is utf-32 [\x{103A0}-\x{103C3}\x{103C8}-\x{103D5}] and utf-16 \uD800[\uDFA0-\uDFC3\uDFC8-\uDFD5] – Sot 27/4, 2018 at 20:34

Great work. I'm unaware about Persian language (but not about Persian poetry), but take care of the different forms of characters (if any of course, as you can see in other languages) with single codepoints and codepoints with combining characters. – Agglutinin 2/5, 2018 at 23:33

@CasimiretHippolyte Thank you. I'm glad to hear you know about our poetry, and you made a valid point, but we don't have any diacritical marks. Arabic does, and that's one of the main reasons for this answer to exist. – Zounds 3/5, 2018 at 9:20

Good job pointing out that while the Arabic and Farsi alphabets are similar, they are not identical. This is a mistake I have seen before on this site. – Blotchy 15/3, 2019 at 11:48

@Mehrdad88sh not better than you, my friend. – Zounds 9/10, 2019 at 4:41

What you currently have in your regex is the standard Arabic symbols range. For additional characters, you need to add them to the regex separately. Here are their codes:

ژ \u0698
پ \u067E
چ \u0686
گ \u06AF

So all in all you should have

^[\u0600-\u06FF\u0698\u067E\u0686\u06AF]+$

Glint answered 21/3, 2014 at 17:19 Comment(4)

tnx... is there a way to add "space" to these combination too? – Trump 21/3, 2014 at 17:31

@sara.y, of course, just add it to the end of the character list, like this: ^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF ]+$ – Glint 21/3, 2014 at 17:32

Doesn't \u0600-\u06FF include \u0698 or the other additional codepoints? – Zounds 25/4, 2018 at 9:21

@Zounds you are right! – Natiha 17/5, 2022 at 15:34

In addition to the accepted answer (https://mcmap.net/q/241857/-regex-for-accepting-only-persian-characters), we should also consider the zero-width non-joiner (نیم فاصله, "half-space", in Persian). Unfortunately, two characters are used for it: one is standard, and the other is non-standard but widely used:

\u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
\u200F : Right-to-left mark (http://unicode-table.com/en/#200F)

So the final regex can be:

^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+$

If you want to consider "space", you can use this :

^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F ]+$

You can test it in JavaScript like this:

/^[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F ]+$/.test('ای‌پسر تو چه می‌دانی؟')

Schaffner answered 19/1, 2016 at 5:51 Comment(1)

this regex accepts Persian digits, could you please provide a regex to exclude digits which only allows Arabic and Persian letters (i.e I want a regex to accept first names) – Overcast 2/7, 2017 at 8:55

Attention: persianRex is written in JavaScript; however, you can read the source code and copy the characters from it.

Detecting Persian characters is a tricky task because of the variety of keyboard layouts and operating systems. I faced the same challenge some time ago and decided to write an open-source library to fix this issue.

You can fix your issue like this: persianRex.text.test(yourInput); //returns true or false

here is the full documentation: http://imanmh.github.io/persianRex/

Rowlock answered 28/1, 2016 at 7:40 Comment(0)

Farsi, Dari and Tajik are out of my bailiwick, but a little rummaging through the Unicode code charts tells me that Arabic covers 5 Unicode code blocks:

Arabic: http://www.unicode.org/charts/PDF/U0600.pdf
Arabic Supplement: http://www.unicode.org/charts/PDF/U0750.pdf
Arabic Extended-A: http://www.unicode.org/charts/PDF/U08A0.pdf
Arabic Presentation Forms-A: http://www.unicode.org/charts/PDF/UFB50.pdf
Arabic Presentation Forms-B: http://www.unicode.org/charts/PDF/UFE70.pdf

You can get at them (at least some of them) in regular expressions using named blocks instead of explicit code point ranges: \p{IsArabicPresentationForms-A} will give you the 4th Unicode block in the preceding list.

You might also read Persian Computing in Unicode: http://behdad.org/download/Publications/persiancomputing/a007.pdf

Extremism answered 21/3, 2014 at 17:29 Comment(0)

The named blocks, e.g \p{Arabic} cover the entire Arabic script, not just the Persian characters.

The presentation forms (u+FB50-u+FDFF) should not be used in text, and should be converted to the standard range (u+0600-u+06FF).
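Both points can be sketched in JavaScript, under two assumptions: the engine supports Unicode property escapes (where the script is written `\p{Script=Arabic}` and requires the `u` flag), and NFKC normalization is an acceptable way to fold presentation forms back into the standard range. The variable names are illustrative.

```javascript
// \p{Script=Arabic} matches the whole Arabic script, not just Persian letters.
const arabicScript = /^\p{Script=Arabic}+$/u;
console.log(arabicScript.test('\u067E')); // Persian peh -> true
console.log(arabicScript.test('\u0643')); // Arabic kaf  -> true as well

// A presentation form can be folded into the standard range with NFKC.
const jehIsolated = '\uFB8A'; // ARABIC LETTER JEH ISOLATED FORM
const normalized = jehIsolated.normalize('NFKC');
console.log(normalized === '\u0698'); // true: standard ARABIC LETTER JEH
```

Normalizing input first means the validation pattern only has to deal with the standard code points.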

In order to only cover Persian we need the following:

The subset of Farsi characters out of the standard Arabic range, i.e (U+0621-U+0624, U+0626-U+063A, U+0641-U+0642, U+0644-U+0648)
The standard Arabic diacritics (U+064B-U+0652)
The 2 additional diacritics (U+0654, U+0670)
The 4 extra Farsi characters "گ چ پ ژ" (U+067E, U+0686, U+0698, U+06AF)
U+06A9: Persian Kaf (formally: "Arabic Letter Keheh"; different notation from Arabic Kaf)
U+06CC: Farsi Yeh (a different notation from the Arabic Yeh)
U+200C: Zero-Width-Non-Joiner

So, the resulting regexp would be:

^[\u0621-\u0624\u0626-\u063A\u0641-\u0642\u0644-\u0648\u064B-\u0652\u067E\u0686\u0698\u06AF\u06CC\u06A9\u0654\u0670\u200C]+$
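As a rough JavaScript sketch, the list above can be assembled piece by piece so that each part stays traceable to its list item. The `parts` array and the `persianText` name are illustrative only.

```javascript
// Each entry mirrors one item of the list above.
const parts = [
  '\\u0621-\\u0624', '\\u0626-\\u063A', '\\u0641-\\u0642', '\\u0644-\\u0648', // letters shared with Arabic
  '\\u064B-\\u0652',              // standard Arabic diacritics
  '\\u0654\\u0670',               // the 2 additional diacritics
  '\\u067E\\u0686\\u0698\\u06AF', // peh, tcheh, jeh, gaf
  '\\u06A9',                      // keheh (Persian kaf)
  '\\u06CC',                      // Farsi yeh
  '\\u200C',                      // zero-width non-joiner
];
const persianText = new RegExp('^[' + parts.join('') + ']+$');

console.log(persianText.test('\u0645\u06CC\u200C\u062F\u0627\u0646\u06CC')); // "می‌دانی" -> true
console.log(persianText.test('\u0643')); // Arabic kaf -> false
console.log(persianText.test('hello'));  // false
```

Note that the set contains no space, so multi-word input needs a space added to the class.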

See also the exemplar characters for Persian listed here:

http://unicode.org/cldr/trac/browser/trunk/common/main/fa.xml

Affidavit answered 11/7, 2017 at 21:37 Comment(0)

I'm not sure regex is the way to do this. The problem is not specific to Persian; it applies equally to Arabic, Chinese, or Russian text. Perhaps you could check whether each character exists in your code page: if it is not in the code page, I doubt the user can enter it with an input device.

 var encoding = Encoding.GetEncoding(1256);
 var expect = "گ چ پ ژ";
 var actual= encoding.GetBytes("گ چ پ ژ");
 Assert.AreEqual(encoding.GetString(actual),expect);

The test tests a round trip where input should match the string to bytes and back. The link shows those code pages supported.

Krenek answered 28/4, 2018 at 13:50 Comment(0)

I searched a lot for a way to validate Persian phone numbers written with Persian digits like ۱۲۳۴ using regex in Laravel, but found no suitable answer. So instead of validating Persian digits with a regex, I decided to convert them to English digits and validate the result myself. It helped me a lot; I hope it helps:

// Map Persian digits to their ASCII equivalents before validating.
$persianDigits = ['۰', '۱', '۲', '۳', '۴', '۵', '۶', '۷', '۸', '۹'];
$asciiDigits   = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
$mobile = str_replace($persianDigits, $asciiDigits, $mobile);

// ctype_digit() rejects signs, decimals, and exponents that is_numeric() would accept.
if (!ctype_digit($mobile) || strlen($mobile) != 11) {
    // Error message: "The mobile number must be numeric and 11 digits long."
    return redirect()->back()->withErrors('فرمت شماره موبایل باید عدد و ۱۱ رقم باشد');
}
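The same convert-then-validate idea can be sketched in JavaScript. The helpers `toAsciiDigits` and `isValidMobile` are hypothetical names, and the 11-digit rule is taken from the snippet above.

```javascript
// Hypothetical helper: convert Persian digits (U+06F0..U+06F9) to ASCII digits.
function toAsciiDigits(text) {
  return text.replace(/[\u06F0-\u06F9]/g, (d) => String(d.charCodeAt(0) - 0x06F0));
}

// Assumption taken from the snippet above: a mobile number is exactly 11 digits.
function isValidMobile(input) {
  return /^[0-9]{11}$/.test(toAsciiDigits(input));
}

console.log(toAsciiDigits('\u06F0\u06F9\u06F1\u06F2')); // "0912"
console.log(isValidMobile('\u06F0\u06F9\u06F1\u06F2\u06F3\u06F4\u06F5\u06F6\u06F7\u06F8\u06F9')); // true
console.log(isValidMobile('0912')); // false: too short
```

Because the offset from U+06F0 is the digit's value, no lookup table is needed.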

Misbeliever answered 8/6, 2021 at 6:23 Comment(0)

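This regex accepts the Arabic block (which covers the Persian letters from 'آ' to 'ی' as well as the Persian digits), the zero-width non-joiner, English digits, whitespace, and a few punctuation characters such as `.`, `-`, `،`, `,`, `_`, and `/`. The hyphen is placed last in the character class so that it is read as a literal rather than as a range operator.

/^[\u0600-\u06FF\u200C0-9.،,_\/\s-]+$/

This is a simple example:

let persianRegex = /^[\u0600-\u06FF\u200C0-9.،,_\/\s-]+$/;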


let testString1 =
  " گژپژ،و .که یا - تو عهدی بستم ۱۲۴۵ سلام من دوست خوب تو هست3م"; 
let testString2 = "hello123";

console.log(persianRegex.test(testString1)); // Should return true
console.log(persianRegex.test(testString2)); // Should return false

Xerosere answered 6/6 at 17:13 Comment(0)

-1

just add this code to your TextField or TextFormField

for example:

inputFormatters: [FilteringTextInputFormatter.allow(RegExp("[ آ-ی]"))],

To allow spaces, just include a space in the RegExp character class.

In short, you can have all the Persian letters without any problem, and a plain space takes care of the spacing between words.

Geoffreygeoffry answered 11/7, 2022 at 9:24 Comment(0)
