Regular Expression to accept all Thai characters and English letters in python

Asked 27/7, 2016 at 14:23 Answered 30/5, 2022 at 23:18

I need to vectorize text documents in Thai (e.g Bag of Words, doc2vec).

First I want to go over each document, omitting everything except the Thai characters and English words (e.g. no punctuation, no numbers, no other special characters except apostrophe).

For English documents, I use this regular expression: [^a-zA-Z' ]|^'|'$|''

For Thai documents, I cannot find the right regular expression to use. I know that the Unicode block for Thai is u0E00–u0E7F. I tried [^ก-๛a-zA-Z' ]|^'|'$|'' and many other combinations but they don't succeed.

For example: I want

"ทรูวิชั่นส์ ประกาศถ่ายทอดสดศึกฟุตบอล พรีเมียร์ ลีก อังกฤษ ครบทุกนัดเป็นเวลา 3 ปี ตั้งแต่ฤดูกาล 2016/2017 - 2018/2019 พร้อมด้วยอีก 5 ลีกดัง อาทิ ลา ลีกา สเปน, กัลโช เซเรีย เอ อิตาลี และลีกเอิง ฝรั่งเศส ภายใต้แพ็กเกจสุดคุ้ม ทั้งผ่านมือถือ และโทรทัศน์ some, English words here! abc123"

to be:

"ทรูวิชั่นส์ ประกาศถ่ายทอดสดศึกฟุตบอล พรีเมียร์ ลีก อังกฤษ ครบทุกนัดเป็นเวลา ปี ตั้งแต่ฤดูกาล พร้อมด้วยอีก ลีกดัง อาทิ ลา ลีกา สเปน, กัลโช เซเรีย เอ อิตาลี และลีกเอิง ฝรั่งเศส ภายใต้แพ็กเกจสุดคุ้ม ทั้งผ่านมือถือ และโทรทัศน์ some English words here abc"

Kali answered 27/7, 2016 at 14:23 Comment(5)

Python 2.x or 3.x? – Frightened 27/7, 2016 at 14:26

It doesn't matter, unless I'm missing something, I think regular expressions are the same. – Kali 27/7, 2016 at 14:29

Not technically. Similar, yes, but not the same. Strings are unicode in Python3, while they are ascii in python2, and there is a separate unicode object for unicode. It's easily transposed between the two versions, but it helps make it clear for people who read the question (and future answer) in the future. – Frightened 27/7, 2016 at 14:31

If you do use regex in Python, I would suggest using the unicode codes, instead of the thai characters : [^\u0E00-\u0E7Fa-zA-Z' ]|^'|'$|'' for your regex – Brumley 28/7, 2016 at 8:14

There is a package for this... https://mcmap.net/q/1071965/-regular-expression-to-accept-all-thai-characters-and-english-letters-in-python – Yecies 30/5, 2022 at 23:19

I'll be using some lists to do what I need.

First, let's create the pattern :

pattern = re.compile(r"[^\u0E00-\u0E7Fa-zA-Z' ]|^'|'$|''")

I'll use a string named test_string, containing your example :

test_string="ทรูวิชั่นส์ ประกาศถ่ายทอดสดศึกฟุตบอล พรีเมียร์ ลีก อังกฤษ ครบทุกนัดเป็นเวลา 3 ปี ตั้งแต่ฤดูกาล 2016/2017 - 2018/2019 พร้อมด้วยอีก 5 ลีกดัง อาทิ ลา ลีกา สเปน, กัลโช เซเรีย เอ อิตาลี และลีกเอิง ฝรั่งเศส ภายใต้แพ็กเกจสุดคุ้ม ทั้งผ่านมือถือ และโทรทัศน์ some, English words here! abc123"

First, let's get the characters to remove, in a list :

char_to_remove = re.findall(pattern, test_string)

Then, let's create a list made of the character from our original string, without these characters :

list_with_char_removed = [char for char in test_string if not char in char_to_remove]

We transform this list into a string, and we're done.

result_string = ''.join(list_with_char_removed)

Result is :

'ทรูวิชั่นส์ ประกาศถ่ายทอดสดศึกฟุตบอล พรีเมียร์ ลีก อังกฤษ ครบทุกนัดเป็นเวลา ปี ตั้งแต่ฤดูกาล พร้อมด้วยอีก ลีกดัง อาทิ ลา ลีกา สเปน กัลโช เซเรีย เอ อิตาลี และลีกเอิง ฝรั่งเศส ภายใต้แพ็กเกจสุดคุ้ม ทั้งผ่านมือถือ และโทรทัศน์ some English words here abc'

If you have any cleaner way to do any of the steps/any questions, do not hesitate !

Brumley answered 27/7, 2016 at 15:31 Comment(1)

Thank you! It works! I used (python 2.7): regexp_thai = re.compile(u"[^\u0E00-\u0E7Fa-zA-Z' ]|^'|'$|''") ret = regexp_thai.sub("", text) – Kali 31/7, 2016 at 6:52

In Python 3,

s = "ทรูวิชั่นส์ ประกาศถ่ายทอดสดศึกฟุตบอล พรีเมียร์ ลีก อังกฤษ ครบทุกนัดเป็นเวลา 3 ปี ตั้งแต่ฤดูกาล 2016/2017 - 2018/2019 พร้อมด้วยอีก 5 ลีกดัง อาทิ ลา ลีกา สเปน, กัลโช เซเรีย เอ อิตาลี และลีกเอิง ฝรั่งเศส ภายใต้แพ็กเกจสุดคุ้ม ทั้งผ่านมือถือ และโทรทัศน์ some, English words here! abc123"
pattern = re.compile(r"(?:[^\d\W]+)|\s")
for each in pattern.findall(s): print(each, end="")

Outputs this:

ทรวชนส ประกาศถายทอดสดศกฟตบอล พรเมยร ลก องกฤษ ครบทกนดเปนเวลา  ป ตงแตฤดกาล    พรอมดวยอก  ลกดง อาท ลา ลกา สเปน กลโช เซเรย เอ อตาล และลกเอง ฝรงเศส ภายใตแพกเกจสดคม ทงผานมอถอ และโทรทศน some English words here

Accents are being removed, so this is not a perfect answer. I'm currently looking around to see why this is happening.

EDIT: Using the character range from HolyDanna's answer, you can keep the accents. Interesting that just using word does not keep accents (this is probably due to how unicode code points add accents as another code point after the accented character, but seems like a bug). It also has the side effect of removing characters from other languages. Just replace the compile line HolyDanna's:

pattern = re.compile(r"[\u0E00-\u0E7Fa-zA-Z' ]")

You can get rid of the apostrophe (etc) if you don't want it.

Frightened answered 27/7, 2016 at 14:46 Comment(6)

I'm looking for something not specific to my example but that will work for every Thai text document that may include special characters that are not included in this example. Also, I want numbers to be removed from words so for example abc123 --> abc – Kali 27/7, 2016 at 14:52

This will remove any special characters in any language in any situation. The only thing that is allowed is "word" characters that are not also numbers, and whitespace. It is not specific to your example. – Frightened 27/7, 2016 at 14:53

Thank you. But what if my document has characters in other languages? – Kali 27/7, 2016 at 15:3

This ia an interesting question. [A-Za-z]+ matches a word written in English (explicitly excluding accented letters and suchlike). What I don't know is whether the larger sets of unicodes which make up words in Thai or other non-European languages can be specified as a range, or whether they are scattered all over the place in Unicode. In the latter case is there any way other than building up the RE as a string incorporating variables? i.e."["+thai_alphabet+"]+" – Possum 27/7, 2016 at 15:30

@ShaniShalgi Are you saying specifically that you want to remove other language characters? Or that you want to permit other language characters? – Frightened 27/7, 2016 at 15:36

remove all other language characters (except English in my specific case, but let's first start with just keeping Thai). – Kali 31/7, 2016 at 6:38

Sadly, there are not many regular expression libraries with good Unicode support, and Python's re library is one of them. Oniguruma has proper Unicode support and I believe it has Python bindings, and Perl's built-in regular expressions have good Unicode support.

I normally don't suggest that people switch languages, but in this case, you will save a lot of trouble by using Perl (and for the record, I have the gold Python badge, and haven't touched Perl in the past decade!). Here is a taste of how simple it is (it should be the same in Oniguruma, which again, I think has Python bindings):

[^\p{Latin}\p{Thai}]+

Here is Perl example code:

#!/usr/bin/perl -w
use utf8;
$_ = "ทรูวิชั่นส์ ประ...abc123";
s/[^\p{Latin}\p{Thai}]+/ /g;
print;
print "\n";

Here is the output:

ทรูวิชั่นส์ ประกาศถ่ายทอดสดศึกฟุตบอล พรีเมียร์ ลีก อังกฤษ ครบทุกนัดเป็นเวลา ปี ตั้งแต่ฤดูกาล พร้อมด้วยอีก ลีกดัง อาทิ ลา ลีกา สเปน กัลโช เซเรีย เอ อิตาลี และลีกเอิง ฝรั่งเศส ภายใต้แพ็กเกจสุดคุ้ม ทั้งผ่านมือถือ และโทรทัศน์ some English words here abc

Dillydally answered 27/7, 2016 at 15:35 Comment(7)

Weird to post a Perl answer to a Python question, but this should be straightforward in Python 3. – Cyclorama 18/7, 2019 at 2:57

@tripleee: If it’s straightforward in Python 3, how do you do it? – Dillydally 18/7, 2019 at 5:34

I went looking for a duplicate but couldn't quickly find one. I'm pretty sure there are related questions about this so I didn't pursue reinventing the solution. https://mcmap.net/q/386565/-how-can-i-check-if-a-python-unicode-string-contains-non-western-letters is pretty close. – Cyclorama 18/7, 2019 at 7:20

@tripleee: Actually, that's a really bad answer you've linked. It uses string searching inside Unicode character names which is a very hackneyed and inexact way of getting what you want, which is the Unicode script property. This is a good demonstration of why Perl's Unicode support is miles ahead of Python's. – Dillydally 18/7, 2019 at 18:28

Thanks for the feedback - I agree that that particular answer is not particularly good. But Python does have a unicodedata module which allows you to do things like this, and regex support for some Unicode properties. I'll try to find a better answer to link to, maybe tomorrow – Cyclorama 18/7, 2019 at 18:31

@tripleee: You can see the documentation here docs.python.org/3/library/unicodedata.html — the properties we are interested in, the script property, are missing from this module. You would need to use a third-party library like PyICU, or you could use a more capable regex engine like Oniguruma (which has Python bindings). – Dillydally 18/7, 2019 at 18:34

Thanks for the hint, you are correct of course. My vote goes to the regex module but that's mainly because I use it for other reasons anyway. See further #1833393 – Cyclorama 18/7, 2019 at 19:1

In Java you can match a combination of Thai en English with: ^[\\p{L}\\p{javaUnicodeIdentifierPart}\\p{Blank}\\p{P}]*$

Breakdown:

\\p{L} is an 'normal' letter
\\p{javaUnicodeIdentifierPart} matches a Thai letter
\\p{Blank} matches a space character
\\p{P} matches punctuation.

I'm not an expert in the Thai language (other than that I recognize it), but without the punctuation-match the string does not match.

Creech answered 17/10, 2021 at 4:23 Comment(0)

The simplest solution is to use the regex package.
Regex package is backwards-compatible to re.
pip install regex

import regex
m = regex.match('[\p{Latin}\p{Thai}]+', 'ทรูวิชั่นส์asdf')
m.captures()  # == ['ทรูวิชั่นส์asdf']

Yecies answered 30/5, 2022 at 23:18 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags