EDIT: Okay, don't use any of the answers from here. I wrote them all while thinking Python regular expressions didn't have a word boundary marker, and I tried to work around this lack. Then @Mark Tolonen added a comment that Python has \b as a word boundary marker! So I posted another answer, short and simple, using \b. I'll leave this here in case anyone is interested in seeing solutions that work around the lack of \b, but I don't really expect anyone to be.
It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.
I'll do this example in English.
[ocat]
This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.
[ocat]+
Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
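A quick sketch of that partial-match problem, using only the standard re module (the sample strings are just illustrations):

```python
import re

# A bare character class with '+' matches any run of the allowed letters,
# even when that run sits inside a longer word.
pat = re.compile(r'[ocat]+')

partial = pat.findall("coach")    # ['coac']: the 'h' is simply skipped
whole = pat.findall("taco cat")   # ['taco', 'cat']

print(partial)
print(whole)
```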
Sadly, there isn't a regular expression feature for "word boundary". [EDIT: This turns out not to be correct, as I said in the first paragraph.] We need to make one of our own. There are two possible word beginnings: the start of a line, or whitespace separating our word from the previous word. Similarly, there are two possible word endings: end of a line, or whitespace separating our word from the next word.
Since we will be matching some extra stuff we don't want, we can put parentheses around the part of the pattern we do want.
To match two alternatives, we can make a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special notation to make a group whose contents we don't want to keep: (?:)
So, here is the pattern to match the beginning of a word. Start of line or white space: (?:^|\s)
Here is the pattern for end of word. White space or end of line: (?:\s|$)
Putting it all together, here is our final pattern:
(?:^|\s)([ocat]+)(?:\s|$)
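As a small sketch of how this pattern behaves with re.search(), here it is on a made-up sentence. The capturing group holds just the word, without the surrounding white space:

```python
import re

pat = re.compile(r'(?:^|\s)([ocat]+)(?:\s|$)')

# "coach" and "that" contain an 'h', so only "taco" qualifies.
m = pat.search("my coach owns that taco")
word = m.group(1)            # 'taco'

# A word that only partly consists of the allowed letters is rejected.
no_match = pat.search("coach")   # None

print(word)
print(no_match)
```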
You can build this dynamically. You don't need to hard-code the whole thing.
import re
s_pat_start = r'(?:^|\s)(['
s_pat_end = r']+)(?:\s|$)'
# Stand-in for wherever the characters really come from.
set_of_chars = "ocat"
s_pat = s_pat_start + set_of_chars + s_pat_end
pat = re.compile(s_pat)
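One caveat when building the pattern dynamically: characters such as - or ] have a special meaning inside a character class. A minimal sketch of a guard, assuming a hypothetical helper named build_word_pattern (not part of the code above) and using re.escape() to neutralize those characters:

```python
import re

def build_word_pattern(chars):
    """Hypothetical helper: compile the word pattern for the given characters.

    re.escape() puts a backslash in front of characters like '-', ']' and '^'
    so they are matched literally inside the character class.
    """
    return re.compile(r'(?:^|\s)([' + re.escape(chars) + r']+)(?:\s|$)')

plain = build_word_pattern("ocat").search("my coach owns that taco").group(1)
tricky = build_word_pattern("a-]").search("x a-] y").group(1)

print(plain)    # taco
print(tricky)   # a-]
```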
Now, this doesn't in any way check for valid words. If you have the following text:
This is sensible. This not: occo cttc
The pattern I showed you will match on occo and cttc, and those are not really words. They are strings made only of letters from [ocat] though.
So just do the same thing with Unicode strings. (If you are using Python 3.x then all strings are Unicode strings, so there you go.) Put the Tamil characters in the character class and you are good to go.
This has a confusing problem: re.findall() doesn't return all possible matches.
EDIT: Okay, I figured out what was confusing me.
What we want is for our pattern to work with re.findall() so you can collect all the words. But re.findall() only finds non-overlapping matches. In my example, re.findall() only returned ['occo'] and not ['occo', 'cttc'] as I expected... but this is because my pattern was matching the white space after occo. The match group didn't collect the white space, but it was matched all the same, and since re.findall() wants no overlap between matches, the white space was "used up" and didn't work for cttc.
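A short sketch that makes the "used up" white space visible: group(0) of each match shows everything the pattern consumed, not just the captured word.

```python
import re

text = "This is sensible. This not: occo cttc"
pat = re.compile(r'(?:^|\s)([ocat]+)(?:\s|$)')

# findall() reports only the captured group...
found = pat.findall(text)                              # ['occo']

# ...but the whole match includes the space on both sides of "occo",
# so no white space is left to introduce "cttc".
consumed = [m.group(0) for m in pat.finditer(text)]    # [' occo ']

print(found)
print(consumed)
```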
The solution is to use a feature of Python regular expressions that I have never used before: lookahead and lookbehind assertions, special syntax that says "must not be preceded by" or "must not be followed by" without consuming any characters. The sequence \S matches any non-whitespace so we could use that. But punctuation is non-whitespace, and I think we do want punctuation to delimit a word. There is also special syntax for "must be preceded by" or "must be followed by". So here is, I think, the best we can do:
Build a string that means "match when the character class string is at start of line and followed by whitespace, or when character class string is preceded by whitespace and followed by whitespace, or when character class string is preceded by whitespace and followed by end of line, or when character class string is preceded by start of line and followed by end of line".
Here is that pattern using ocat:
r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'
I'm very sorry but I really do think this is the best we can do and still work with re.findall()!
It's actually less confusing in Python code though:
import re
NMGROUP_BEGIN = r'(?:' # begin non-capturing group
NMGROUP_END = r')' # end non-capturing group
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' # begin a character class string
CCS_END = r']+)' # end a character class string
PAT_OR = r'|'
# Stand-in for wherever the characters really come from.
set_of_chars = "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
s_pat = (NMGROUP_BEGIN +
    BOL + CCS + WS_AFTER + PAT_OR +
    WS_BEFORE + CCS + WS_AFTER + PAT_OR +
    WS_BEFORE + CCS + EOL + PAT_OR +
    BOL + CCS + EOL +
    NMGROUP_END)
pat = re.compile(s_pat)
text = "This is sensible. This not: occo cttc"
pat.findall(text)
# returns: [('', 'occo', '', ''), ('', '', 'cttc', '')]
So, the crazy thing is that when the pattern contains several groups, re.findall() returns one tuple per match, with an empty string for each alternative that didn't match. So we just need to filter out the length-zero strings from our results:
import itertools as it
raw_results = pat.findall(text)
results = [s for s in it.chain(*raw_results) if s]
# results set to: ['occo', 'cttc']
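The tuples and the filtering step can also be avoided. As a sketch (an alternative formulation, not the pattern built above), the "must not be preceded by" and "must not be followed by" forms can be combined with \S: "not preceded by non-whitespace" covers both start of line and white space in one assertion. Like the four-way pattern, it still treats only white space as a delimiter, not punctuation.

```python
import re

text = "This is sensible. This not: occo cttc"
set_of_chars = "ocat"

# (?<!\S) : not preceded by a non-whitespace character
# (?!\S)  : not followed by a non-whitespace character
pat = re.compile(r'(?<!\S)[' + set_of_chars + r']+(?!\S)')

# No capturing groups, so findall() returns plain strings.
results = pat.findall(text)

print(results)   # ['occo', 'cttc']
```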
I guess it might be less confusing to just build four different patterns, run re.findall() on each, and join the results together.
EDIT: Okay, here is the code for building four patterns and trying each. I think this is an improvement.
import re
WS_BEFORE = r'(?<=\s)' # require white space before
WS_AFTER = r'(?=\s)' # require white space after
BOL = r'^' # beginning of line
EOL = r'$' # end of line
CCS_BEGIN = r'([' # begin a character class string
CCS_END = r']+)' # end a character class string
# Stand-in for wherever the characters really come from.
set_of_chars = "ocat"
CCS = CCS_BEGIN + set_of_chars + CCS_END # build up character class string pattern
lst_s_pat = [
    BOL + CCS + WS_AFTER,
    WS_BEFORE + CCS + WS_AFTER,
    WS_BEFORE + CCS + EOL,
    BOL + CCS + EOL,  # the word is the whole text; EOL keeps "coach" from matching as "coac"
]
lst_pat = [re.compile(s) for s in lst_s_pat]
text = "This is sensible. This not: occo cttc"
result = []
for pat in lst_pat:
    result.extend(pat.findall(text))
# result set to: ['occo', 'cttc']
EDIT: Okay, here is a very different approach. I like this one best.
First, we will match all words in the text. A word is defined as one or more characters that are not punctuation and are not white space.
Then, we use a filter to remove words from the above; we keep only words that are made only of the characters we want.
import re
import string
# Create a pattern that matches all characters not part of a word.
#
# Note that several punctuation characters ('-', ']', '^', '\') have a
# special meaning inside a character class, but they are valid punctuation
# that we want to match, so use re.escape() to put a backslash in front of
# them, which disables the special meaning and just matches them.
#
# Use '^' which negates all the chars following. So, a word is a series
# of characters that are all not whitespace and not punctuation.
WORD_BOUNDARY = string.whitespace + re.escape(string.punctuation)
WORD = r'[^' + WORD_BOUNDARY + r']+'
# Create a pattern that matches only the words we want.
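# Stand-in for wherever the characters really come from.
set_of_chars = "ocat"
# Build up character class string pattern. The trailing '$' is needed
# because pat.match() only anchors at the start; without it a word like
# "coach" would pass the filter on the strength of "coac".
CCS = r'[' + set_of_chars + r']+$'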
pat_word = re.compile(WORD)
pat = re.compile(CCS)
text = "This is sensible. This not: occo cttc"
# This makes it clear how we are doing this.
all_words = pat_word.findall(text)
result = [s for s in all_words if pat.match(s)]
# "lazy" generator expression that yields up good results when iterated
# May be better for very large texts.
result_genexp = (s for s in (m.group(0) for m in pat_word.finditer(text)) if pat.match(s))
# force the expression to expand out to a list
result = list(result_genexp)
# result set to: ['occo', 'cttc']
EDIT: Now I don't like any of the above solutions; please see the other answer, the one using \b, for the best solution in Python.
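The other answer is not reproduced here, so the following is only a minimal sketch of what a \b version might look like, using the same sample text; the exact pattern in that answer may differ.

```python
import re

text = "This is sensible. This not: occo cttc"
set_of_chars = "ocat"

# \b matches at a word boundary without consuming any characters,
# so adjacent words do not compete for the white space between them.
pat = re.compile(r'\b[' + set_of_chars + r']+\b')

words = pat.findall(text)

print(words)   # ['occo', 'cttc']
```

Note that \b also treats punctuation as a boundary, which is the behavior the work-arounds above were reaching for.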