Regex to get list of all words with specific letters (unicode graphemes)

I'm writing a Python script for a FOSS language learning initiative. Let's say I have an XML file (or to keep it simple, a Python list) with a list of words in a particular language (in my case, the words are in Tamil, which uses a Brahmi-based Indic script).

Given a set of letters, I need to extract the subset of those words that can be spelled using only those letters.

An English example:

words = ["cat", "dog", "tack", "coat"] 

get_words(['o', 'c', 'a', 't']) should return ["cat", "coat"]
get_words(['k', 'c', 't', 'a']) should return ["cat", "tack"]

A Tamil example:

words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]

get_words([u'ம', u'ப', u'ட', u'ம்']) should return [u"மடம்", u"படம்"]
get_words([u'ப', u'ம்', u'ட']) should return [u"படம்"]

Neither the order in which the words are returned nor the order in which the letters are entered should make a difference.

Although I understand the difference between unicode codepoints and graphemes, I'm not sure how they're handled in regular expressions.

In this case, I would want to match only those words that are made up of the specific graphemes in the input list, and nothing else (i.e. the markings that follow a letter should only follow that letter, but the graphemes themselves can occur in any order).
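
To make the codepoint/grapheme distinction concrete, here is a minimal sketch (not part of the original question) that lists the codepoints of one of the words above using only the standard library. The helper name show_codepoints is made up for illustration, and the word is written with escapes so that its codepoints are unambiguous.

```python
# -*- coding: utf-8 -*-
import unicodedata

def show_codepoints(text):
    """Return (codepoint, Unicode name) pairs; hypothetical helper for illustration."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]

# The first word of the Tamil list above: MA, RA, MA, VIRAMA
word = u"\u0bae\u0bb0\u0bae\u0bcd"

for code, name in show_codepoints(word):
    print(code, name)

# Three user-perceived letters, but four codepoints: the last letter is
# MA followed by a combining sign (the virama).
print(len(word))  # 4
```

A regex that treats every codepoint independently therefore cannot tell whether the virama belongs to the letter it was entered with.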

Cretan answered 27/1, 2013 at 3:17 Comment(0)

To support characters that can span several Unicode codepoints:

# -*- coding: utf-8 -*-
import re
import unicodedata
from functools import partial

NFKD = partial(unicodedata.normalize, 'NFKD')

def match(word, letters):
    word, letters = NFKD(word), map(NFKD, letters) # normalize
    return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)

words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words = lambda letters: [w for w in words if match(w, letters)]

print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
# -> மடம் படம்
print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
# -> படம்

It assumes that the same character can be used zero or more times in a word.
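
A sketch of why the pattern joins the letters with | inside (?:...) instead of putting them in a character class [...]. A character class works on single codepoints, so a combining sign supplied with one letter may end up attached to any other letter. The counter-example word below is an assumption added for illustration and does not come from the question.

```python
import re

# The four letters from the question: MA, PA, TTA, and MA + virama
letters = [u"\u0bae", u"\u0baa", u"\u0b9f", u"\u0bae\u0bcd"]

# Counter-example (assumed for illustration): PA, TTA + virama, TTA, MA + virama.
# It contains TTA + virama, which is NOT one of the given letters.
word = u"\u0baa\u0b9f\u0bcd\u0b9f\u0bae\u0bcd"

# Alternation: each alternative is a whole letter (possibly several codepoints).
# (?:...) groups without capturing, + repeats the group, $ anchors at the end.
alternation = re.compile(r"(?:%s)+$" % "|".join(map(re.escape, letters)))

# Character class: every codepoint is an independent member of the set,
# so the virama is accepted after any letter.
char_class = re.compile(r"[%s]+$" % "".join(map(re.escape, letters)))

print(bool(alternation.match(word)))  # False
print(bool(char_class.match(word)))   # True (a false positive)
```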

If you want only words that contain exactly the given characters:

import regex # $ pip install regex

chars = regex.compile(r"\X").findall # get all characters

def match(word, letters):
    return sorted(chars(word)) == sorted(letters)

words = ["cat", "dog", "tack", "coat"]

print(" ".join(get_words(['o', 'c', 'a', 't'])))
# -> coat
print(" ".join(get_words(['k', 'c', 't', 'a'])))
# -> tack

Note: there is no cat in the output in this case because cat doesn't use all given characters.
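
If installing the third-party regex module is not an option, a rough approximation of \X can be sketched with the standard library by attaching every combining mark (Unicode category M*) to the preceding base character. This is a simplification under that assumption, not the full grapheme cluster algorithm that \X implements, and the names simple_clusters and match_exact are hypothetical.

```python
import unicodedata

def simple_clusters(text):
    """Split text into base characters plus their trailing combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # mark: attach to the previous cluster
        else:
            clusters.append(ch)     # base character: start a new cluster
    return clusters

def match_exact(word, letters):
    # Same idea as match() above: the word must use exactly the given letters.
    return sorted(simple_clusters(word)) == sorted(letters)

# PA + vowel sign AA, TTA, MA + virama
word = u"\u0baa\u0bbe\u0b9f\u0bae\u0bcd"
print(len(word), len(simple_clusters(word)))  # 5 3
```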


What does it mean to normalize? And could you please explain the syntax of the re.match() regex?

>>> import re
>>> re.escape('.')
'\\.'
>>> c = u'\u00c7'
>>> cc = u'\u0043\u0327'
>>> cc == c
False
>>> re.match(r'%s$' % (c,), cc) # do not match
>>> import unicodedata
>>> norm = lambda s: unicodedata.normalize('NFKD', s)
>>> re.match(r'%s$' % (norm(c),), norm(cc)) # do match
<_sre.SRE_Match object at 0x1364648>
>>> print c, cc
Ç Ç

Without normalization c and cc do not match. The characters are from the unicodedata.normalize() docs.

Hyams answered 28/1, 2013 at 6:23 Comment(4)
This answer looks great. I was wondering if you could provide a bit of explanation on the match() function. What does it mean to normalize? And could you please explain the syntax of the re.match() regex? Thanks again. - Cretan
re.match() is well documented in the Python documentation: docs.python.org/3.3/library/re.html#search-vs-match - Alexandria
unicodedata.normalize() is documented here: docs.python.org/3.3/library/unicodedata.html#module-unicodedata - Alexandria
@AshwinBalamohan: I've added an example for unicodedata.normalize() to explain why it might be needed. What part of the regex is not clear: (?:), |, +, $? - Hyams

EDIT: Okay, don't use any of the answers from here. I wrote them all while thinking Python regular expressions didn't have a word boundary marker, and I tried to work around this lack. Then @Mark Tolonen added a comment that Python has \b as a word boundary marker! So I posted another answer, short and simple, using \b. I'll leave this here in case anyone is interested in seeing solutions that work around the lack of \b, but I don't really expect anyone to be.


It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.

I'll do this example in English.

[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.

[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".

Sadly, there isn't a regular expression feature for "word boundary". [EDIT: This turns out not to be correct, as I said in the first paragraph.] We need to make one of our own. There are two possible word beginnings: the start of a line, or whitespace separating our word from the previous word. Similarly, there are two possible word endings: end of a line, or whitespace separating our word from the next word.

Since we will be matching some extra stuff we don't want, we can put parentheses around the part of the pattern we do want.

To match two alternatives, we can make a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special notation to make a group whose contents we don't want to keep: (?:)

So, here is the pattern to match the beginning of a word. Start of line or white space: (?:^|\s)

Here is the pattern for end of word. White space or end of line: (?:\s|$)

Putting it all together, here is our final pattern:

(?:^|\s)([ocat]+)(?:\s|$)

You can build this dynamically. You don't need to hard-code the whole thing.

import re

s_pat_start = r'(?:^|\s)(['
s_pat_end = r']+)(?:\s|$)'

set_of_chars = "ocat"  # in practice, get these characters from wherever you like

s_pat = s_pat_start + set_of_chars + s_pat_end
pat = re.compile(s_pat)

Now, this doesn't in any way check for valid words. If you have the following text:

This is sensible.  This not: occo cttc

The pattern I showed you will match on occo and cttc, and those are not really words. They are strings made only of letters from [ocat] though.

So just do the same thing with Unicode strings. (If you are using Python 3.x then all strings are Unicode strings, so there you go.) Put the Tamil characters in the character class and you are good to go.

This has a confusing problem: re.findall() doesn't return all possible matches.

EDIT: Okay, I figured out what was confusing me.

What we want is for our pattern to work with re.findall() so you can collect all the words. But re.findall() only finds non-overlapping patterns. In my example, re.findall() only returned ['occo'] and not ['occo', 'cttc'] as I expected... but this is because my pattern was matching the white space after occo. The match group didn't collect the white space, but it was matched all the same, and since re.findall() wants no overlap between matches, the white space was "used up" and didn't work for cttc.
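
The effect described above can be reproduced with a short sketch that uses the same pattern and sample text. It only demonstrates the problem; it is not a fix.

```python
import re

text = "This is sensible.  This not: occo cttc"
pat = re.compile(r'(?:^|\s)([ocat]+)(?:\s|$)')

# The group captures only the word, but the whole match also includes
# the surrounding white space.
first = pat.search(text)
print(repr(first.group(0)))   # ' occo '
print(repr(first.group(1)))   # 'occo'

# The space after occo was consumed by the first match, so the next
# search starts at the 'c' of cttc, where neither ^ nor \s can match.
print(pat.findall(text))      # ['occo']
```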

The solution is to use a feature of Python regular expressions that I have never used before: special syntax that says "must not be preceded by" or "must not be followed by". The sequence \S matches any non-whitespace so we could use that. But punctuation is non-whitespace, and I think we do want punctuation to delimit a word. There is also special syntax for "must be preceded by" or "must be followed by". So here is, I think, the best we can do:

Build a string that means "match when the character class string is at start of line and followed by whitespace, or when character class string is preceded by whitespace and followed by whitespace, or when character class string is preceded by whitespace and followed by end of line, or when character class string is preceded by start of line and followed by end of line".

Here is that pattern using ocat:

r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'

I'm very sorry but I really do think this is the best we can do and still work with re.findall()!

It's actually less confusing in Python code though:

import re

NMGROUP_BEGIN = r'(?:'  # begin non-capturing group
NMGROUP_END = r')'  # end non-capturing group

WS_BEFORE = r'(?<=\s)'  # require white space before
WS_AFTER = r'(?=\s)'  # require white space after

BOL = r'^' # beginning of line
EOL = r'$' # end of line

CCS_BEGIN = r'(['  #begin a character class string
CCS_END = r']+)'  # end a character class string

PAT_OR = r'|'

set_of_chars = "ocat"  # in practice, get these characters from wherever you like

CCS = CCS_BEGIN + set_of_chars + CCS_END  # build up character class string pattern

s_pat = (NMGROUP_BEGIN +
    BOL + CCS + WS_AFTER + PAT_OR +
    WS_BEFORE + CCS + WS_AFTER + PAT_OR +
    WS_BEFORE + CCS + EOL + PAT_OR +
    BOL + CCS + EOL +
    NMGROUP_END)

pat = re.compile(s_pat)

text = "This is sensible.  This not: occo cttc"

pat.findall(text)
# returns: [('', 'occo', '', ''), ('', '', 'cttc', '')]

So, the crazy thing is that when we have alternative patterns that could match, re.findall() seems to return an empty string for the alternatives that didn't match. So we just need to filter out the length-zero strings from our results:

import itertools as it

raw_results = pat.findall(text)
results = [s for s in it.chain(*raw_results) if s]
# results set to: ['occo', 'cttc']

I guess it might be less confusing to just build four different patterns, run re.findall() on each, and join the results together.

EDIT: Okay, here is the code for building four patterns and trying each. I think this is an improvement.

import re

WS_BEFORE = r'(?<=\s)'  # require white space before
WS_AFTER = r'(?=\s)'  # require white space after

BOL = r'^' # beginning of line
EOL = r'$' # end of line

CCS_BEGIN = r'(['  #begin a character class string
CCS_END = r']+)'  # end a character class string

set_of_chars = "ocat"  # in practice, get these characters from wherever you like

CCS = CCS_BEGIN + set_of_chars + CCS_END  # build up character class string pattern

lst_s_pat = [
    BOL + CCS + WS_AFTER,
    WS_BEFORE + CCS + WS_AFTER,
    WS_BEFORE + CCS + EOL,
    BOL + CCS + EOL
]

lst_pat = [re.compile(s) for s in lst_s_pat]

text = "This is sensible.  This not: occo cttc"

result = []
for pat in lst_pat:
    result.extend(pat.findall(text))

# result set to: ['occo', 'cttc']

EDIT: Okay, here is a very different approach. I like this one best.

First, we will match all words in the text. A word is defined as one or more characters that are not punctuation and are not white space.

Then, we use a filter to remove words from the above; we keep only words that are made only of the characters we want.

import re
import string

# Create a pattern that matches all characters not part of a word.
#
# Note that several punctuation characters ('-', ']', '\' and '^') have a
# special meaning inside a character class, but they are valid punctuation
# that we want to match, so use re.escape() to disable the special meaning
# and just match them.
#
# Use '^' which negates all the chars following.  So, a word is a series
# of characters that are all not whitespace and not punctuation.

WORD_BOUNDARY = re.escape(string.whitespace + string.punctuation)

WORD = r'[^' + WORD_BOUNDARY + r']+'


# Create a pattern that matches only the words we want.

set_of_chars = "ocat"  # in practice, get these characters from wherever you like

# build up character class string pattern; the trailing '$' makes
# pat.match() accept a word only if every character is in the set
CCS = r'[' + set_of_chars + r']+$'


pat_word = re.compile(WORD)
pat = re.compile(CCS)

text = "This is sensible.  This not: occo cttc"


# This makes it clear how we are doing this.
all_words = pat_word.findall(text)
result = [s for s in all_words if pat.match(s)]

# "lazy" generator expression that yields up good results when iterated
# May be better for very large texts.
result_genexp = (s for s in (m.group(0) for m in pat_word.finditer(text)) if pat.match(s))

# force the expression to expand out to a list
result = list(result_genexp)

# result set to: ['occo', 'cttc']

EDIT: Now I don't like any of the above solutions; please see the other answer, the one using \b, for the best solution in Python.

Alexandria answered 27/1, 2013 at 5:45 Comment(2)
"Sadly, there isn't a regular expression feature for 'word boundary'" ... See \b under Python's Regular Expression Syntax. - Paeon
@MarkTolonen, thank you very much for educating me on this point. I had read through all the backslash escapes and failed to recognize that \b was what I was looking for. The bit about "match the null string" at the beginning threw me off enough that the part about "only at word boundaries" didn't click for me. I have posted another answer using \b and it is so much simpler and easier! So, thank you. Even if I don't get any points for my answer it was worth it to learn about \b. - Alexandria

It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.

I'll do this example in English.

[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.

[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".

\b[ocat]+\b Now it only matches on word boundaries. (Thank you very much @Mark Tolonen for educating me about \b.)
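
A quick sketch of the difference between the unanchored pattern and the \b version; the sample text "cat coach taco" is made up for illustration.

```python
import re

# Without boundaries, the pattern happily matches part of a word.
print(re.search(r"[ocat]+", "coach").group())        # coac

# With \b on both sides, only whole words made of the allowed letters match.
print(re.findall(r"\b[ocat]+\b", "cat coach taco"))  # ['cat', 'taco']
```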

So, just build up a pattern like the above, only using the desired character set at runtime, and there you go. You can use this pattern with re.findall() or re.finditer().

import re

words = ["cat", "dog", "tack", "coat"]

def get_words(chars_seq, words_seq=words):
    s_chars = ''.join(chars_seq)
    s_pat = r'\b[' + s_chars + r']+\b'
    pat = re.compile(s_pat)
    return [word for word in words_seq if pat.match(word)]

assert get_words(['o', 'c', 'a', 't']) == ["cat", "coat"]
assert get_words(['k', 'c', 't', 'a']) == ["cat", "tack"]
Alexandria answered 27/1, 2013 at 22:41 Comment(5)
The solution ignores Unicode issues (it might produce false positives). Also you don't need \b if you compare only with full words, just add $ to match the whole string e.g., re.match("[abc]+$", word). - Hyams
@J.F.Sebastian I'm used to the basic ASCII characters, which are complete unto themselves. Does Tamil use characters that combine together? In any event, do you have any suggestions for how to improve the answer? Also, I used \b because I wanted a pattern that could pull the words out of the source text. Yes, if we are only solving the problem of writing get_words() we could anchor the pattern with ^ and $. - Alexandria
The question explicitly mentions codepoints and user-perceived characters (graphemes), therefore the answer must be Unicode-aware. Look at my answer. - Hyams
Steveha, I really appreciate the effort you took in providing a detailed explanation. I have a very good grasp of Python regex after reading your post. As @J.F. Sebastian noted though, the codepoint-grapheme distinction with unicode is a little annoying. Conceptually, it's comparable to having accents in French or Spanish letters (which, as I understand it, also get their own codepoints). We'd want to force the accents to stick with the letter they were attached to, and not be able to combine with any other letter. While the graphemes themselves could be in any order, the codepoints cannot. - Cretan
You are welcome! I learned about \b and I learned some stuff about Unicode, and learning is even better than points. :-) - Alexandria

I would not use regular expressions to solve this problem. I would rather use collections.Counter like so:

>>> from collections import Counter
>>> def get_words(word_list, letter_string):
...     return [word for word in word_list if Counter(word) & Counter(letter_string) == Counter(word)]
...
>>> words = ["cat", "dog", "tack", "coat"]
>>> letters = 'ocat'
>>> get_words(words, letters)
['cat', 'coat']
>>> letters = 'kcta'
>>> get_words(words, letters)
['cat', 'tack']

This solution should also work for other languages. Counter(word) & Counter(letter_string) computes the intersection of the two counters: for each element x it keeps min(a[x], b[x]), where a and b are the two counters. If this intersection is equal to Counter(word), every letter the word needs is available in letter_string, so the word is returned as a match.
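
A small sketch of what the intersection does with repeated letters. Note that, unlike the regex answers, this treats the letters as a limited supply: a letter can be used only as many times as it appears in the letter string. The word "tact" is an assumed example.

```python
from collections import Counter

letters = Counter("ocat")

# "coat" needs one each of c, o, a, t: all available.
print(Counter("coat") & letters == Counter("coat"))   # True

# "tact" needs two t's but only one is available, so the intersection
# keeps min(2, 1) == 1 for 't' and no longer equals Counter("tact").
print((Counter("tact") & letters)["t"])               # 1
print(Counter("tact") & letters == Counter("tact"))   # False
```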

Brigandine answered 27/1, 2013 at 3:59 Comment(4)
Counter(word) counts codepoints, not graphemes. It might result in false positives i.e., the function might return words with characters that are not in the letter_string. - Hyams
Thanks for the response. I opted for Regex because these strings would be in an XML file (specifically, a dump of the Wiktionary API). I should have emphasized this more, but didn't want to muddle the XML issues with the regex and unicode ones. Given the size of the API, it would be resource-intensive to generate an array of the entire Tamil Wiktionary API. I figured if we could match them against a list using a regex, we could use the same regular expression in the XML search. - Cretan
To fix it, you could use Counter(chars(word)) where chars = lambda word: regex.findall(ur"\X", word). - Hyams
@AshwinBalamohan You don't have to generate the whole array at once, you can use a generator function that would generate words from xml on the fly. - Cochineal
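
A minimal sketch of the generator idea from the last comment, using the standard library's incremental XML parser. The <words>/<word> layout is purely hypothetical (a real Wiktionary dump is structured differently), and iter_words is a made-up name.

```python
import io
import re
import xml.etree.ElementTree as ET

def iter_words(source, tag="word"):
    """Yield the text of each <tag> element without building a full word list."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            if elem.text:
                yield elem.text.strip()
            elem.clear()   # drop the element's contents once processed

# Stand-in for a real file; the element names are assumptions.
sample = io.BytesIO(b"<words><word>cat</word><word>dog</word><word>coat</word></words>")

allowed = re.compile(r"[ocat]+$")
print([w for w in iter_words(sample) if allowed.match(w)])   # ['cat', 'coat']
```

The same filter could be swapped for any of the Unicode-aware match() functions above.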
