How can I create a regex from a list of words?

Asked 18/2, 2013 at 21:26 Answered 18/2, 2013 at 21:59

I have a dict of words (actually I have nested dicts of verb conjugations, but that isn't relevant) and I want to make a regex by combining them.

{
  'yo': 'hablaba',
  'tú': 'hablabas',
  'él': 'hablaba',
  'nosotros': 'hablábamos',
  'vosotros': 'hablabais',
  'ellos': 'hablaban',
  'vos': 'hablabas',
}

... to make:

'habl((aba(s|is|n)?)|ábamos)' # I think that's right

If I don't include 'hablábamos' it's easy - they're all the same prefix, and I can get:

'hablaba(s|is|n)?'

... but I want a general form. Is that possible?

Porky answered 18/2, 2013 at 21:26 Comment(3)

Are you trying to generate a regular expression from the values in the dictionary? Or are you trying to write a regular expression to validate the values in the dictionary. Or something else entirely? – Keyhole 18/2, 2013 at 21:30

I want to generate it. Is my tag wrong? – Porky 18/2, 2013 at 21:33

There is a JavaScript library that will do this for you: github.com/devongovett/regexgen (maybe Python has something similar?) – Warranty 12/4, 2019 at 21:34

Yes, I believe this is possible.

To get you started, this is how I would break down the problem.

Calculate the root by finding the longest possible string that matches the start of all of the declined values:

>>> root = ''
>>> for c in hablar['yo']:
...     if all(v.startswith(root + c) for v in hablar.itervalues()):
...         root += c
...     else:
...        break
... 
>>> root
'habl'

Whatever's left of the words makes a list of endings.

>>> endings = [v[len(root):] for v in hablar.itervalues()]
>>> print endings
['abas', 'aba', 'abais', 'aba', '\xc3\xa1bamos', 'aban', 'abas']

You may then want to weed out the duplicates:

>>> unique_endings = set(endings)
>>> print unique_endings
set(['abas', 'abais', '\xc3\xa1bamos', 'aban', 'aba'])

Then join these endings together with pipes:

>>> conjoined_endings = '|'.join(unique_endings)
>>> print conjoined_endings
abas|abais|ábamos|aban|aba

Forming the regular expression is a simple matter combining the root and the conjoined_endings string in parentheses:

>>> final_regex = '{}({})'.format(root, conjoined_endings)
>>> print final_regex
habl(abas|abais|ábamos|aban|aba)

Keyhole answered 18/2, 2013 at 21:59 Comment(3)

Thank you @Johnsyweb, yes that helps. But I cant vote you up :( "requires 15 reputation". Shall I accept you? – Porky 18/2, 2013 at 22:18

@MalenaTorres: you're welcome. Hopefully this will get you started even if my linguistic terms are off. I'm curious as to why you want to compress the regular expressions so much, you're not dealing with large amount of data and more complex expressions are only going to add to your validation times. – Keyhole 19/2, 2013 at 0:55

I made my example more simple than it is, really it will be like {'yo': '\w+aba'}, &c. Finally I want to compare irregular verbs to regular for their rules, and I will have another dict like yo = {'imperfecto': '\w+aba', 'presente': '\w+o'}. More complicated for irregular verbs though, now I am just starting out with my idea to see what I can do. – Porky 19/2, 2013 at 15:12

I think you need to have a less clever approach

>>> x={
...   'yo': 'hablaba',
...   'tú': 'hablabas',
...   'él': 'hablaba',
...   'nosotros': 'hablábamos',
...   'vosotros': 'hablabais',
...   'ellos': 'hablaban',
...   'vos': 'hablabas',
... }
>>> x
{'t\xc3\xba': 'hablabas', 'yo': 'hablaba', 'vosotros': 'hablabais', '\xc3\xa9l': 'hablaba', 'nosotros': 'habl\xc3\xa1bamos', 'ellos': 'hablaban', 'vos': 'hablabas'}
>>> x.values
<built-in method values of dict object at 0x20e6490>
>>> x.values()
['hablabas', 'hablaba', 'hablabais', 'hablaba', 'habl\xc3\xa1bamos', 'hablaban', 'hablabas']
>>> "|".join(x.values())
'hablabas|hablaba|hablabais|hablaba|habl\xc3\xa1bamos|hablaban|hablabas'

If you just join the hash values with an alternation operator then it should do what you want

Scissile answered 18/2, 2013 at 21:33 Comment(4)

Thank you Vorsprung :) But I have lots of words and other conjugations (that one I gave is imperfect conjugation, there are about 15 more) and I dont want to use too much space. But yes your idea works :) – Porky 18/2, 2013 at 21:42

I always think that computer memory is cheaper than my valuable time :) – Scissile 18/2, 2013 at 22:11

There is certainly a lot to be said for keeping it simple! – Keyhole 19/2, 2013 at 0:56

@MalenaTorres: please add that motivating detail to the original question statement above. It makes a difference. Also, do you really want the regex with the fewest chars? (or just a fairly optimal '|'-separated concatenated list taking advantage of shared prefixes?) – Expurgatory 1/12, 2017 at 2:23

Recommended topics

Hot tags