I did something like that once; you can probably pick it up and adapt it to your needs.
TL;DR: thanks to Python's eval, doing this is a breeze.
The problem was to parse dates and durations in textual form. What I did was create a YAML file mapping regex patterns to results. Each result is a Python expression that is evaluated with the match object and has access to other functions and variables defined elsewhere in the file.
For example, the following snippet recognizes dates like "l'11 agosto del 1993" (Italian for "August 11th, 1993"):

    __meta_vars__:
      month: (gennaio|febbraio|marzo|aprile|maggio|giugno|luglio|agosto|settembre|ottobre|novembre|dicembre)
      prep_art: (il\s|l\s?'\s?|nel\s|nell\s?'\s?|del\s|dell\s?'\s?)
      schema:
        date: http://www.w3.org/2001/XMLSchema#date
    __meta_funcs__:
      - >
        def month_to_num(month):
            """ gennaio -> 1, febbraio -> 2, ..., dicembre -> 12 """
            try:
                return index_in_or(meta_vars['month'], month) + 1
            except ValueError:
                return month
    Tempo:
      - \b{prep_art}(?P<day>\d{{1,2}}) (?P<month>{month}) {prep_art}?\s*(?P<year>\d{{4}}): >
          '"{}-{:02d}-{:02d}"^^<{schema}>'.format(match.group('year'),
                                                  month_to_num(match.group('month')),
                                                  int(match.group('day')),
                                                  schema=schema['date'])

__meta_funcs__ and __meta_vars__ (not the best names, I know) define functions and variables that are accessible to the match transformation rules. To make the rules easier to write, the pattern is formatted using the meta-variables, so that {month} is replaced with the pattern matching all months. The transformation rule calls the meta-function month_to_num to convert the month to a number from 1 to 12, and reads from the schema meta-variable. In the example above, the match results in the string "1993-08-11"^^<http://www.w3.org/2001/XMLSchema#date>, but some other rules would produce a dictionary.
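To see the pattern formatting in isolation, here is a minimal Python 3 sketch. It copies the two meta-variables from the YAML above into a plain dict (standing in for the loaded file) and assumes the substitution is done with str.format, which is what the doubled braces in the rule suggest. The names raw_pattern and regex are mine, not from the original code.

```python
import re

# Meta-variables copied from the YAML example; a plain dict stands in for the loaded file.
meta_vars = {
    'month': '(gennaio|febbraio|marzo|aprile|maggio|giugno|luglio|agosto|'
             'settembre|ottobre|novembre|dicembre)',
    'prep_art': r"(il\s|l\s?'\s?|nel\s|nell\s?'\s?|del\s|dell\s?'\s?)",
}

# The rule pattern as written in the file: {name} is a placeholder,
# while {{ and }} are escaped braces that str.format turns into { and }.
raw_pattern = (r"\b{prep_art}(?P<day>\d{{1,2}}) (?P<month>{month}) "
               r"{prep_art}?\s*(?P<year>\d{{4}})")

# Substitute the meta-variables (assumed to happen via str.format), then compile.
regex = re.compile(raw_pattern.format(**meta_vars))

match = regex.search("l'11 agosto del 1993")
print(match.group('day'), match.group('month'), match.group('year'))
# prints: 11 agosto 1993
```

Because the placeholders are expanded before compilation, a long alternation such as the month list is written once and reused by every rule that needs it.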
Doing this is quite easy in Python, as you can use exec to evaluate strings as Python code (obligatory warning about security implications). The meta-functions and meta-variables are evaluated and stored in a dictionary, which is then passed to the match transformation rules.
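As a rough illustration of that mechanism, the Python 3 sketch below runs exec on a function definition held in a string, storing it in a dictionary, and then passes that dictionary as the globals of eval. The month_to_num here is a simplified stand-in that does not rely on index_in_or, and the regex is a shortened, hypothetical one. Both exec and eval should only ever see rule files you trust.

```python
import re

# A meta-function as it would arrive from the rule file: just a string.
# Simplified stand-in for month_to_num (assumption: no index_in_or helper).
func_source = '''
def month_to_num(month):
    months = ['gennaio', 'febbraio', 'marzo', 'aprile', 'maggio', 'giugno',
              'luglio', 'agosto', 'settembre', 'ottobre', 'novembre', 'dicembre']
    return months.index(month) + 1
'''

namespace = {}
exec(func_source, namespace)  # defines month_to_num inside `namespace`
namespace['schema'] = {'date': 'http://www.w3.org/2001/XMLSchema#date'}

# A transformation rule: a Python expression, also stored as a string.
transform = """'"{}-{:02d}-{:02d}"^^<{schema}>'.format(
    match.group('year'), month_to_num(match.group('month')),
    int(match.group('day')), schema=schema['date'])"""

# Shortened, hypothetical regex; the real rules are richer.
match = re.search(r"(?P<day>\d{1,2}) (?P<month>[a-z]+) (?P<year>\d{4})",
                  "11 agosto 1993")

# Globals: the meta namespace. Locals: only the current match object.
result = eval(transform, namespace, {'match': match})
print(result)
# prints: "1993-08-11"^^<http://www.w3.org/2001/XMLSchema#date>
```

Keeping the match in the locals dictionary means each evaluation gets a fresh match without touching the shared namespace.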
The code is on GitHub; feel free to ask if you need any clarification. Relevant parts, slightly edited:

    class DateNormalizer:
        def _meta_init(self, specs):
            """ Reads the meta variables and the meta functions from the specification
            :param dict specs: The specifications loaded from the file
            :return: None
            """
            self.meta_vars = specs.pop('__meta_vars__')
            # compile meta functions in a dictionary
            # (the call form exec(code, namespace) works on both Python 2 and 3)
            self.meta_funcs = {}
            for f in specs.pop('__meta_funcs__'):
                exec(f, self.meta_funcs)
            # make meta variables available to the meta functions just defined
            self.meta_funcs['__builtins__']['meta_vars'] = self.meta_vars
            self.globals = self.meta_funcs
            self.globals.update(self.meta_vars)

        def normalize(self, expression):
            """ Find the first matching part in the given expression
            :param str expression: The expression in which to search the match
            :return: Tuple with (start, end), category, result
            :rtype: tuple
            """
            expression = expression.lower()
            # self.regexes maps each category to (compiled regex, transform) pairs;
            # it is built elsewhere in the class
            for category, regexes in self.regexes.items():
                for regex, transform in regexes:
                    match = regex.search(expression)
                    if match:
                        result = eval(transform, self.globals, {'match': match})
                        start, end = match.span()
                        return (start, end), category, result

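The excerpt does not show how self.regexes is built. One way it could be done, sketched here in Python 3 under my own assumptions, is a helper that formats each pattern with the meta-variables and pre-compiles both the regex and the expression. compile_rules, the sep meta-variable and the Date rule are all hypothetical, and a plain dict stands in for the loaded YAML.

```python
import re

def compile_rules(specs, meta_vars):
    """Hypothetical helper: turn {category: [{pattern: transform}, ...]}
    into {category: [(compiled_regex, code_object), ...]}."""
    regexes = {}
    for category, rules in specs.items():
        compiled = []
        for rule in rules:
            for pattern, transform in rule.items():
                regex = re.compile(pattern.format(**meta_vars))
                # compile() does not tolerate leading whitespace, so strip first;
                # eval() accepts the resulting code object as well as a string.
                code = compile(transform.strip(), '<rule>', 'eval')
                compiled.append((regex, code))
        regexes[category] = compiled
    return regexes

# Stand-in for the specs left after the two __meta_*__ keys are popped.
meta_vars = {'sep': r'[-/]'}
specs = {
    'Date': [{
        r'(?P<d>\d{{1,2}}){sep}(?P<m>\d{{1,2}}){sep}(?P<y>\d{{4}})':
            "'{}-{:02d}-{:02d}'.format(match.group('y'), "
            "int(match.group('m')), int(match.group('d')))",
    }],
}

regexes = compile_rules(specs, meta_vars)
regex, transform = regexes['Date'][0]
match = regex.search('due on 3/7/2015')
print(eval(transform, {}, {'match': match}))
# prints: 2015-07-03
```

A dictionary of this shape could be assigned to self.regexes, since normalize only needs (regex, transform) pairs that eval can handle.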