Efficiently match multiple regexes in Python

Lexical analyzers are quite easy to write when you have regexes. Today I wanted to write a simple general analyzer in Python, and came up with:

import re
import sys

class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position. 
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)


class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos


class Lexer(object):
    """ A simple regex-based lexer/tokenizer.

        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to 
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        self.rules = []

        for regex, type in rules:
            self.rules.append((re.compile(regex), type))

        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile(r'\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the 
            input buffer. None is returned if the end of the 
            buffer was reached. 
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf[self.pos:])

                if m:
                    self.pos += m.start()
                else:
                    return None

            for token_regex, token_type in self.rules:
                m = token_regex.match(self.buf[self.pos:])

                if m:
                    value = self.buf[self.pos + m.start():self.pos + m.end()]
                    tok = Token(token_type, value, self.pos)
                    self.pos += m.end()
                    return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while True:
            tok = self.token()
            if tok is None:
                break
            yield tok


if __name__ == '__main__':
    rules = [
        (r'\d+',            'NUMBER'),
        (r'[a-zA-Z_]\w*',   'IDENTIFIER'),
        (r'\+',             'PLUS'),
        (r'\-',             'MINUS'),
        (r'\*',             'MULTIPLY'),
        (r'\/',             'DIVIDE'),
        (r'\(',             'LP'),
        (r'\)',             'RP'),
        ('=',               'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902)  ')

    try:
        for tok in lx.tokens():
            print(tok)
    except LexerError as err:
        print('LexerError at position', err.pos)

It works just fine, but I'm a bit worried that it's too inefficient. Are there any regex tricks that will allow me to write it in a more efficient/elegant way?

Specifically, is there a way to avoid looping over all the regex rules linearly to find one that fits?

Bryonbryony answered 25/9, 2008 at 15:10 Comment(0)

You can merge all your regexes into one using the "|" operator and let the regex library do the work of discerning between tokens. Some care should be taken to ensure the precedence of tokens (for example, to avoid matching a keyword as an identifier).
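
For example, here is a minimal sketch of the merged-pattern idea (the rules and names below are illustrative, not from the question); listing the keyword alternative before the identifier rule is what gives it precedence, since alternatives are tried left to right:

import re

# Illustrative rules only: the keyword 'if' comes before the generic
# identifier rule so that it wins when both could match.
rules = [
    ('IF',     r'if\b'),
    ('NUMBER', r'\d+'),
    ('ID',     r'[A-Za-z_]\w*'),
    ('PLUS',   r'\+'),
]
merged = re.compile('|'.join('(?P<%s>%s)' % (name, pattern)
                             for name, pattern in rules))

m = merged.match('if')
print(m.lastgroup, m.group())    # IF if
m = merged.match('ifx')
print(m.lastgroup, m.group())    # ID ifx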

Uttasta answered 25/9, 2008 at 15:54 Comment(5)
How do I make it return the right type for each one of the choices? – Bryonbryony
Use capturing groups. Enclosing a part of a regex in parentheses makes it a capturing group that can be retrieved from the match object, for example re.match("(a)|(b)","b").groups() = (None,"b"). The first group didn't match, the second one matched "b". – Guadalajara
But I'll still have to linearly walk over the capture groups? – Bryonbryony
I think that using named capture groups, together with the lastgroup attribute of the match object, lets you avoid the walk. For example re.match("(?P<ag>a)|(?P<bg>b)","b").lastgroup = 'bg'. – Guadalajara
@EliBendersky: Not if the regexp implementation is smart enough. The alternatives will be merged in such a way that if they share a common prefix, it is recognized in a single pass, and only the differing characters cause a split. For example, with the pattern "for|foreach|forbidden" and the string "foreach", it should match the first three letters (the common prefix) only once, and then choose the correct path depending on the first non-common character; here 'e' would choose the 2nd option and verify that the rest of it, 'ach', still matches. If it doesn't, none of the others can either. – Ourselves
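
Putting the comments above together, here is a rough sketch (not from the answer) of how the question's Lexer.token() could use a single combined pattern plus lastgroup; it assumes the Token and LexerError classes from the question, and that __init__ builds self.regex by joining the rules into uniquely named groups:

# Assumed to be built in Lexer.__init__ (sketch):
#     parts = ['(?P<%s>%s)' % (type, regex) for regex, type in rules]
#     self.regex = re.compile('|'.join(parts))

def token(self):
    if self.pos >= len(self.buf):
        return None
    if self.skip_whitespace:
        m = self.re_ws_skip.search(self.buf, self.pos)  # pos argument: no slicing
        if not m:
            return None
        self.pos = m.start()
    m = self.regex.match(self.buf, self.pos)
    if not m:
        raise LexerError(self.pos)
    tok = Token(m.lastgroup, m.group(), self.pos)
    self.pos = m.end()
    return tok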

I suggest using the re.Scanner class; it's not documented in the standard library, but it's well worth using. Here's an example:

import re

scanner = re.Scanner([
    (r"-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?", lambda scanner, token: float(token)),
    (r"-?[0-9]+", lambda scanner, token: int(token)),
    (r" +", lambda scanner, token: None),
])

>>> scanner.scan("0 -1 4.5 7.8e3")[0]
[0, -1, 4.5, 7800.0]
Sampler answered 9/11, 2010 at 16:59 Comment(2)
I think tokens should be a list of (text, tag) pairs. Returning just the sequence of matched values isn't very useful for subsequent parsing. – Victorious
It's also rather unfortunate that this isn't implemented as a generator. It parses the whole thing in one go and returns a list. – Metamer
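
Addressing the first comment above: the callbacks receive the scanner and the matched text, so they can just as well return (tag, value) pairs; a small sketch (the tag names are illustrative):

import re

scanner = re.Scanner([
    (r"-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?", lambda s, tok: ('FLOAT', float(tok))),
    (r"-?[0-9]+",                        lambda s, tok: ('INT', int(tok))),
    (r" +",                              lambda s, tok: None),  # whitespace is dropped
])

tokens, remainder = scanner.scan("0 -1 4.5 7.8e3")
print(tokens)     # [('INT', 0), ('INT', -1), ('FLOAT', 4.5), ('FLOAT', 7800.0)]
print(remainder)  # '' when the whole input was consumed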

I found this in the Python documentation. It's simple and elegant.

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
        ('ASSIGN',  r':='),          # Assignment operator
        ('END',     r';'),           # Statement terminator
        ('ID',      r'[A-Za-z]+'),   # Identifiers
        ('OP',      r'[+*\/\-]'),    # Arithmetic operators
        ('NEWLINE', r'\n'),          # Line endings
        ('SKIP',    r'[ \t]'),       # Skip over spaces and tabs
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    get_token = re.compile(tok_regex).match
    line = 1
    pos = line_start = 0
    mo = get_token(s)
    while mo is not None:
        typ = mo.lastgroup
        if typ == 'NEWLINE':
            line_start = pos
            line += 1
        elif typ != 'SKIP':
            val = mo.group(typ)
            if typ == 'ID' and val in keywords:
                typ = val
            yield Token(typ, val, line, mo.start()-line_start)
        pos = mo.end()
        mo = get_token(s, pos)
    if pos != len(s):
        raise RuntimeError('Unexpected character %r on line %d' %(s[pos], line))

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

The trick here is the line:

tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)

Here (?P<ID>PATTERN) will tag the matched result with the name given by ID, which the code above reads back through mo.lastgroup.
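
For reference, this is the single combined pattern that line produces for the specification above:

>>> print(tok_regex)
(?P<NUMBER>\d+(\.\d*)?)|(?P<ASSIGN>:=)|(?P<END>;)|(?P<ID>[A-Za-z]+)|(?P<OP>[+*\/\-])|(?P<NEWLINE>\n)|(?P<SKIP>[ \t])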

Hanghangar answered 17/2, 2013 at 8:51 Comment(0)

match() is anchored, and compiled patterns accept a position argument, so there's no need to slice the buffer:

pos = 0
end = len(text)
while pos < end:
    match = regexp.match(text, pos)   # regexp is a compiled pattern, text the input
    if match is None:
        raise LexerError(pos)         # nothing matched at this position
    # do something with your match
    pos = match.end()

Have a look at Pygments, which ships a large collection of lexers for syntax highlighting purposes with different implementations, most based on regular expressions.
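
For instance, a tiny usage sketch (assumes the third-party pygments package is installed):

from pygments import lex
from pygments.lexers import PythonLexer

# lex() yields (token_type, text) pairs produced by the given lexer.
for tokentype, value in lex('x = 1 + 2', PythonLexer()):
    print(tokentype, repr(value))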

Lecherous answered 25/9, 2008 at 15:38 Comment(3)
How does what help? Anchoring? No need to slice the text. – Lecherous
I see. So I guess I'll be able to save the time slicing takes? – Bryonbryony
Not only the time, but also the memory for the slice. What's also important is that if you use anchoring, "^" and "$" will work as expected. – Lecherous

It's possible that combining the token regexes will work, but you'd have to benchmark it. Something like:

import re

x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')
a = x.match('9999')
if a:
    groups = a.groupdict()  # => {'NUMBER': '9999', 'VAR': None}
    token = [item for item in groups.items() if item[1] is not None][0]

The filter is where you'll have to do some benchmarking...

Update: I tested this, and it seems as though if you combine all the tokens as stated and write a function like:

def find_token(lst):
    for tok in lst:
        if tok[1] is not None:
            return tok
    raise Exception('no matching token')

You'll get roughly the same speed (maybe a teensy bit faster). I believe the saving comes from the reduced number of calls to match, but the loop for token discrimination is still there, which of course kills it.

Gain answered 25/9, 2008 at 19:24 Comment(1)
You can use a.lastgroup to get the name of the last matched group, and a.group(a.lastgroup) to get the matching string for that group. No need to build the whole dictionary and find the entry that's not None. – Metamer
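
Applied to the snippet in this answer, that shortcut would look like this:

if a:
    token = (a.lastgroup, a.group(a.lastgroup))   # ('NUMBER', '9999')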

This isn't exactly a direct answer to your question, but you might want to look at ANTLR. According to this document the python code generation target should be up to date.

As for your regexes, there are really two ways to go about speeding things up if you're sticking with regexes. The first would be to order your regexes by the probability of their tokens occurring in typical input. You could do this by adding a simple profiler to the code that collects token counts for each token type and running the lexer on a representative body of text. The other solution would be to bucket-sort your regexes (since your key space, being a character, is relatively small) and then use an array or dictionary to select the candidate regexes after a single discrimination on the first character.
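
A rough sketch of the first-character bucketing idea (everything below is illustrative; it assumes you can list the possible first characters for each rule):

import re
from collections import defaultdict

# Each rule carries: pattern, token type, characters a token of that type can start with.
rules = [
    (r'\d+',          'NUMBER',     '0123456789'),
    (r'[A-Za-z_]\w*', 'IDENTIFIER', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                                    'abcdefghijklmnopqrstuvwxyz_'),
    (r'\+',           'PLUS',       '+'),
    (r'-',            'MINUS',      '-'),
]

buckets = defaultdict(list)
for pattern, token_type, first_chars in rules:
    compiled = re.compile(pattern)
    for ch in first_chars:
        buckets[ch].append((compiled, token_type))

def candidate_rules(text, pos):
    # Only rules whose tokens can start with text[pos] need to be tried.
    return buckets.get(text[pos], [])

print([t for _, t in candidate_rules('x = 1', 0)])   # ['IDENTIFIER']
print([t for _, t in candidate_rules('x = 1', 4)])   # ['NUMBER']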

However, I think that if you're going to go this route, you should really try something like ANTLR which will be easier to maintain, faster, and less likely to have bugs.

Rickey answered 25/9, 2008 at 15:40 Comment(0)
