Using PLY to parse SQL statements

SELECT = r'SELECT'
FROM = r'FROM'
COLUMN = TABLE = r'[a-zA-Z]+'
COMMA = r','
STAR = r'\*'
END = r';'
t_ignore = ' ' #ignores spaces

statement : SELECT columns FROM TABLE END
columns : STAR | rec_columns
rec_columns : COLUMN | rec_columns COMMA COLUMN

#!/usr/bin/python
import ply.lex as lex
import ply.yacc as yacc

tokens = (
    'SELECT',
    'FROM',
    'WHERE',
    'TABLE',
    'COLUMN',
    'STAR',
    'COMMA',
    'END',
)

t_SELECT = r'select|SELECT'
t_FROM = r'from|FROM'
t_WHERE = r'where|WHERE'
t_TABLE = r'[a-zA-Z]+'
t_COLUMN = r'[a-zA-Z]+'
t_STAR = r'\*'
t_COMMA = r','
t_END = r';'
t_ignore = ' \t'

def t_error(t):
    print 'Illegal character "%s"' % t.value[0]
    t.lexer.skip(1)

lex.lex()

NONE, SELECT, INSERT, DELETE, UPDATE = range(5)
states = ['NONE', 'SELECT', 'INSERT', 'DELETE', 'UPDATE']
current_state = NONE

def p_statement_expr(t):
    'statement : expression'
    print states[current_state], t[1]

def p_expr_select(t):
    'expression : SELECT columns FROM TABLE END'
    global current_state
    current_state = SELECT
    print t[3]

def p_recursive_columns(t):
    '''recursive_columns : recursive_columns COMMA COLUMN'''
    t[0] = ', '.join([t[1], t[3]])

def p_recursive_columns_base(t):
    '''recursive_columns : COLUMN'''
    t[0] = t[1]

def p_columns(t):
    '''columns : STAR
               | recursive_columns'''
    t[0] = t[1]

def p_error(t):
    print 'Syntax error at "%s"' % t.value if t else 'NULL'
    global current_state
    current_state = NONE

yacc.yacc()

while True:
    try:
        input = raw_input('sql> ')
    except EOFError:
        break
    yacc.parse(input)

I think your problem is that the regular expressions for t_TABLE and t_COLUMN also match your reserved words (SELECT and FROM). In other words, SELECT a FROM b; can tokenize to something like COLUMN COLUMN COLUMN COLUMN END (or some other ambiguous tokenization), which doesn't match any of your productions, so you get a syntax error.

As a quick sanity check, change those regular expressions to match exactly what you're typing in, like this:

t_TABLE = r'b'
t_COLUMN = r'a'

You will see that the syntax SELECT a FROM b; passes because the regular expressions 'a' and 'b' don't match your reserved words.

There's another problem as well: the regular expressions for TABLE and COLUMN are identical, so the lexer has no way to tell those two tokens apart either.
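To see why two tokens with the same pattern can't coexist, here is a small sketch that uses only Python's standard re module rather than PLY itself (and Python 3, unlike the question's code). It assumes, as PLY's lexer does internally, that the token patterns are joined into one alternation of named groups. The classify helper is hypothetical and exists only for this illustration.

```python
import re

# Two token patterns that are textually identical, as in the question.
# They are joined into a single alternation of named groups, which is
# roughly how a PLY-style lexer combines its rules.
master = re.compile(r'(?P<TABLE>[a-zA-Z]+)|(?P<COLUMN>[a-zA-Z]+)')

def classify(word):
    """Return the name of the alternative that matched `word`, or None."""
    match = master.fullmatch(word)
    return match.lastgroup if match else None

for word in ('a', 'b', 'users'):
    # The first alternative always wins, so COLUMN is never reported.
    print(word, '->', classify(word))
```

Whichever of the two alternatives comes first swallows every identifier, so the other token type can never be produced, no matter what the grammar expects at that point.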

There's a subtle but relevant section in the PLY documentation about this (the part of the lexer chapter that deals with reserved words). The key point is that the tokenization pass happens first, so the lexer can't use context from your production rules to know whether it has come across a TABLE token or a COLUMN token. You need to generalize those into a single ID token, treat the reserved words as special cases of it, and then sort out what each ID means during the parse.
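The idea can be sketched without PLY, using only the standard library, so it runs as-is in Python 3. This is not the asker's code and not PLY's API: tokenize, parse_select, and RESERVED are hypothetical names chosen for the illustration. In PLY itself the same lookup would normally live inside a t_ID rule function that sets t.type from a reserved-word dictionary. One assumption to note: keywords are matched case-insensitively here, which is slightly more permissive than the original select|SELECT patterns.

```python
import re

# Reserved words, keyed by upper-cased spelling (assumption: keywords are
# case-insensitive, which is more lenient than the question's patterns).
RESERVED = {'SELECT': 'SELECT', 'FROM': 'FROM', 'WHERE': 'WHERE'}

# One generic ID pattern replaces the overlapping TABLE and COLUMN patterns.
TOKEN_RE = re.compile(r'(?P<ID>[a-zA-Z]+)|(?P<STAR>\*)|(?P<COMMA>,)|(?P<END>;)')

def tokenize(text):
    """Return a list of (type, value) pairs; raise ValueError on bad input."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():          # skip spaces and tabs
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise ValueError('Illegal character %r' % text[pos])
        kind, value = match.lastgroup, match.group()
        if kind == 'ID':
            # Reclassify identifiers that are really reserved words.
            kind = RESERVED.get(value.upper(), 'ID')
        tokens.append((kind, value))
        pos = match.end()
    return tokens

def parse_select(text):
    """Parse 'SELECT columns FROM table ;' and return (columns, table)."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            found = tokens[pos][1] if pos < len(tokens) else 'end of input'
            raise ValueError('Syntax error: expected %s, found %r' % (kind, found))
        pos += 1
        return tokens[pos - 1][1]

    expect('SELECT')
    if peek() == 'STAR':
        columns = [expect('STAR')]
    else:
        columns = [expect('ID')]         # in this position an ID is a column
        while peek() == 'COMMA':
            expect('COMMA')
            columns.append(expect('ID'))
    expect('FROM')
    table = expect('ID')                 # in this position an ID is a table
    expect('END')
    if pos != len(tokens):
        raise ValueError('Syntax error: unexpected input after ";"')
    return columns, table

print(parse_select('SELECT a, c FROM b;'))   # prints (['a', 'c'], 'b')
```

The lexer only ever says "identifier" or "keyword". Whether a given identifier is a column or a table is decided by where it appears in the statement, which is information only the parser has.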

If I had some more energy I'd try to work through your code some more and provide an actual solution in code, but I think since you've already expressed that this is a learning exercise that perhaps you will be content with me pointing in the right direction.
