Using PLY to parse SQL statements
Asked Answered
M

1

7

I know there are other tools out there to parse SQL statements, but I am rolling out my own for educational purposes. I am getting stuck with my grammar right now.. If you can spot an error real quick please let me know.

SELECT = r'SELECT'
FROM = r'FROM'
COLUMN = TABLE = r'[a-zA-Z]+'
COMMA = r','
STAR = r'\*'
END = r';'
t_ignore = ' ' #ignores spaces

statement : SELECT columns FROM TABLE END

columns : STAR
        | rec_columns

rec_columns : COLUMN
            | rec_columns COMMA COLUMN

When I try to parse a statement like 'SELECT a FROM b;' I get an syntax error at the FROM token... Any help is greatly appreciated!

(Edit) Code:

#!/usr/bin/python
import ply.lex as lex
import ply.yacc as yacc

tokens = (
    'SELECT',
    'FROM',
    'WHERE',
    'TABLE',
    'COLUMN',
    'STAR',
    'COMMA',
    'END',
)

t_SELECT    = r'select|SELECT'
t_FROM      = r'from|FROM'
t_WHERE     = r'where|WHERE'
t_TABLE     = r'[a-zA-Z]+'
t_COLUMN    = r'[a-zA-Z]+'
t_STAR      = r'\*'
t_COMMA     = r','
t_END       = r';'

t_ignore    = ' \t'

def t_error(t):
    print 'Illegal character "%s"' % t.value[0]
    t.lexer.skip(1)

lex.lex()

NONE, SELECT, INSERT, DELETE, UPDATE = range(5)
states = ['NONE', 'SELECT', 'INSERT', 'DELETE', 'UPDATE']
current_state = NONE

def p_statement_expr(t):
    'statement : expression'
    print states[current_state], t[1]

def p_expr_select(t):
    'expression : SELECT columns FROM TABLE END'
    global current_state
    current_state = SELECT
    print t[3]


def p_recursive_columns(t):
    '''recursive_columns : recursive_columns COMMA COLUMN'''
    t[0] = ', '.join([t[1], t[3]])

def p_recursive_columns_base(t):
    '''recursive_columns : COLUMN'''
    t[0] = t[1]

def p_columns(t):
    '''columns : STAR
               | recursive_columns''' 
    t[0] = t[1]

def p_error(t):
    print 'Syntax error at "%s"' % t.value if t else 'NULL'
    global current_state
    current_state = NONE

yacc.yacc()


while True:
    try:
        input = raw_input('sql> ')
    except EOFError:
        break
    yacc.parse(input)
Mcdonough answered 8/9, 2011 at 22:57 Comment(8)
What if you put a space between b and ;?Tasse
Still the same result, I forgot to add that it ignores spaces... I'll edit that in.Mcdonough
As far as I know, most databases will support at least [a-zA-Z_0-9] as table names.Stander
Yeah, I am just trying to figure what is wrong with the grammar of this select statement.Mcdonough
@NullUserException: As of this discussion becomes a little offtopp y - here is a huge list of different grammars (but for antlr) antlr.org/grammar/listTasse
@sampwing: curious - does it work without FROM part?Tasse
@zerkms: thanks for points out ANTLR it looks like its development is much more active than PLY. As far as leaving out the FROM part, if I try 'SELECT a ;', it will instead complain about the semicolon token...Mcdonough
I think it might help to post the actual code, including full token definitions (t_)Chapple
C
6

I think your problem is that your regular expressions for t_TABLE and t_COLUMN are also matching your reserved words (SELECT and FROM). In other words, SELECT a FROM b; tokenizes to something like COLUMN COLUMN COLUMN COLUMN END (or some other ambiguous tokenization) and this doesn't match any of your productions so you get a syntax error.

As a quick sanity check, change those regular expressions to match exactly what you're typing in like this:

t_TABLE = r'b'
t_COLUMN = r'a'

You will see that the syntax SELECT a FROM b; passes because the regular expressions 'a' and 'b' don't match your reserved words.

And, there's another problem that the regular expressions for TABLE and COLUMN overlap as well, so the lexer can't tokenize without ambiguity with respect to those tokens either.

There's a subtle, but relevant section in the PLY documentation regarding this. Not sure the best way to explain this, but the trick is that the tokenization pass happens first so it can't really use context from your production rules to know whether it has come across a TABLE token or a COLUMN token. You need to generalize those into some kind of ID token and then weed things out during the parse.

If I had some more energy I'd try to work through your code some more and provide an actual solution in code, but I think since you've already expressed that this is a learning exercise that perhaps you will be content with me pointing in the right direction.

Chapple answered 9/9, 2011 at 4:42 Comment(2)
I didn't walk through your response with my code completely yet.. But when you referenced the documentation, it gave me some insight.. Much appreciated Joe!Mcdonough
Read through the document when I was more awake.. The fix was obvious, thanks again Joe.Mcdonough

© 2022 - 2024 — McMap. All rights reserved.