I want to parse through a SQL Select Statement that has all the features a normal SQL dialect like MySQL has too. I looked for parsing libraries in python but couldn't find one that is doing the job. By that I mean I found some parsing libraries, but they were only able to parse through basic select statements (FROM and WHERE, not even ORDER BY). So as alternative I made my own parser (which I know is not a great solution at all). I spent a few hours working on it, but I keep getting an weird error and don't know how to approach it. Before I show the code I just want to mention that if you know a python library that is able to parse through SQL statements, not just SELECT but also CREATE TABLE, INSERT, etc., let me know.
Here is my language grammar string:
select_grammar = """
start: select_statement ";"
select_statement: "SELECT" column_list "FROM" table_list join_list? where_clause? groupby_clause? having_clause? orderby_clause?
column_list: "*" | column_expr ("," column_expr)*
column_expr: function_call | column_name | subquery
column_name: (table_name ".")? NAME ("AS" NAME)?
table_name: NAME ("AS" NAME)?
function_call: NAME "(" function_args ")" ("AS" NAME)?
function_args: expression ("," expression)*
where_clause: "WHERE" condition
groupby_clause: "GROUP BY" column_expr ("," column_expr)*
having_clause: "HAVING" logical_expr
orderby_clause: "ORDER BY" order_column ("," order_column)*
order_column: column_expr ["ASC" | "DESC"]?
condition: logical_expr
logical_expr: logical_term
| logical_expr "AND" logical_term
| logical_expr "OR" logical_term
| "NOT" logical_term
logical_term: comparison_expr
| "(" logical_expr ")"
| subquery
comparison_expr: expression OPERATOR expression
| expression "IS" ("NULL" | "NOT NULL")
expression: (table_name ".")? NAME | INT | string | function_call | subquery
table_list: table_name ("," table_name)* | subquery
subquery: "(" select_statement ")"
join_list: join_expr+
join_expr: join_type (table_name | subquery) "ON" condition
join_type: "INNER JOIN" | "LEFT JOIN" | "RIGHT JOIN" | "FULL JOIN"
string: ESCAPED_STRING | /'[^']*'/
OPERATOR: ">" | "<" | ">=" | "<=" | "=" | "!="
%import common.CNAME -> NAME
%import common.INT
%import common.ESCAPED_STRING
%import common.WS
%ignore WS
"""
I also created the Transformer class, which looks like this:
@v_args(inline=True)
class SelectTransformer(Transformer):
def start(self, *args):
print("start result: ", args)
return Tree("SELECT statement", args)
def column_list(self, *args):
return args
def column_expr(self, *args):
return args[0] if len(args) == 1 else args
def function_call(self, name, args, alias=None):
return (name, args, alias)
def subquery(self, value):
print("Subquery:", value)
def where_clause(self, condition=None):
return condition
def groupby_clause(self, *args):
return args
def having_clause(self, condition=None):
return condition
def orderby_clause(self, *args):
return args
def order_column(self, *args):
return args
def condition(self, *args):
return args
def logical_expr(self, *args):
return args
def logical_term(self, *args):
return args
def comparison_expr(self, *args):
return args
def expression(self, *args):
return args[0] if len(args) == 1 else args
def column_name(self, *args):
if len(args) == 1:
return args[0] # No alias present
elif len(args) == 3:
return args[0], args[2] # Alias present, return a tuple
else:
return args
def table_list(self, *args):
return args
def join_list(self, *args):
return args
def join_expr(self, *args):
return args
def join_type(self, *args):
return args
def subquery(self, *args):
return args
def string(self, value):
return value.strip("'")
def table_name(self, *args):
if len(args) == 1:
return args[0] # No alias present
elif len(args) == 3:
return args[0], args[2] # Alias present, return a tuple
else:
return args
I don't know if it matters, I also created a little function that displays the final tree nicely:
def format_ast(ast, level=0):
result = ""
indent = " " * level
if isinstance(ast, tuple):
for item in ast:
result += format_ast(item, level + 1)
elif isinstance(ast, Token):
result += f"{indent}{ast.type}, Token('{ast.value}')\n"
elif isinstance(ast, Tree):
result += f"{indent}Tree({ast.data}), [\n"
for child in ast.children:
result += format_ast(child, level + 1)
result += f"{indent}]\n"
else:
result += f"{indent}{ast}\n"
return result
Here's the statement I'm parsing:
sql_query = 'SELECT ' \
'name AS alias, ' \
'COUNT(age) AS age_alias, ' \
'(SELECT department_name FROM departments WHERE department_id = employees.department_id) ' \
'FROM employees AS emp, department ' \
'INNER JOIN departments AS dep ON employees.department_id = departments.id ' \
'LEFT JOIN other_table AS ot ON other_table.id = employees.table_id ' \
'WHERE age > 25 ' \
'GROUP BY age, name ' \
'HAVING COUNT(age) > 1 ' \
'ORDER BY name ASC, age DESC;'
The code I'm executing is this:
parser = Lark(select_with_joins_grammar, parser='lalr', transformer=SelectTransformer())
tree = parser.parse(sql_query)
# Print the custom export format
print(format_ast(tree))
The problem is related to the method join_type() of my class SelectTransformer. Somehow *args is always empty, although it should theoretically contain (like defined in the rule) "INNER JOIN" or "LEFT JOIN" or "RIGHT JOIN" or "FULL JOIN". My output looks like this:
Tree(SELECT statement), [
Tree(select_statement), [
NAME, Token('name')
NAME, Token('alias')
NAME, Token('COUNT')
Tree(function_args), [
NAME, Token('age')
]
NAME, Token('age_alias')
Tree(select_statement), [
NAME, Token('department_name')
NAME, Token('departments')
NAME, Token('department_id')
OPERATOR, Token('=')
NAME, Token('employees')
NAME, Token('department_id')
]
NAME, Token('employees')
NAME, Token('emp')
NAME, Token('department')
NAME, Token('departments')
NAME, Token('dep')
NAME, Token('employees')
NAME, Token('department_id')
OPERATOR, Token('=')
NAME, Token('departments')
NAME, Token('id')
NAME, Token('other_table')
NAME, Token('ot')
NAME, Token('other_table')
NAME, Token('id')
OPERATOR, Token('=')
NAME, Token('employees')
NAME, Token('table_id')
NAME, Token('age')
OPERATOR, Token('>')
INT, Token('25')
NAME, Token('age')
NAME, Token('name')
NAME, Token('COUNT')
Tree(function_args), [
NAME, Token('age')
]
None
OPERATOR, Token('>')
INT, Token('1')
NAME, Token('name')
NAME, Token('age')
]
]
As you can see, no join type is displayed. I am relatively new to parsing so I don't really know what to try.
The answer is that case matters. The grammar defines a combination of rules and terminals. Rules have lower case names, while terminals have upper-case names. It appears that only terminals will capture their matched content as tokens. (There may be a more formal way of saying this, but this is accurate enough for this discussion.)
So instead of:
join_expr: join_type (table_name | subquery) "ON" condition
join_type : "INNER JOIN" | "LEFT JOIN" | "RIGHT JOIN" | "FULL JOIN"
try:
join_expr: JOIN_TYPE (table_name | subquery) "ON" condition
JOIN_TYPE: "INNER JOIN" | "LEFT JOIN" | "RIGHT JOIN" | "FULL JOIN"
That will yield results that includes content like JOIN_TYPE, Token('INNER JOIN')
and JOIN_TYPE, Token('LEFT JOIN')
.
Another alternative is to make each join type its own rule, as follows:
join_type: inner_join | left_join | right_join | full_join
inner_join: "INNER"? "JOIN"
left_join: "LEFT" "OUTER"? "JOIN"
right_join: "RIGHT" "OUTER"? "JOIN"
full_join: "FULL" "OUTER"? "JOIN"
The above will yield distinct nodes instead of tokens in the parsed syntax, presented as Tree(inner_join)
and Tree(left_join)
in the output. The exact syntax variation used ("JOIN" vs "INNER JOIN") is not captured.
There may be other approaches for capturing the join type, but this is not my area of expertise.
Note that the syntax still has a long way to go, such as allowing alternatives like "JOIN", "LEFT OUTER JOIN", "FULL OUTER JOIN", "CROSS APPLY", "OUTER APPLY", and nested joins of the form TableA A LEFT JOIN (TableB B JOIN TableC C ON C.X = B.X) ON B.Y = A.Y
. You've taken on quite a challenge.