pythonsqlparsingabstract-syntax-treelark-parser

SQL Select Statement Parser not returning JOIN type


I want to parse through a SQL Select Statement that has all the features a normal SQL dialect like MySQL has too. I looked for parsing libraries in python but couldn't find one that is doing the job. By that I mean I found some parsing libraries, but they were only able to parse through basic select statements (FROM and WHERE, not even ORDER BY). So as alternative I made my own parser (which I know is not a great solution at all). I spent a few hours working on it, but I keep getting an weird error and don't know how to approach it. Before I show the code I just want to mention that if you know a python library that is able to parse through SQL statements, not just SELECT but also CREATE TABLE, INSERT, etc., let me know.

Here is my language grammar string:

select_grammar = """
    start: select_statement ";"

    select_statement: "SELECT" column_list "FROM" table_list join_list? where_clause? groupby_clause? having_clause? orderby_clause?

    column_list: "*" | column_expr ("," column_expr)*

    column_expr: function_call | column_name | subquery
    
    column_name: (table_name ".")? NAME ("AS" NAME)?
    
    table_name: NAME ("AS" NAME)?

    function_call: NAME "(" function_args ")" ("AS" NAME)?

    function_args: expression ("," expression)*

    where_clause: "WHERE" condition

    groupby_clause: "GROUP BY" column_expr ("," column_expr)*

    having_clause: "HAVING" logical_expr

    orderby_clause: "ORDER BY" order_column ("," order_column)*

    order_column: column_expr ["ASC" | "DESC"]?

    condition: logical_expr

    logical_expr: logical_term
                | logical_expr "AND" logical_term
                | logical_expr "OR" logical_term
                | "NOT" logical_term

    logical_term: comparison_expr
                | "(" logical_expr ")"
                | subquery

    comparison_expr: expression OPERATOR expression
                    | expression "IS" ("NULL" | "NOT NULL")

    expression: (table_name ".")? NAME | INT | string | function_call | subquery

    table_list: table_name ("," table_name)* | subquery

    subquery: "(" select_statement ")"

    join_list: join_expr+

    join_expr: join_type (table_name | subquery) "ON" condition

    join_type: "INNER JOIN" | "LEFT JOIN" | "RIGHT JOIN" | "FULL JOIN"

    string: ESCAPED_STRING | /'[^']*'/

    OPERATOR: ">" | "<" | ">=" | "<=" | "=" | "!="

    %import common.CNAME -> NAME
    %import common.INT
    %import common.ESCAPED_STRING
    %import common.WS
    %ignore WS
"""

I also created the Transformer class, which looks like this:

@v_args(inline=True)
class SelectTransformer(Transformer):
    def start(self, *args):
        print("start result: ", args)
        return Tree("SELECT statement", args)

    def column_list(self, *args):
        return args

    def column_expr(self, *args):
        return args[0] if len(args) == 1 else args

    def function_call(self, name, args, alias=None):
        return (name, args, alias)

    def subquery(self, value):
        print("Subquery:", value)

    def where_clause(self, condition=None):
        return condition

    def groupby_clause(self, *args):
        return args

    def having_clause(self, condition=None):
        return condition

    def orderby_clause(self, *args):
        return args

    def order_column(self, *args):
        return args

    def condition(self, *args):
        return args

    def logical_expr(self, *args):
        return args

    def logical_term(self, *args):
        return args

    def comparison_expr(self, *args):
        return args

    def expression(self, *args):
        return args[0] if len(args) == 1 else args

    def column_name(self, *args):
        if len(args) == 1:
            return args[0]  # No alias present
        elif len(args) == 3:
            return args[0], args[2]  # Alias present, return a tuple
        else:
            return args

    def table_list(self, *args):
        return args

    def join_list(self, *args):
        return args

    def join_expr(self, *args):
        return args

    def join_type(self, *args):
        return args

    def subquery(self, *args):
        return args

    def string(self, value):
        return value.strip("'")

    def table_name(self, *args):
        if len(args) == 1:
            return args[0]  # No alias present
        elif len(args) == 3:
            return args[0], args[2]  # Alias present, return a tuple
        else:
            return args

I don't know if it matters, I also created a little function that displays the final tree nicely:

def format_ast(ast, level=0):
    result = ""
    indent = "  " * level

    if isinstance(ast, tuple):
        for item in ast:
            result += format_ast(item, level + 1)
    elif isinstance(ast, Token):
        result += f"{indent}{ast.type}, Token('{ast.value}')\n"
    elif isinstance(ast, Tree):
        result += f"{indent}Tree({ast.data}), [\n"
        for child in ast.children:
            result += format_ast(child, level + 1)
        result += f"{indent}]\n"
    else:
        result += f"{indent}{ast}\n"

    return result

Here's the statement I'm parsing:

sql_query = 'SELECT ' \
        'name AS alias, ' \
        'COUNT(age) AS age_alias, ' \
        '(SELECT department_name FROM departments WHERE department_id = employees.department_id) ' \
        'FROM employees AS emp, department ' \
        'INNER JOIN departments AS dep ON employees.department_id = departments.id ' \
        'LEFT JOIN other_table AS ot ON other_table.id = employees.table_id ' \
        'WHERE age > 25 ' \
        'GROUP BY age, name ' \
        'HAVING COUNT(age) > 1 ' \
        'ORDER BY name ASC, age DESC;'

The code I'm executing is this:

parser = Lark(select_with_joins_grammar, parser='lalr', transformer=SelectTransformer())
tree = parser.parse(sql_query)

# Print the custom export format
print(format_ast(tree))

The problem is related to the method join_type() of my class SelectTransformer. Somehow *args is always empty, although it should theoretically contain (like defined in the rule) "INNER JOIN" or "LEFT JOIN" or "RIGHT JOIN" or "FULL JOIN". My output looks like this:

  Tree(SELECT statement), [
  Tree(select_statement), [
        NAME, Token('name')
        NAME, Token('alias')
        NAME, Token('COUNT')
        Tree(function_args), [
          NAME, Token('age')
        ]
        NAME, Token('age_alias')
        Tree(select_statement), [
            NAME, Token('department_name')
            NAME, Token('departments')
                  NAME, Token('department_id')
                  OPERATOR, Token('=')
                    NAME, Token('employees')
                    NAME, Token('department_id')
        ]
        NAME, Token('employees')
        NAME, Token('emp')
      NAME, Token('department')
          NAME, Token('departments')
          NAME, Token('dep')
                  NAME, Token('employees')
                  NAME, Token('department_id')
                OPERATOR, Token('=')
                  NAME, Token('departments')
                  NAME, Token('id')
          NAME, Token('other_table')
          NAME, Token('ot')
                  NAME, Token('other_table')
                  NAME, Token('id')
                OPERATOR, Token('=')
                  NAME, Token('employees')
                  NAME, Token('table_id')
            NAME, Token('age')
            OPERATOR, Token('>')
            INT, Token('25')
      NAME, Token('age')
      NAME, Token('name')
            NAME, Token('COUNT')
            Tree(function_args), [
              NAME, Token('age')
            ]
            None
          OPERATOR, Token('>')
          INT, Token('1')
        NAME, Token('name')
        NAME, Token('age')
  ]
]

As you can see, no join type is displayed. I am relatively new to parsing so I don't really know what to try.


Solution

  • The answer is that case matters. The grammar defines a combination of rules and terminals. Rules have lower case names, while terminals have upper-case names. It appears that only terminals will capture their matched content as tokens. (There may be a more formal way of saying this, but this is accurate enough for this discussion.)

    So instead of:

        join_expr: join_type (table_name | subquery) "ON" condition
    
        join_type : "INNER JOIN" | "LEFT JOIN" | "RIGHT JOIN" | "FULL JOIN"
    

    try:

        join_expr: JOIN_TYPE (table_name | subquery) "ON" condition
    
        JOIN_TYPE: "INNER JOIN" | "LEFT JOIN" | "RIGHT JOIN" | "FULL JOIN"
    

    That will yield results that includes content like JOIN_TYPE, Token('INNER JOIN') and JOIN_TYPE, Token('LEFT JOIN').

    Another alternative is to make each join type its own rule, as follows:

        join_type: inner_join | left_join | right_join | full_join
    
        inner_join: "INNER"? "JOIN"
    
        left_join: "LEFT" "OUTER"? "JOIN"
    
        right_join: "RIGHT" "OUTER"? "JOIN"
    
        full_join: "FULL" "OUTER"? "JOIN"
    

    The above will yield distinct nodes instead of tokens in the parsed syntax, presented as Tree(inner_join) and Tree(left_join) in the output. The exact syntax variation used ("JOIN" vs "INNER JOIN") is not captured.

    There may be other approaches for capturing the join type, but this is not my area of expertise.

    Note that the syntax still has a long way to go, such as allowing alternatives like "JOIN", "LEFT OUTER JOIN", "FULL OUTER JOIN", "CROSS APPLY", "OUTER APPLY", and nested joins of the form TableA A LEFT JOIN (TableB B JOIN TableC C ON C.X = B.X) ON B.Y = A.Y. You've taken on quite a challenge.