sqlregexparsinggoogle-bigquerydata-extraction

How to extract the table name from a CREATE/UPDATE/INSERT statement in an SQL query?


I am trying to parse the table being created, inserted into or updated from the following sql queries stored in a table column.

Let's call the table column query. Following is some sample data to demonstrate variations in how the data could look like.

with sample_data as (
  select 1 as id, 'CREATE TABLE tbl1 ...' as query union all
  select 2 as id, 'CREATE OR REPLACE TABLE tbl1 ...' as query union all
  select 3 as id, 'DROP TABLE IF EXISTS tbl1; CREATE TABLE tbl1 ...' as query union all
  select 4 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
  select 5 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
  select 6 as id, 'UPDATE tbl3 SET col1 = ...' as query union all
  select 7 as id, '/*some garbage comments*/ UPDATE tbl3 SET col1 = ...' as query union all  
  select 8 as id, 'DELETE tbl4 ...' as query
),

Following are the formats of the queries (we are trying to extract table_name ):

#1 some optional statements like drop table CREATE some comments or optional statement like OR REPLACE TABLE table_name everything else

#2 some optional statements like drop table INSERT some comments INTO some comments table_name

#3 some optional statements like drop table UPDATE some comments table_name everything else


Solution

  • TL;DR

    (?:(?:CREATE(?:(?:\s*\/\*.*\*\/\s*?)*\s+OR(?:\s*\/\*.*\*\/\s*?)*\s+REPLACE)?|DROP)(?:\s*\/\*.*\*\/\s*?)*\s+TABLE(?:(?:\s*\/\*.*\*\/\s*?)*\s+IF(?:\s*\/\*.*\*\/\s*?)*\s+EXISTS)?|UPDATE|DELETE|INSERT(?:\s*\/\*.*\*\/\s*?)*\s+INTO)(?:\s*\/\*.*\*\/\s*?)*\s+([^\s\/*;]+)

    "Show your working"...

    To construct a suitable regex, let's start with the following relatively simple/readable version:

    ((CREATE( OR REPLACE)?|DROP) TABLE( IF EXISTS)?|UPDATE|DELETE|INSERT INTO) ([^\s\/*;]+)

    All the spaces above could be replaced with "at least one whitespace character", i.e. \s+. But we also need to allow comments. For a comment that looks like /*anything*/ the regex looks like \/\*.*\*\/ (where the comment characters are escaped with \ and "anything" is the .* in the middle). Given there could be multiple such comments, optionally separated by whitespace, we end up with (\s*\/\*.*\*\/\s*?)*\s+. Plugging this in everywhere there was a space gives:

    ((CREATE((\s*\/\*.*\*\/\s*?)*\s+OR(\s*\/\*.*\*\/\s*?)*\s+REPLACE)?|DROP)(\s*\/\*.*\*\/\s*?)*\s+TABLE((\s*\/\*.*\*\/\s*?)*\s+IF(\s*\/\*.*\*\/\s*?)*\s+EXISTS)?|UPDATE|DELETE|INSERT(\s*\/\*.*\*\/\s*?)*\s+INTO)(\s*\/\*.*\*\/\s*?)*\s+([^\s\/*;]+)

    One further refinement needs to be made: Bracketed expressions have been used for choices, e.g. (CHOICE1|CHOICE2). But this syntax includes them as capturing groups. Actually we only require one capturing group for the table name so we can exclude all the other capturing groups via ?:, e.g. (?:CHOICE1|CHOICE2). This produces the regex given above under "TL;DR".

    Online Regex Demo

    Here's a demo of it working with your examples: Regex101 demo

    SQL

    The Google BigQuery documentation for REGEXP_EXTRACT says it will return the substring matched by the capturing group. So I'd expect something like this to work:

    with sample_data as (
      select 1 as id, 'CREATE TABLE tbl1 ...' as query union all
      select 2 as id, 'CREATE OR REPLACE TABLE tbl1 ...' as query union all
      select 3 as id, 'DROP TABLE IF EXISTS tbl1; CREATE TABLE tbl1 ...' as query union all
      select 4 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
      select 5 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
      select 6 as id, 'UPDATE tbl3 SET col1 = ...' as query union all
      select 7 as id, '/*some garbage comments*/ UPDATE tbl3 SET col1 = ...' as query union all  
      select 8 as id, 'DELETE tbl4 ...' as query
    )
    
    SELECT
      *, REGEXP_EXTRACT(query, r"(?:(?:CREATE(?:(?:\s*\/\*.*\*\/\s*?)*\s+OR(?:\s*\/\*.*\*\/\s*?)*\s+REPLACE)?|DROP)(?:\s*\/\*.*\*\/\s*?)*\s+TABLE(?:(?:\s*\/\*.*\*\/\s*?)*\s+IF(?:\s*\/\*.*\*\/\s*?)*\s+EXISTS)?|UPDATE|DELETE|INSERT(?:\s*\/\*.*\*\/\s*?)*\s+INTO)(?:\s*\/\*.*\*\/\s*?)*\s+([^\s\/*;]+)") AS table_name
    FROM sample_data;
    

    (The above is untested so please let me know in the comments if there are any issues.)