google-bigqueryexecuteinformation-schema

loop on tables in information schema to find their size and return a report


I would like to find the number of records for a list of tables that start with a common prefix via a SQL query.

In BigQuery I am trying a combination of CONCAT() or FORMAT() along with INFORMATION_SCHEMA.TABLES in FOR loops or WHILE loops that in my mind should eventually be executed with EXECUTE IMMEDIATE but I am not able to setup the final query.

This is what I have:

DECLARE query_string STRING;

FOR record IN (
  select CONCAT(
    "SELECT COUNT(*) AS cnt_rows,",
    "\"", table_name, "\"", " AS t_name ",
    "FROM ", CONCAT("`",table_schema,".",   table_name, "`" )
  ) AS MY_STMNT
FROM `<<my-project-id>>.<<my-dataset>>.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE "my-table-prefix__YYYY-MM-DDT%"
) DO
EXECUTE IMMEDIATE
  -- 1st attempt:
  -- FORMAT( """
  --     %s
  --     """, record.MY_STMNT);

  -- 2nd attempt (this doesn't even execute because of "Syntax error: Unexpected keyword SET at xxx"):
  -- SET query_string = query_string || record.MY_STMNT || " UNION ALL ";
END FOR;

This fails because the results are separate as one for each table. I would Like all of them in one final output result where each row is:

  1. table name as t_name
  2. number of rows as cnt_rows

How do I do that?


Solution

  • The solution is to use a string variable to accumulate SQL statements combined with UNION ALL.

    The issue with my previous attempts was that I wanted to EXECUTE IMMEDIATE at each iteration in the for loop, instead of doing it after the entire loop.

    DECLARE query_string STRING;
    SET query_string = "";
    
    FOR record IN (
      -- dynamically build the SELECT query based on the data in INFORMATION_SCHEMA e.g.:
      -- - the table name
      -- - the number of rows in that table
      SELECT CONCAT(
        "SELECT ",
        "'", table_name, "'", " AS t_name, ",
        "COUNT(*) AS cnt_rows, ",
        "FROM ", CONCAT("`", table_schema, ".", table_name, "`")
      ) AS MY_STMNT
    FROM `<<my-project-id>>.<<my-dataset>>.INFORMATION_SCHEMA.TABLES`
    WHERE table_name LIKE "my-table-prefix__YYYY-MM-DDT%"
    ) DO
      SET query_string = query_string || record.MY_STMNT || " UNION ALL ";
    END FOR;
    
    -- just add a dummy "last row" to make "UNION ALL" work
    SET query_string = query_string || "SELECT 'x' AS t_name, 0 AS cnt_rows";
    -- make sure we rank the output table by the number of records from biggest to smallest
    SET query_string = query_string || " ORDER BY cnt_rows DESC";
    
    -- to inspect the SQL query, just print it
    SELECT query_string;
    
    -- invoke the SQL query
    EXECUTE IMMEDIATE query_string;
    

    Be careful though: if you combine together with UNION ALL multiple complex SQL queries e.g. SELECT, then BigQuery may throw an error like this:

    Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.. at [1:1]

    This of course depends on the content of the SQL query you are building. The procedure above may apply to your case without throwing errors when invoked at scale.