google-bigquery

Can I get help resolving an error while porting code from SQL Server to BQ? Query error: Scalar subquery produced more than one element at [25:12]


I found some cool code on here that has been able to give me some good information about my data in tables in SQL Server. I love how this goes through each column of the specified table to give me the information I'm looking for.

SET NOCOUNT ON

DECLARE @Schema NVARCHAR(100) = 'dbo'
DECLARE @Table NVARCHAR(100) = 'myTable'
DECLARE @sql NVARCHAR(MAX) =''

IF OBJECT_ID ('tempdb..#TablePopulation') IS NOT NULL 
    DROP TABLE #TablePopulation

CREATE TABLE #TablePopulation 
(
    TableName sysname, 
    ColumnName sysname, 
    ColumnPosition int,
    NullCount numeric(9, 2), 
    NonNullCount numeric(9, 2), 
    TotalRows numeric(9, 2), 
    PercentNull numeric(9, 2), 
    PercentNotNull numeric(9, 2)
)

SELECT @sql += '
 INSERT INTO #TablePopulation
 SELECT  
    '''+TABLE_NAME+''' AS TableName
    ,'''+COLUMN_NAME+''' AS ColumnName
    ,'''+CONVERT(VARCHAR(5),ORDINAL_POSITION)+''' AS ColumnPosition
    ,SUM(CASE WHEN '+COLUMN_NAME+' IS NULL THEN 1 ELSE 0 END) AS CountNulls 
    ,COUNT(' +COLUMN_NAME+') AS CountnonNulls
    ,COUNT(*) AS TotalRows
    ,SUM(CASE WHEN '+COLUMN_NAME+' IS NULL THEN 1 ELSE 0 END) / CASE WHEN COUNT(*) <> 0 THEN COUNT(*) ELSE 0.1 END * 100 AS PercentNull
    ,COUNT(' + COLUMN_NAME + ') / CASE WHEN COUNT(*) <> 0 THEN COUNT(*) ELSE 0.1 END * 100 AS PercentNotNull
    FROM '+QUOTENAME(TABLE_SCHEMA)+'.'+QUOTENAME(TABLE_NAME)+';'+ CHAR(10)
 FROM INFORMATION_SCHEMA.COLUMNS
 WHERE TABLE_SCHEMA = @Schema
 AND TABLE_NAME = @Table

--INSERT INTO #Nulls 
EXEC sp_executesql @sql

SELECT * 
FROM #TablePopulation

DROP TABLE #TablePopulation

I'm trying to port this code to BigQuery so I can get similar data for the information we have moved to it.

I'm not nearly as knowledgeable about BQ SQL as T-SQL, and I'm running into an error:

Query error: Scalar subquery produced more than one element at [25:12]

My BigQuery code looks like this:

DECLARE TableName STRING DEFAULT 'myTable';
DECLARE sql STRING DEFAULT '';

SET @@dataset_project_id = 'myProject';
SET @@dataset_id = 'myDataset';

DROP TABLE IF EXISTS TablePopulation;

CREATE TEMP TABLE TablePopulation 
(
    tableName STRING,
    columnName STRING,
    columnPosition INT64,
    nullCount numeric(9,2),
    notNullCount numeric(9,2),
    totalRows numeric(9,2),
    percentNull numeric(9,2),
    percentNotNull numeric(9,2)
);

SET sql = (
    SELECT
        'INSERT INTO TablePopulation SELECT '
        || table_name || ' AS tableName,'
        || column_name || ' AS columnName, '
        || ordinal_position || ' AS columnPosition, '
        || 'SUM(CASE WHEN ' || column_name || ' IS NULL THEN 1 ELSE 0 END) AS nullCount, '
        || 'COUNT(' || column_name || ') AS notNullCount,'
        || 'COUNT(*) AS totalRows,'
        || 'SAFE_DIVIDE(SUM(CASE WHEN ' || column_name
            || ' IS NULL THEN 1 ELSE 0 END), COUNT(*)) * 100 AS percentNull,'
        || 'SAFE_DIVIDE(SUM(CASE WHEN ' || column_name
            || ' IS NOT NULL THEN 1 ELSE 0 END), COUNT(*)) * 100 AS percentNotNull'
        || 'FROM ' || table_name || ';'
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE table_name = TableName
);

--INSERT INTO #Nulls 
EXECUTE IMMEDIATE sql;

SELECT * 
FROM TablePopulation;

DROP TABLE TablePopulation;

Any suggestions on how to resolve this so the query returns results for a table would be greatly appreciated.


Solution

  • The problem is that INFORMATION_SCHEMA.COLUMNS yields more than one row for a given table name, but your SET sql = (SELECT expression FROM INFORMATION_SCHEMA.COLUMNS WHERE table_name = TableName); statement can only handle a single (scalar) result.

    The original query used a SELECT @sql += expression FROM ... hack to effectively concatenate the multiple generated statements into a single dynamic SQL string. You will need to do the same here.
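
    In miniature, the difference between the failing scalar assignment and an aggregated one looks like this; a toy sketch, with a literal array standing in for the INFORMATION_SCHEMA.COLUMNS rows:

    DECLARE x STRING;

    -- Fails with "Scalar subquery produced more than one element",
    -- because the subquery returns one row per array element:
    -- SET x = (SELECT val FROM UNNEST(['a', 'b']) AS val);

    -- Succeeds: STRING_AGG collapses the rows into the single value 'a,b':
    SET x = (SELECT STRING_AGG(val, ',') FROM UNNEST(['a', 'b']) AS val);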

    In SQL Server, the STRING_AGG() aggregate function is now the preferred way to do this kind of concatenation, and BigQuery has STRING_AGG() as well. I am not experienced with BigQuery, but I believe the following may do the trick:

    SET sql = (
        SELECT STRING_AGG(
            'INSERT INTO TablePopulation SELECT '
            || table_name || ' AS tableName,'
            || column_name || ' AS columnName, '
            || ordinal_position || ' AS columnPosition, '
            || 'SUM(CASE WHEN ' || column_name || ' IS NULL THEN 1 ELSE 0 END) AS nullCount, '
            || 'COUNT(' || column_name || ') AS notNullCount,'
            || 'COUNT(*) AS totalRows,'
            || 'SAFE_DIVIDE(SUM(CASE WHEN ' || column_name
                || ' IS NULL THEN 1 ELSE 0 END), COUNT(*)) * 100 AS percentNull,'
            || 'SAFE_DIVIDE(SUM(CASE WHEN ' || column_name
                || ' IS NOT NULL THEN 1 ELSE 0 END), COUNT(*)) * 100 AS percentNotNull'
            || ' FROM ' || table_name || ';'
        , '')
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE table_name = TableName
    );
    

    This should produce a concatenated series of SQL statements like INSERT ...;INSERT ...;INSERT ...;. If you prefer line breaks between the statements (for printing and debugging), you can replace the '' delimiter with a newline - in a BigQuery string literal that is '\n'.
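
    For instance, this self-contained toy query (the INSERT fragments are just placeholder strings) shows the effect of the newline delimiter:

    SELECT STRING_AGG(stmt, '\n') AS combined
    FROM UNNEST([
        'INSERT INTO TablePopulation SELECT ... ;',
        'INSERT INTO TablePopulation SELECT ... ;'
    ]) AS stmt;
    -- Returns the two placeholder statements on separate lines instead of run together.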

    Addendum: On further examination, your translated code has additional problems:

    1. The selected table_name and column_name values in the generated SQL need to be quoted as string literals, so that you have SELECT "MyTable" AS tableName, "MyColumn" as columnName, ... instead of SELECT MyTable AS tableName, MyColumn as columnName, ....
    2. The other table_name and column_name usages should also be quoted to handle names that contain spaces or other special characters or are otherwise not valid as bare identifiers. Surrounding the names with back-ticks (`) is the proper BigQuery syntax for quoting identifiers, though it is not a complete safeguard (see the next point). A sketch covering points 1 and 2 follows this list.
    3. You may also need to guard against table or column names that themselves contain quotes, back-ticks, or other characters that could still cause syntax errors in the generated SQL, or could even open the door to SQL injection. I suggest that you search for "BigQuery SQL injection" for best practices.
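
    Below is a hedged sketch addressing points 1 and 2; it has not been run against your dataset. The selected table_name and column_name values are emitted as quoted string literals, the identifiers are wrapped in back-ticks, ordinal_position is cast to STRING so the concatenation operator gets string operands, and the percentage expressions are cast to NUMERIC to match the temp table's column types. It also builds one INSERT ... SELECT ... UNION ALL statement rather than several semicolon-separated ones, so EXECUTE IMMEDIATE only ever receives a single statement. Point 3 is deliberately not addressed: names containing back-ticks or double quotes would still break the generated SQL.

    SET sql = (
        SELECT 'INSERT INTO TablePopulation\n' || STRING_AGG(
            'SELECT '
            || '"' || table_name || '" AS tableName, '      -- value quoted as a string literal (point 1)
            || '"' || column_name || '" AS columnName, '
            || CAST(ordinal_position AS STRING) || ' AS columnPosition, '
            || 'SUM(CASE WHEN `' || column_name || '` IS NULL THEN 1 ELSE 0 END) AS nullCount, '    -- identifier back-ticked (point 2)
            || 'COUNT(`' || column_name || '`) AS notNullCount, '
            || 'COUNT(*) AS totalRows, '
            || 'CAST(SAFE_DIVIDE(SUM(CASE WHEN `' || column_name || '` IS NULL THEN 1 ELSE 0 END), COUNT(*)) * 100 AS NUMERIC) AS percentNull, '
            || 'CAST(SAFE_DIVIDE(COUNT(`' || column_name || '`), COUNT(*)) * 100 AS NUMERIC) AS percentNotNull '
            || 'FROM `' || table_name || '`'
        , '\nUNION ALL\n')
        FROM INFORMATION_SCHEMA.COLUMNS
        -- If this matches every table in the dataset, rename the TableName variable
        -- (for example, target_table): a column and a variable sharing a name can
        -- resolve to the column.
        WHERE table_name = TableName
    );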