sqlgoogle-bigquerystatisticsmissing-datacalculation

Is there a simplified SQL query to return the number and percentage of missing values of a table ? (BigQuery)


The BigQuery earthquake public dataset has 47 columns most of each have missing values. I need an output showing a summary with column_name, total_entries, non_missing_entries & percentage_missing as the columns of that table.

Currently I am using the query shown here, repeating the blocks for all the 47 columns:

SELECT
    'id' AS column_name,
    COUNT(id) AS non_missing_entries,
    (COUNT(*) - COUNT(id)) * 100.0 / COUNT(*) AS percentage_missing
FROM
    `youtube-factcheck.earthquake_analysis.earthquakes_copy`

UNION ALL

SELECT
    'flag_tsunami' AS column_name,
    COUNT(flag_tsunami) AS non_missing_entries,
    (COUNT(*) - COUNT(flag_tsunami)) * 100.0 / COUNT(*) AS percentage_missing
FROM
    `youtube-factcheck.earthquake_analysis.earthquakes_copy`

UNION ALL

-- Repeat the above block for other columns
-- ...

Output:

| column_name|non_missing_entries | percentage_missing|
| -----------| -------------------| ------------------|
|flag_tsunami|                1869|  70.20564323290291|
|          id|                6273|                  0|
|         ...|                 ...|                ...|

Is there a SQL to avoid the long tedious work of writing 47 blocks of the same query?


Solution

  • UNPIVOT is your friend. (note I had to change the source as I don't have access to bigquery-public-data.noaa_significant_earthquakes.earthquakes)

    with cte as (
      select column_name,
            count(*) as non_missing_entries
      from (
          select cast(id as string) as id,
                 flag_tsunami,
                 cast(year as string) as year,
                 cast(month as string) as month,
                 cast(day as string) as day,
                 cast(hour as string) as hour,
                 cast(minute as string) as minute,
                 cast(second as string) as second
          from `bigquery-public-data.noaa_significant_earthquakes.earthquakes`)
          unpivot ( value for column_name in (id, flag_tsunami,year,month,day,hour,minute,second))
          group by column_name
    ),
    id_only as (
      select column_name,non_missing_entries
      from cte
      where column_name = 'id'
    )
    select cte.column_name,
          cte.non_missing_entries,
          (id_only.non_missing_entries - cte.non_missing_entries) * 100.0 / id_only.non_missing_entries as percentage_missing
    from cte
    cross join id_only;
    

    it returns this:

    enter image description here

    You'll have to:

    but I think that's better than UNIONing 47 times.