Tags: sql, amazon-redshift, redash

SQL - Get the percentage of empty rows in every column of a table with both integer and varchar columns?


I have a table with 300+ columns and I want to see if there is a way to get the percentage of empty rows in each column. I wanted to pull the data using Python and then write a function, but with 400,000,000+ rows that is not feasible. So for now I am using sum and case when expressions to get the percentage for each column manually, like so:

select (100.0 * sum(case when A = ' ' or A is null then 1 else 0 end)) / count(*) as A
       ,(100.0 * sum(case when B is null then 1 else 0 end)) / count(*) as B
       ,(100.0 * sum(case when C = ' ' or C is null then 1 else 0 end)) / count(*) as C
       ,(100.0 * sum(case when D = ' ' or D is null then 1 else 0 end)) / count(*) as D
       ,(100.0 * sum(case when E = ' ' or E is null then 1 else 0 end)) / count(*) as E
       ,(100.0 * sum(case when F is null then 1 else 0 end)) / count(*) as F
       ....
from table

I have done this for about 30 columns, but I am trying to see if there is a way to do it across all the columns in the table. Some columns contain integers, so for those I can't use

 case when C = ' ' or C is null then 1 else 0 end

I have to use

 case when C is null then 1 else 0 end
 

Is there a way to keep using this logic, with the case expression changing depending on the column's type, to get the percentages for all columns? Thanks.


Solution

  • My solution for this problem involved using Python and Redshift. I am not sure if this is the best way to do it, but this is what worked for me.

    I first used the pg_get_cols function from @nachospiu's comment to get all the column names:

    import pandas as pd  # rsm.dbengine below is my SQLAlchemy engine pointed at Redshift

    query = '''select col_name
               from pg_get_cols('schema.table')
               cols(view_schema name, view_name name, col_name name, col_type varchar, col_num int)
            '''
    cols = pd.read_sql_query(query, rsm.dbengine)
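
    As a side note: pg_get_cols also returns col_type, so one could instead pull the types up front and build a single type-aware query, scanning the 400M-row table once instead of once per column. I have not benchmarked this; it is a rough sketch, assuming col_type strings such as 'character varying(256)' for varchar columns:

    typed = pd.read_sql_query('''select col_name, col_type
                                 from pg_get_cols('schema.table')
                                 cols(view_schema name, view_name name, col_name name, col_type varchar, col_num int)
                              ''', rsm.dbengine)

    exprs = []
    for _, row in typed.iterrows():
        if row.col_type.startswith('character'):
            # varchar/char columns: count blank strings as well as nulls
            cond = "{0} = ' ' or {0} is null".format(row.col_name)
        else:
            # numeric and other types: only null is meaningful
            cond = "{0} is null".format(row.col_name)
        exprs.append("(100.0 * sum(case when {} then 1 else 0 end)) / count(*) as {}".format(cond, row.col_name))

    # one query, one scan of the table
    pcts = pd.read_sql("select {} from schema.table".format(", ".join(exprs)),
                       rsm.dbengine).iloc[0].to_dict()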
    

    Then, once I had all the column names, what I actually ran was a for loop that queried the percentage for each column, wrapped in a try/except so it would not fail on non-varchar columns (comparing an integer column to ' ' raises an error, which drops execution into the except branch):

    nulls = {}
    for col in cols.col_name:
        try:
            # varchar columns: count blank strings as well as nulls
            que = '''select (100.0 * sum(case when {0} = ' ' or {0} is null then 1 else 0 end)) / count(*) as perc
                     from schema.table
                  '''.format(col)
            nulls[col] = pd.read_sql(que, rsm.dbengine).perc.values[0]
        except Exception:
            # non-varchar columns can't be compared to ' ', so only check for null
            que = '''select (100.0 * sum(case when {0} is null then 1 else 0 end)) / count(*) as perc
                     from schema.table
                  '''.format(col)
            nulls[col] = pd.read_sql(que, rsm.dbengine).perc.values[0]
          
    

    This produced a dictionary mapping each column name to its percentage of empty rows, which I could then inspect.
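
    For easier inspection, the dictionary can be turned into a sorted pandas Series (a small usage example; the cutoff of 20 is arbitrary):

    # columns with the highest share of empty rows first
    print(pd.Series(nulls).sort_values(ascending=False).head(20))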