I have a table with 300+ columns and I want to see if there is a way to get the percentage of empty values in each column. My first thought was to pull the data with Python and write a function, but with 400,000,000+ rows that is not feasible. So for now I am computing the percentage for each column manually with sum and case when expressions, like so:
select (100.0 * sum(case when A = ' ' or A is null then 1 else 0 end)) / count(*) as A
,(100.0 * sum(case when B is null then 1 else 0 end)) / count(*) as B
,(100.0 * sum(case when C = ' ' or C is null then 1 else 0 end)) / count(*) as C
,(100.0 * sum(case when D = ' ' or D is null then 1 else 0 end)) / count(*) as D
,(100.0 * sum(case when E = ' ' or E is null then 1 else 0 end)) / count(*) as E
,(100.0 * sum(case when F is null then 1 else 0 end)) / count(*) as F
....
from table
I have done this for about 30 columns, but I am trying to see if there is a way to do this across all the columns in the table. Some columns contain integers, so I can't use
case when C = ' ' or C is null then 1 else 0 end
I have to use
case when C is null then 1 else 0 end
Is there a way to continue using this logic, with the case statement changing depending on the type of the column, and get the percentages for all columns? Thanks.
My solution for this problem involved using Python and Redshift. I am not sure it is the best way, but it is what worked for me.
I first used the pg_get_cols function from @nachospiu's comment to get all the column names:
import pandas as pd  # rsm.dbengine below is my SQLAlchemy engine for Redshift

# pg_get_cols takes the table name as a string literal, so single quotes.
query = '''select col_name
from pg_get_cols('schema.table')
cols(view_schema name, view_name name, col_name name, col_type varchar, col_num int)
'''
cols = pd.read_sql_query(query, rsm.dbengine)
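Note that the same call already exposes col_type through the column definition list, so the type can be fetched alongside the name. A minimal sketch of that (reusing rsm.dbengine; it assumes Redshift reports varchar columns with a col_type starting with 'character', e.g. 'character varying(256)'):

query = '''select col_name, col_type
from pg_get_cols('schema.table')
cols(view_schema name, view_name name, col_name name, col_type varchar, col_num int)
'''
cols = pd.read_sql_query(query, rsm.dbengine)
# Assumption: text-like columns show up as 'character varying(n)' or 'character(n)'.
text_cols = set(cols.col_name[cols.col_type.str.startswith('character')])

I did not end up needing the type for the loop below, but it is used in the single-query variant at the end.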
Then, once I had all the column names, I looped over them to get the percentage for each column, wrapping the varchar-style query in try/except so the loop would not fail when a column was not of varchar type.
nulls = {}
for col in cols.col_name:
    try:
        # Works for varchar columns: count blanks as well as NULLs.
        que = '''select (100.0 * sum(case when {0} = ' ' or {0} is null then 1 else 0 end)) / count(*) as perc
        from schema.table
        '''.format(col)
        perc = pd.read_sql(que, rsm.dbengine).perc.values[0]
        nulls[col] = perc
    except Exception:
        # The blank-string comparison errors out on non-varchar columns,
        # so fall back to checking NULLs only.
        que = '''select (100.0 * sum(case when {0} is null then 1 else 0 end)) / count(*) as perc
        from schema.table
        '''.format(col)
        perc = pd.read_sql(que, rsm.dbengine).perc.values[0]
        nulls[col] = perc
This produced a dictionary that allowed me to view the percentage of nulls for each column.
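One caveat: each iteration above scans the whole table, so 300+ columns means 300+ passes over 400,000,000+ rows. Because pg_get_cols also returns the type (see the sketch above), the type-dependent case logic from the question can instead be generated for every column at once and run as a single query, one scan total. This is only a sketch under the same assumptions (rsm.dbengine is the connection, text-like col_type values start with 'character', and column names are plain lowercase so they need no quoting):

exprs = []
for name, ctype in zip(cols.col_name, cols.col_type):
    # Text-like columns check for the blank string as well as NULL;
    # every other type checks for NULL only.
    if ctype.startswith('character'):
        cond = "{0} = ' ' or {0} is null".format(name)
    else:
        cond = "{0} is null".format(name)
    exprs.append("(100.0 * sum(case when {} then 1 else 0 end)) / count(*) as {}".format(cond, name))

que = 'select ' + ', '.join(exprs) + ' from schema.table'
nulls = pd.read_sql(que, rsm.dbengine).iloc[0].to_dict()  # one pass, all columns

Either way, pd.Series(nulls).sort_values(ascending=False) gives a quick ranking of the emptiest columns.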