I have a huge Hive table consisting of ten product fields, date fields for the purchases, and an identifier. The product fields are named like prod1
, prod2
, ... , prod10
and refer to the last ten products purchased. For most IDs, we don't have purchase history all the way back to ten products.
I'd like to construct a distribution of population rates for each of the prod<X>
fields, to show the breakdown of purchase history across the entire dataset.
Currently, I'm running a bash script that runs ten consecutive queries against the table like:
hive -e "select count(1) from db.tbl where prod<X> != '';"
... and saving the output to a file. This seems clunky and inefficient. Is there a better way to specify Hive counts on a range of fields with a range of field conditions? I've tried to come up with a strategy using groupby or even mapping a range of fields, but can't quite wrap my head around specifying the != ''
condition for each field.
Thanks in advance for any direction.
select id,
sum(case when prod1='' then 0 else 1 end),
sum(case when prod2='' then 0 else 1 end),
sum(case when prod3='' then 0 else 1 end),
sum(case when prod4='' then 0 else 1 end),
sum(case when prod5='' then 0 else 1 end),
sum(case when prod6='' then 0 else 1 end),
sum(case when prod7='' then 0 else 1 end),
sum(case when prod8='' then 0 else 1 end),
sum(case when prod9='' then 0 else 1 end),
sum(case when prod10='' then 0 else 1 end)
from table group by id;