hadoophivehiveqlapache-hive

Hive query counts of fields where fields are populated


I have a huge Hive table consisting of ten product fields, date fields for the purchases, and an identifier. The product fields are named like prod1, prod2, ... , prod10 and refer to the last ten products purchased. For most IDs, we don't have purchase history all the way back to ten products.

I'd like to construct a distribution of population rates for each of the prod<X> fields, to show the breakdown of purchase history across the entire dataset.

Currently, I'm running a bash script that runs ten consecutive queries against the table like:

hive -e "select count(1) from db.tbl where prod<X> != '';"

... and saving the output to a file. This seems clunky and inefficient. Is there a better way to specify Hive counts on a range of fields with a range of field conditions? I've tried to come up with a strategy using groupby or even mapping a range of fields, but can't quite wrap my head around specifying the != '' condition for each field.

Thanks in advance for any direction.


Solution

  • select id,
    sum(case when prod1='' then 0 else 1 end),
    sum(case when prod2='' then 0 else 1 end),
    sum(case when prod3='' then 0 else 1 end),
    sum(case when prod4='' then 0 else 1 end),
    sum(case when prod5='' then 0 else 1 end),
    sum(case when prod6='' then 0 else 1 end),
    sum(case when prod7='' then 0 else 1 end),
    sum(case when prod8='' then 0 else 1 end),
    sum(case when prod9='' then 0 else 1 end),
    sum(case when prod10='' then 0 else 1 end)
    from table group by id;