I have a suggestion for the BQ folks: I think it would be very useful if there was a built-in function that would return the entropy of a column. A column of discrete categories or values would be relatively easy. Thoughts? Does this already exist but I didn't find it?
The simple solution is below - it computes the number of distinct values in the column and takes the base-2 logarithm. This gives the number of bits needed to encode all the different values, which is an upper bound on the column entropy.
SELECT LOG2(COUNT(DISTINCT column)) FROM Table
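To sanity-check this outside BigQuery, here is the same calculation as a minimal Python sketch (the function name and sample data are just illustrative):

```python
import math

# Mirrors SELECT LOG2(COUNT(DISTINCT column)): the number of bits
# needed to encode the distinct values, i.e. the maximum entropy.
def max_entropy(values):
    return math.log2(len(set(values)))

print(max_entropy(["a", "b", "b", "c", "d"]))  # 4 distinct values -> 2.0
```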
However, this doesn't take into account that different values occur with different probabilities. The Shannon entropy formula is -SUM(P(xi) * log2(P(xi))), where P(xi) is the probability of value xi. Here is an example of how to compute that in BigQuery - the Shannon entropy of the year column in the natality table:
select -sum(p * log2(p)) from (
  select ratio_to_report(c) over() p from (
    select year, count(*) c from [publicdata:samples.natality] group by 1))
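The same computation can be sketched in plain Python, which may help verify what the query does (the function name and sample values are illustrative, not from the original):

```python
import math
from collections import Counter

# Shannon entropy -sum(p * log2(p)). The Counter plays the role of
# GROUP BY + COUNT(*), and c / total mirrors ratio_to_report(c) over().
def shannon_entropy(values):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A uniform column reaches the LOG2(COUNT(DISTINCT ...)) upper bound:
print(shannon_entropy([2000, 2001, 2002, 2003]))  # 2.0
# A skewed column falls below it:
print(shannon_entropy([2000, 2000, 2000, 2001]))  # ~0.811
```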
UPDATE: In case the column is not of a discrete type (e.g. FLOAT), it is possible to discretize the values first. The example below shows one approach - it finds the minimum and maximum values, computes the range, and then puts all the FLOAT values (the weight_pounds column in the natality table) into 100 buckets. After that, the problem is reduced to computing the entropy of INTEGER values.
select discrete_weight, count(*) from (
  select
    cast((weight_pounds - min_weight) * 100 / range_weight as integer)
      as discrete_weight
  from [publicdata:samples.natality] a cross join
    (select
       min(weight_pounds) as min_weight,
       max(weight_pounds) - min(weight_pounds) as range_weight
     from [publicdata:samples.natality]) b)
group by 1
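The bucketing step can be sketched in Python as follows - a minimal illustration of the same idea, assuming int() truncation behaves like the legacy CAST(... AS INTEGER), with made-up sample weights:

```python
import math
from collections import Counter

# Mirrors CAST((x - min) * 100 / range AS INTEGER): map each FLOAT
# into one of ~100 integer buckets over the observed [min, max] range.
def discretize(values, buckets=100):
    lo, hi = min(values), max(values)
    rng = hi - lo
    return [int((v - lo) * buckets / rng) for v in values]

# Entropy of the resulting discrete (INTEGER) values.
def shannon_entropy(values):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

weights = [6.5, 7.2, 7.2, 8.1, 5.9, 7.0]  # illustrative weight_pounds values
print(shannon_entropy(discretize(weights)))
```

Note that this per-range bucketing makes the result depend on the observed min and max, so outliers can widen the buckets and lower the measured entropy.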