I have a suggestion for the BQ folks: I think it would be very useful if there was a built-in function that would return the entropy of a column. A column of discrete categories or values would be relatively easy. Thoughts? Does this already exist but I didn't find it?
The simple solution is below - it computes the number of distinct values in the column and takes the base-2 logarithm. This gives the number of bits needed to encode all the different values, which is an upper bound on the column entropy.
SELECT LOG2(COUNT(DISTINCT column)) FROM Table
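To sanity-check this outside BigQuery, here is the same calculation as a minimal Python sketch (the function name and sample data are just illustrative):

```python
import math

# Mirrors SELECT LOG2(COUNT(DISTINCT column)): the number of bits
# needed to encode the distinct values, i.e. the maximum entropy.
def max_entropy(values):
    return math.log2(len(set(values)))

print(max_entropy(["a", "b", "b", "c", "d"]))  # 4 distinct values -> 2.0
```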
However, this doesn't take into account that different values occur with different probabilities. The Shannon entropy formula is -SUM(P(xi) * log2(P(xi))), where P(xi) is the probability of value xi. Here is an example of how to compute that in BigQuery - the Shannon entropy of the year column in the natality table:
select -sum(p * log2(p)) from (
  select ratio_to_report(c) over() p from (
    select year, count(*) c from [publicdata:samples.natality] group by 1))
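The same computation can be sketched in plain Python, which may help verify what the query does (the function name and sample values are illustrative, not from the original):

```python
import math
from collections import Counter

# Shannon entropy -sum(p * log2(p)). The Counter plays the role of
# GROUP BY + COUNT(*), and c / total mirrors ratio_to_report(c) over().
def shannon_entropy(values):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A uniform column reaches the LOG2(COUNT(DISTINCT ...)) upper bound:
print(shannon_entropy([2000, 2001, 2002, 2003]))  # 2.0
# A skewed column falls below it:
print(shannon_entropy([2000, 2000, 2000, 2001]))  # ~0.811
```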
UPDATE: In case the column is not of a discrete type (e.g. FLOAT), it is possible to discretize the values first. The example below shows one approach - it finds the minimum and maximum values, computes the range, and then puts all the FLOAT values (the weight_pounds column in the natality table) into 100 buckets. After that, the problem is reduced to computing the entropy of INTEGER values.
select discrete_weight, count(*) from (
  select
    cast((weight_pounds - min_weight) * 100 / range_weight as integer)
      as discrete_weight
  from [publicdata:samples.natality] a cross join
    (select
       min(weight_pounds) as min_weight,
       max(weight_pounds) - min(weight_pounds) as range_weight
     from [publicdata:samples.natality]) b)
group by 1
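The bucketing step can be sketched in Python as follows - a minimal illustration of the same idea, assuming int() truncation behaves like the legacy CAST(... AS INTEGER), with made-up sample weights:

```python
import math
from collections import Counter

# Mirrors CAST((x - min) * 100 / range AS INTEGER): map each FLOAT
# into one of ~100 integer buckets over the observed [min, max] range.
def discretize(values, buckets=100):
    lo, hi = min(values), max(values)
    rng = hi - lo
    return [int((v - lo) * buckets / rng) for v in values]

# Entropy of the resulting discrete (INTEGER) values.
def shannon_entropy(values):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

weights = [6.5, 7.2, 7.2, 8.1, 5.9, 7.0]  # illustrative weight_pounds values
print(shannon_entropy(discretize(weights)))
```

Note that this per-range bucketing makes the result depend on the observed min and max, so outliers can widen the buckets and lower the measured entropy.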