Similar to "n_unique" aggregation using DuckDB relational API
But, I need to count null values
Say I have
import duckdb
rel = duckdb.sql('select * from values (1, 4), (2, null), (null, null) df(a, b)')
rel
Out[3]:
┌───────┬───────┐
│ a │ b │
│ int32 │ int32 │
├───────┼───────┤
│ 1 │ 4 │
│ 2 │ NULL │
│ NULL │ NULL │
└───────┴───────┘
I would like to make a duckdb.Expression
which I can use to count the number of unique values including nulls
The solution suggested in the linked question:
def n_unique(column_name: str) -> duckdb.Expression:
return duckdb.FunctionExpression(
'array_unique',
duckdb.FunctionExpression(
'array_agg',
duckdb.ColumnExpression(column_name)
)
)
is not quite right here, as it skips nulls:
In [39]: rel.aggregate([n_unique('a'), n_unique('b')])
Out[39]:
┌────────────────────────────┬────────────────────────────┐
│ array_unique(array_agg(a)) │ array_unique(array_agg(b)) │
│ uint64 │ uint64 │
├────────────────────────────┼────────────────────────────┤
│ 2 │ 1 │
└────────────────────────────┴────────────────────────────┘
My expected output would be:
In [39]: rel.aggregate([n_unique('a'), n_unique('b')])
Out[39]:
┌────────────────────────────┬────────────────────────────┐
│ array_unique(array_agg(a)) │ array_unique(array_agg(b)) │
│ uint64 │ uint64 │
├────────────────────────────┼────────────────────────────┤
│ 3 │ 2 │
└────────────────────────────┴────────────────────────────┘
How can I achieve that?
If you know the range of your values you can use some out-of-range value as a placeholder to include nulls as well, using CaseExpression()
, .isnotnull()
and ConstantExpression()
:
def n_unique(column_name: str) -> duckdb.Expression:
return duckdb.FunctionExpression(
'array_unique',
duckdb.FunctionExpression(
'array_agg',
duckdb.CaseExpression(
condition = duckdb.ColumnExpression(column_name).isnotnull(),
value = duckdb.ColumnExpression(column_name)
).otherwise(
duckdb.ConstantExpression(-1)
)
)
).alias('result')
┌────────┬────────┐
│ result │ result │
│ uint64 │ uint64 │
├────────┼────────┤
│ 3 │ 2 │
└────────┴────────┘
Or, without placeholder, something like this:
def n_unique(column_name: str) -> duckdb.Expression:
return (
duckdb.FunctionExpression(
'array_unique',
duckdb.FunctionExpression(
'array_agg',
duckdb.ColumnExpression(column_name)
)
) +
duckdb.FunctionExpression(
'max',
duckdb.CaseExpression(
condition = duckdb.ColumnExpression(column_name).isnotnull(),
value = duckdb.ConstantExpression(0)
).otherwise(
duckdb.ConstantExpression(1)
)
)
).alias('result')
Can't say it's really pretty but it does work.