pythonduckdb

"n_unique" aggregation using DuckDB relational API which counts nulls


Similar to "n_unique" aggregation using DuckDB relational API

But, I need to count null values

Say I have

import duckdb

rel = duckdb.sql('select * from values (1, 4), (2, null), (null, null) df(a, b)')
rel
Out[3]: 
┌───────┬───────┐
│   a   │   b   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     4 │
│     2 │  NULL │
│  NULL │  NULL │
└───────┴───────┘

I would like to make a duckdb.Expression which I can use to count the number of unique values including nulls

The solution suggested in the linked question:

def n_unique(column_name: str) -> duckdb.Expression:
    return duckdb.FunctionExpression(
        'array_unique',
        duckdb.FunctionExpression(
            'array_agg',
            duckdb.ColumnExpression(column_name)
        )
    )

is not quite right here, as it skips nulls:

In [39]: rel.aggregate([n_unique('a'), n_unique('b')])
Out[39]:
┌────────────────────────────┬────────────────────────────┐
│ array_unique(array_agg(a)) │ array_unique(array_agg(b)) │
│           uint64           │           uint64           │
├────────────────────────────┼────────────────────────────┤
│                          2 │                          1 │
└────────────────────────────┴────────────────────────────┘

My expected output would be:

In [39]: rel.aggregate([n_unique('a'), n_unique('b')])
Out[39]:
┌────────────────────────────┬────────────────────────────┐
│ array_unique(array_agg(a)) │ array_unique(array_agg(b)) │
│           uint64           │           uint64           │
├────────────────────────────┼────────────────────────────┤
│                          3 │                          2 │
└────────────────────────────┴────────────────────────────┘

How can I achieve that?


Solution

  • If you know the range of your values you can use some out-of-range value as a placeholder to include nulls as well, using CaseExpression(), .isnotnull() and ConstantExpression():

    def n_unique(column_name: str) -> duckdb.Expression:
        return duckdb.FunctionExpression(
            'array_unique',
            duckdb.FunctionExpression(
                'array_agg',
                duckdb.CaseExpression(
                    condition = duckdb.ColumnExpression(column_name).isnotnull(),
                    value = duckdb.ColumnExpression(column_name)
                ).otherwise(
                    duckdb.ConstantExpression(-1)
                )
            )
        ).alias('result')
    
    ┌────────┬────────┐
    │ result │ result │
    │ uint64 │ uint64 │
    ├────────┼────────┤
    │      3 │      2 │
    └────────┴────────┘
    

    Or, without placeholder, something like this:

    def n_unique(column_name: str) -> duckdb.Expression:
        return (
            duckdb.FunctionExpression(
                'array_unique',
                duckdb.FunctionExpression(
                    'array_agg',
                    duckdb.ColumnExpression(column_name)
                )
            ) +
            duckdb.FunctionExpression(
                'max',
                duckdb.CaseExpression(
                    condition = duckdb.ColumnExpression(column_name).isnotnull(),
                    value = duckdb.ConstantExpression(0)
                ).otherwise(
                    duckdb.ConstantExpression(1)
                )
            )
        ).alias('result')
    

    Can't say it's really pretty but it does work.