pythonpython-polars

Removing null values on selected columns only in Polars dataframe


I am trying to remove null values across a list of selected columns. But it seems that I might have got the with_columns operation not right. What's the right approach if you want to operate the removing only on selected columns?

df = pl.DataFrame(
    {
        "id": ["NY", "TK", "FD"], 
        "eat2000": [1, None, 3], 
        "eat2001": [-2, None, 4],
        "eat2002": [None, None, None],
        "eat2003": [-9, None, 8],
        "eat2004": [None, None, 8]
    }
); df
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id  ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str ┆ i64     ┆ i64     ┆ f64     ┆ i64     ┆ i64     │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY  ┆ 1       ┆ -2      ┆ null    ┆ -9      ┆ null    │
│ TK  ┆ null    ┆ null    ┆ null    ┆ null    ┆ null    │
│ FD  ┆ 3       ┆ 4       ┆ null    ┆ 8       ┆ 8       │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
col_list = [word for word in df.columns if word.startswith(("eat"))]
(
    df
    .with_columns(
        pl.col(col_list).filter(~pl.fold(True, lambda acc, s: acc & s.is_null(), pl.all()))
    )
)
# InvalidOperationError: dtype String not supported in 'not' operation

Expected output:

┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id  ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str ┆ i64     ┆ i64     ┆ f64     ┆ i64     ┆ i64     │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY  ┆ 1       ┆ -2      ┆ null    ┆ -9      ┆ null    │
│ FD  ┆ 3       ┆ 4       ┆ null    ┆ 8       ┆ 8       │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Solution

  • df.select(
        ~pl.all_horizontal(pl.col(r'^eat.*$').is_null())
    )
    
    shape: (3, 1)
    ┌───────┐
    │ all   │
    │ ---   │
    │ bool  │
    ╞═══════╡
    │ true  │
    │ false │
    │ true  │
    └───────┘
    

    DataFrame.filter can be used to keep only the true rows:

    df.filter(
        ~pl.all_horizontal(pl.col(r'^eat.*$').is_null())
    )
    
    shape: (2, 6)
    ┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
    │ id  ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
    │ --- ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
    │ str ┆ i64     ┆ i64     ┆ f32     ┆ i64     ┆ i64     │
    ╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
    │ NY  ┆ 1       ┆ -2      ┆ null    ┆ -9      ┆ null    │
    │ FD  ┆ 3       ┆ 4       ┆ null    ┆ 8       ┆ 8       │
    └─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
    

    The ~ in front of the pl.all_horizontal stands for negation. Notice that we didn't need the col_list.

    One caution: the regex expression in the pl.col must start with ^ and end with $. These cannot be omitted, even if the resulting regex expression is otherwise valid.

    Alternately, if you don't like the ~ operator, there is .not_()

    df.filter(
        pl.all_horizontal(pl.col(r'^eat.*$').is_null()).not_()
    )
    

    Or, we can check if there are any non-null values instead:

    df.filter(
        pl.any_horizontal(pl.col(r'^eat.*$').is_not_null())
    )
    

    Other Notes

    As an aside, Polars has other dedicated horizontal functions e.g. min_horizontal, max_horizontal, sum_horizontal

    Edit - using fold

    FYI, here's how to use the fold method, if that is what you'd prefer. Note the use of pl.col with a regex expression.

    df.filter(
        ~pl.fold(True, lambda acc, s: acc & s.is_null(), exprs=pl.col(r'^eat.*$'))
    )
    
    shape: (2, 6)
    ┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
    │ id  ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
    │ --- ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
    │ str ┆ i64     ┆ i64     ┆ null    ┆ i64     ┆ i64     │
    ╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
    │ NY  ┆ 1       ┆ -2      ┆ null    ┆ -9      ┆ null    │
    │ FD  ┆ 3       ┆ 4       ┆ null    ┆ 8       ┆ 8       │
    └─────┴─────────┴─────────┴─────────┴─────────┴─────────┘