pythonpython-polarsimputation

Imputing and adding rows to dataframe using polars expressions


I have a dataframe with incomplete values as below - in particular ages with corresponding years, and I would like to make it square (i.e., all three cust_id to have correctly imputed values for age in all three years, i.e. I want to turn this:

df = pl.DataFrame({
    "cust_id": [1, 2 ,2, 2, 3, 3],
    "year": [2000,1999,2000,2001,1999,2001],
    "cust_age": [21,31,32,33,44,46]
     
})

into this:

df = pl.DataFrame({
    "cust_id": [1, 1, 1, 2 ,2, 2, 3, 3, 3 ],
    "year": [1999,2000,2001,1999,2000,2001,1999,2000,2001],
    "cust_age": [20,21,22,31,32,33,44,45,46]
     
})

I know I can iteratively create new dataframes and then join or shift values, and potentially use apply and/or eval_map but I was hoping to find something quick and idiomatic to polars. Thanks in advance for your help!


Solution

  • Here is a possible approach

    
    # create all (year, cust_id) combinations
    index = df.select("year").unique().join(df.select("cust_id").unique(), how="cross")
    
    # compute the birth year of each customer as an expression
    birth_year = pl.col("year") - pl.col("cust_age")
    birth_year_filled = pl.coalesce(birth_year, birth_year.drop_nulls().first()).over("cust_id")
    
    # use it to fill the missing customer ages 
    res = (
        index.join(df, on=["year", "cust_id"], how="left")
        .select(
            "cust_id",
            "year",
            (pl.col("year") - birth_year_filled).alias("cust_age")
        )
        .sort("cust_id", "year") # optional
    )
    

    Output:

    >>> res
    
    shape: (9, 3)
    ┌─────────┬──────┬──────────┐
    │ cust_id ┆ year ┆ cust_age │
    │ ---     ┆ ---  ┆ ---      │
    │ i64     ┆ i64  ┆ i64      │
    ╞═════════╪══════╪══════════╡
    │ 1       ┆ 1999 ┆ 20       │
    │ 1       ┆ 2000 ┆ 21       │
    │ 1       ┆ 2001 ┆ 22       │
    │ 2       ┆ 1999 ┆ 31       │
    │ 2       ┆ 2000 ┆ 32       │
    │ 2       ┆ 2001 ┆ 33       │
    │ 3       ┆ 1999 ┆ 44       │
    │ 3       ┆ 2000 ┆ 45       │
    │ 3       ┆ 2001 ┆ 46       │
    └─────────┴──────┴──────────┘