I have a dataframe with incomplete values as below - in particular ages with corresponding years, and I would like to make it square (i.e., all three cust_id to have correctly imputed values for age in all three years, i.e. I want to turn this:
df = pl.DataFrame({
"cust_id": [1, 2 ,2, 2, 3, 3],
"year": [2000,1999,2000,2001,1999,2001],
"cust_age": [21,31,32,33,44,46]
})
into this:
df = pl.DataFrame({
"cust_id": [1, 1, 1, 2 ,2, 2, 3, 3, 3 ],
"year": [1999,2000,2001,1999,2000,2001,1999,2000,2001],
"cust_age": [20,21,22,31,32,33,44,45,46]
})
I know I can iteratively create new dataframes and then join or shift values, and potentially use apply and/or eval_map but I was hoping to find something quick and idiomatic to polars. Thanks in advance for your help!
Here is a possible approach
# create all (year, cust_id) combinations
index = df.select("year").unique().join(df.select("cust_id").unique(), how="cross")
# compute the birth year of each customer as an expression
birth_year = pl.col("year") - pl.col("cust_age")
birth_year_filled = pl.coalesce(birth_year, birth_year.drop_nulls().first()).over("cust_id")
# use it to fill the missing customer ages
res = (
index.join(df, on=["year", "cust_id"], how="left")
.select(
"cust_id",
"year",
(pl.col("year") - birth_year_filled).alias("cust_age")
)
.sort("cust_id", "year") # optional
)
Output:
>>> res
shape: (9, 3)
┌─────────┬──────┬──────────┐
│ cust_id ┆ year ┆ cust_age │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════════╪══════╪══════════╡
│ 1 ┆ 1999 ┆ 20 │
│ 1 ┆ 2000 ┆ 21 │
│ 1 ┆ 2001 ┆ 22 │
│ 2 ┆ 1999 ┆ 31 │
│ 2 ┆ 2000 ┆ 32 │
│ 2 ┆ 2001 ┆ 33 │
│ 3 ┆ 1999 ┆ 44 │
│ 3 ┆ 2000 ┆ 45 │
│ 3 ┆ 2001 ┆ 46 │
└─────────┴──────┴──────────┘