I have a data frame as below:
df = pl.DataFrame({'last_name':[None,'mallesh','bhavik'],
'first_name':['a','b','c'],
'middle_name_or_initial':['aa','bb','cc']})
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╡
│ null ┆ a ┆ aa │
│ mallesh ┆ b ┆ bb │
│ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┘
Here I would like to find an observation which has First and Middle Name not NULL and Last Name is Null.
In this case last_name should take the value from first_name, first_name should take the value from middle_name, and middle_name should be NULL.
expected output will be:
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╡
│ a ┆ aa ┆ null │
│ mallesh ┆ b ┆ bb │
│ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┘
I'm trying with this command:
df.with_columns(
pl.when((pl.col('first_name').is_not_null()) & (pl.col('middle_name_or_initial').is_not_null()) & (pl.col('last_name').is_null()))
.then(pl.col('first_name').alias('last_name')).otherwise(pl.col('last_name').alias('first_name')),
pl.when((pl.col('first_name').is_not_null()) & (pl.col('middle_name_or_initial').is_not_null()) & (pl.col('last_name').is_null()))
.then(pl.col('middle_name_or_initial').alias('first_name')).alias('middle_name_or_initial')
)
But it is not producing the expected output.
You can actually change the values of multiple columns within a single when/then/otherwise
statement.
name_cols = "last_name", "first_name", "middle_name_or_initial",
(
df.with_columns(
pl.when(
pl.col("first_name").is_not_null(),
pl.col("middle_name_or_initial").is_not_null(),
pl.col("last_name").is_null()
)
.then(pl.struct(
last_name = "first_name",
first_name = "middle_name_or_initial",
middle_name_or_initial = None
))
.otherwise(pl.struct(name_cols))
.struct.unnest()
)
)
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╡
│ a ┆ aa ┆ null │
│ mallesh ┆ b ┆ bb │
│ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┘
To change the values of multiple columns within a single when/then/otherwise statement, we can use Structs.
We use pl.struct()
to create multiple fields (something as multiple sub-columns) within a single column.
df.select(
then=pl.struct(
last_name = pl.col("first_name"),
first_name = pl.col("middle_name_or_initial"),
middle_name_or_initial = pl.lit(None),
),
otherwise = pl.struct(name_cols),
)
shape: (3, 2)
┌─────────────────┬──────────────────────┐
│ then ┆ otherwise │
│ --- ┆ --- │
│ struct[3] ┆ struct[3] │
╞═════════════════╪══════════════════════╡
│ {"a","aa",null} ┆ {null,"a","aa"} │
│ {"b","bb",null} ┆ {"mallesh","b","bb"} │
│ {"c","cc",null} ┆ {"bhavik","c","cc"} │
└─────────────────┴──────────────────────┘
.struct.unnest()
turns each field of the Struct into its own column and thereby replaces existing columns with the fields of the Struct if existing columns and fields of the Struct have the same name.