If I had the following toy data:
import polars as pl
data = [
{"name_full": "McCartney, Paul"},
{"name_last": "Lennon", "name_first": "John"},
{"name_full": "Starr, Ringo"},
{"name_last": "Harrison", "name_first": "George"}
]
df = pl.DataFrame(data)
df
shape: (4, 3)
┌─────────────────┬───────────┬────────────┐
│ name_full ┆ name_last ┆ name_first │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════════════╪═══════════╪════════════╡
│ McCartney, Paul ┆ null ┆ null │
│ null ┆ Lennon ┆ John │
│ Starr, Ringo ┆ null ┆ null │
│ null ┆ Harrison ┆ George │
└─────────────────┴───────────┴────────────┘
Is there a recommended way in Polars to split the full name data into multiple columns without needing a separate pl.when
for each target column. It seems this can be achieved with a struct
, but I'm failing to come up with the correct logic/syntax.
I am able to achieve the desired results with the following, but this is tough to scale up for more complex data due to needing a separate pl.when
for each column where data needs to land.
(
df.with_columns(
pl.col("name_full").str.split(",").list.eval(pl.element().str.strip_chars()).alias("name_parts")
).with_columns(
pl.when(pl.col("name_last").is_null())
.then(pl.col("name_parts").list.get(0, null_on_oob=True))
.otherwise(pl.col("name_last")).alias("name_last"),
pl.when(pl.col("name_first").is_null())
.then(pl.col("name_parts").list.get(1, null_on_oob=True))
.otherwise(pl.col("name_first")).alias("name_first")
).select(pl.all().exclude("name_full", "name_parts"))
)
shape: (4, 2)
┌───────────┬────────────┐
│ name_last ┆ name_first │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════╪════════════╡
│ McCartney ┆ Paul │
│ Lennon ┆ John │
│ Starr ┆ Ringo │
│ Harrison ┆ George │
└───────────┴────────────┘
If your input data name format of name_full
is a string with a comma, then you need to keep your name_parts
column to separate the first and last name.
For the pl.when
logic, one of the solutions is using coalesce to shorten the logic. It keeps the first non-null value.
df.with_columns(
pl.col("name_full").str.split(",").list.eval(pl.element().str.strip_chars()).alias("name_parts")
).select(
pl.coalesce(pl.col("name_last"), pl.col("name_parts").list.get(0, null_on_oob=True)).alias("name_last"),
pl.coalesce(pl.col("name_first"), pl.col("name_parts").list.get(1, null_on_oob=True)).alias("name_first")
)
shape: (4, 2)
┌───────────┬────────────┐
│ name_last ┆ name_first │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════╪════════════╡
│ McCartney ┆ Paul │
│ Lennon ┆ John │
│ Starr ┆ Ringo │
│ Harrison ┆ George │
└───────────┴────────────┘