pythonpython-polars

Can I conditionally extract data into multiple columns with a single `when`?


If I had the following toy data:

import polars as pl
data = [
    {"name_full": "McCartney, Paul"},
    {"name_last": "Lennon", "name_first": "John"},
    {"name_full": "Starr, Ringo"},
    {"name_last": "Harrison", "name_first": "George"}
]
df = pl.DataFrame(data)
df
shape: (4, 3)
┌─────────────────┬───────────┬────────────┐
│ name_full       ┆ name_last ┆ name_first │
│ ---             ┆ ---       ┆ ---        │
│ str             ┆ str       ┆ str        │
╞═════════════════╪═══════════╪════════════╡
│ McCartney, Paul ┆ null      ┆ null       │
│ null            ┆ Lennon    ┆ John       │
│ Starr, Ringo    ┆ null      ┆ null       │
│ null            ┆ Harrison  ┆ George     │
└─────────────────┴───────────┴────────────┘

Is there a recommended way in Polars to split the full name data into multiple columns without needing a separate pl.when for each target column. It seems this can be achieved with a struct, but I'm failing to come up with the correct logic/syntax.

I am able to achieve the desired results with the following, but this is tough to scale up for more complex data due to needing a separate pl.when for each column where data needs to land.

(
    df.with_columns(
        pl.col("name_full").str.split(",").list.eval(pl.element().str.strip_chars()).alias("name_parts")
    ).with_columns(
        pl.when(pl.col("name_last").is_null())
        .then(pl.col("name_parts").list.get(0, null_on_oob=True))
        .otherwise(pl.col("name_last")).alias("name_last"),
        pl.when(pl.col("name_first").is_null())
        .then(pl.col("name_parts").list.get(1, null_on_oob=True))
        .otherwise(pl.col("name_first")).alias("name_first")
    ).select(pl.all().exclude("name_full", "name_parts"))
)
shape: (4, 2)
┌───────────┬────────────┐
│ name_last ┆ name_first │
│ ---       ┆ ---        │
│ str       ┆ str        │
╞═══════════╪════════════╡
│ McCartney ┆ Paul       │
│ Lennon    ┆ John       │
│ Starr     ┆ Ringo      │
│ Harrison  ┆ George     │
└───────────┴────────────┘

Solution

  • If your input data name format of name_full is a string with a comma, then you need to keep your name_parts column to separate the first and last name.

    For the pl.when logic, one of the solutions is using coalesce to shorten the logic. It keeps the first non-null value.

    df.with_columns(
        pl.col("name_full").str.split(",").list.eval(pl.element().str.strip_chars()).alias("name_parts")
    ).select(
        pl.coalesce(pl.col("name_last"), pl.col("name_parts").list.get(0, null_on_oob=True)).alias("name_last"),
        pl.coalesce(pl.col("name_first"), pl.col("name_parts").list.get(1, null_on_oob=True)).alias("name_first")
    )
    
    shape: (4, 2)
    ┌───────────┬────────────┐
    │ name_last ┆ name_first │
    │ ---       ┆ ---        │
    │ str       ┆ str        │
    ╞═══════════╪════════════╡
    │ McCartney ┆ Paul       │
    │ Lennon    ┆ John       │
    │ Starr     ┆ Ringo      │
    │ Harrison  ┆ George     │
    └───────────┴────────────┘