Polars: Remove substring from column based on another column


Is there a Polars-native optimization for the apply-lambda approach used in the post "Remove substring from column based on another column"?

In the Polars DataFrame below, how can we remove the "_" + sub substring from each origin value, where sub is the value of the sub column in the same row?

import polars as pl

df = pl.DataFrame(
    {"origin": ["id1_COUNTRY", "id2_NAME"],
     "sub": ["COUNTRY", "NAME"]}
)

shape: (2, 2)
┌─────────────┬─────────┐
│ origin      ┆ sub     │
│ ---         ┆ ---     │
│ str         ┆ str     │
╞═════════════╪═════════╡
│ id1_COUNTRY ┆ COUNTRY │
│ id2_NAME    ┆ NAME    │
└─────────────┴─────────┘

The expected output should look like:

shape: (2, 3)
┌─────────────┬─────────┬─────┐
│ origin      ┆ sub     ┆ out │
│ ---         ┆ ---     ┆ --- │
│ str         ┆ str     ┆ str │
╞═════════════╪═════════╪═════╡
│ id1_COUNTRY ┆ COUNTRY ┆ id1 │
│ id2_NAME    ┆ NAME    ┆ id2 │
└─────────────┴─────────┴─────┘

Solution

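  • In the given example, the substring to remove is always a suffix, so .str.strip_suffix() is enough. It accepts an expression, so the suffix can be built per row from the sub column.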

    df.with_columns(
        out=pl.col("origin").str.strip_suffix("_" + pl.col("sub"))
    )
    
    shape: (2, 3)
    ┌─────────────┬─────────┬─────┐
    │ origin      ┆ sub     ┆ out │
    │ ---         ┆ ---     ┆ --- │
    │ str         ┆ str     ┆ str │
    ╞═════════════╪═════════╪═════╡
    │ id1_COUNTRY ┆ COUNTRY ┆ id1 │
    │ id2_NAME    ┆ NAME    ┆ id2 │
    └─────────────┴─────────┴─────┘
    

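    For a general substring replacement, where the match is not necessarily at the end of the string, .str.replace_many() can be used. In the example below, "_other" is appended to origin only so that the match falls in the middle of the string.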

    df.with_columns(
        out=(pl.col("origin") + "_other")
            .str.replace_many("_" + pl.col("sub"), "")
    )
    
    shape: (2, 3)
    ┌─────────────┬─────────┬───────────┐
    │ origin      ┆ sub     ┆ out       │
    │ ---         ┆ ---     ┆ ---       │
    │ str         ┆ str     ┆ str       │
    ╞═════════════╪═════════╪═══════════╡
    │ id1_COUNTRY ┆ COUNTRY ┆ id1_other │
    │ id2_NAME    ┆ NAME    ┆ id2_other │
    └─────────────┴─────────┴───────────┘