Is there a Polars-native way to optimize the apply-lambda approach used in the post "Remove substring from column based on another column"?
In the Polars DataFrame below, how can we remove the "_" + sub substring from origin, where sub is the value in the sub column of the same row?
import polars as pl

# Assign the frame to df so the snippets below can refer to it
df = pl.DataFrame(
    {"origin": ["id1_COUNTRY", "id2_NAME"],
     "sub": ["COUNTRY", "NAME"]}
)
print(df)
shape: (2, 2)
┌─────────────┬─────────┐
│ origin      ┆ sub     │
│ ---         ┆ ---     │
│ str         ┆ str     │
╞═════════════╪═════════╡
│ id1_COUNTRY ┆ COUNTRY │
│ id2_NAME    ┆ NAME    │
└─────────────┴─────────┘
The expected output should look like:
shape: (2, 3)
┌─────────────┬─────────┬─────┐
│ origin      ┆ sub     ┆ out │
│ ---         ┆ ---     ┆ --- │
│ str         ┆ str     ┆ str │
╞═════════════╪═════════╪═════╡
│ id1_COUNTRY ┆ COUNTRY ┆ id1 │
│ id2_NAME    ┆ NAME    ┆ id2 │
└─────────────┴─────────┴─────┘
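In the given example, the substring to remove is always at the end of origin, so stripping the suffix is enough. .str.strip_suffix() accepts an expression, so the suffix can be built from the sub column:
df.with_columns(
    out = pl.col("origin").str.strip_suffix("_" + pl.col("sub"))
)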
shape: (2, 3)
┌─────────────┬─────────┬─────┐
│ origin      ┆ sub     ┆ out │
│ ---         ┆ ---     ┆ --- │
│ str         ┆ str     ┆ str │
╞═════════════╪═════════╪═════╡
│ id1_COUNTRY ┆ COUNTRY ┆ id1 │
│ id2_NAME    ┆ NAME    ┆ id2 │
└─────────────┴─────────┴─────┘
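For a general substring replacement, where the match is not necessarily at the end, .str.replace_many() can be used. Here "_other" is appended first so that the substring is no longer a suffix:
df.with_columns(
    out = (pl.col("origin") + "_other")
        .str.replace_many("_" + pl.col("sub"), "")
)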
shape: (2, 3)
┌─────────────┬─────────┬───────────┐
│ origin      ┆ sub     ┆ out       │
│ ---         ┆ ---     ┆ ---       │
│ str         ┆ str     ┆ str       │
╞═════════════╪═════════╪═══════════╡
│ id1_COUNTRY ┆ COUNTRY ┆ id1_other │
│ id2_NAME    ┆ NAME    ┆ id2_other │
└─────────────┴─────────┴───────────┘