I am reading a CSV file and need to normalize the column names as part of a larger chain of operations, so I want to do everything with function chaining.
When I use the recommended name.map function to replace characters in column names, like this:
import polars as pl

df = pl.DataFrame(
    {"A (%)": [1, 2, 3], "B": [4, 5, 6], "C (Euro)": ["abc", "def", "ghi"]}
).with_columns(
    pl.all().name.map(
        lambda c: c.replace(" ", "_")
        .replace("(%)", "pct")
        .replace("(Euro)", "euro")
        .lower()
    )
)
df.head()
I get
shape: (3, 6)
┌───────┬─────┬──────────┬───────┬─────┬────────┐
│ A (%) ┆ B   ┆ C (Euro) ┆ a_pct ┆ b   ┆ c_euro │
│ ---   ┆ --- ┆ ---      ┆ ---   ┆ --- ┆ ---    │
│ i64   ┆ i64 ┆ str      ┆ i64   ┆ i64 ┆ str    │
╞═══════╪═════╪══════════╪═══════╪═════╪════════╡
│ 1     ┆ 4   ┆ "abc"    ┆ 1     ┆ 4   ┆ "abc"  │
│ 2     ┆ 5   ┆ "def"    ┆ 2     ┆ 5   ┆ "def"  │
│ 3     ┆ 6   ┆ "ghi"    ┆ 3     ┆ 6   ┆ "ghi"  │
└───────┴─────┴──────────┴───────┴─────┴────────┘
instead of the expected:
shape: (3, 3)
┌───────┬─────┬────────┐
│ a_pct ┆ b   ┆ c_euro │
│ ---   ┆ --- ┆ ---    │
│ i64   ┆ i64 ┆ str    │
╞═══════╪═════╪════════╡
│ 1     ┆ 4   ┆ "abc"  │
│ 2     ┆ 5   ┆ "def"  │
│ 3     ┆ 6   ┆ "ghi"  │
└───────┴─────┴────────┘
How can I replace specific chars in existing column names with function chaining without creating new columns?
You could simply replace the DataFrame.with_columns() method with DataFrame.select():
df = pl.DataFrame(
    {"A (%)": [1, 2, 3], "B": [4, 5, 6], "C (Euro)": ["abc", "def", "ghi"]}
).select(
    pl.all().name.map(
        lambda c: c.replace(" ", "_")
        .replace("(%)", "pct")
        .replace("(Euro)", "euro")
        .lower()
    )
)
┌───────┬─────┬────────┐
│ a_pct ┆ b   ┆ c_euro │
│ ---   ┆ --- ┆ ---    │
│ i64   ┆ i64 ┆ str    │
╞═══════╪═════╪════════╡
│ 1     ┆ 4   ┆ abc    │
│ 2     ┆ 5   ┆ def    │
│ 3     ┆ 6   ┆ ghi    │
└───────┴─────┴────────┘
It is worth noting (as Dean MacGregor mentioned in the comments) that DataFrame.with_columns() always adds columns to the dataframe.
If a new column has the same name as an existing one, it replaces the original; otherwise it is appended alongside the existing columns, which is what happened here because every mapped name was new. The documentation says as much:
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
DataFrame.select(), on the other hand, returns only the columns you select, so the result contains just the renamed columns.
Additionally, if you just want to rename all the columns, it's probably more natural to use DataFrame.rename() instead:
...
.rename(
    lambda c: c.replace(" ", "_")
    .replace("(%)", "pct")
    .replace("(Euro)", "euro")
    .lower()
)
┌───────┬─────┬────────┐
│ a_pct ┆ b   ┆ c_euro │
│ ---   ┆ --- ┆ ---    │
│ i64   ┆ i64 ┆ str    │
╞═══════╪═════╪════════╡
│ 1     ┆ 4   ┆ abc    │
│ 2     ┆ 5   ┆ def    │
│ 3     ┆ 6   ┆ ghi    │
└───────┴─────┴────────┘