I'm reading some CSV files whose column headers are pretty annoying: they contain stray whitespace (spaces, tabs, etc.).
import polars as pl
csv_file = b'''
A \tB \tC \tD \t E
CD\tE\t300 0\t0\t0
CD\tE\t1071 0\t0\t0
K\tE\t390 0\t0\t0
'''.strip()
I want to read the file, then strip the leading and trailing whitespace (spaces and/or tabs) from the column names. Currently I do:
df = pl.read_csv(csv_file, comment_prefix='#', separator='\t')
df = df.rename(lambda column_name: column_name.strip())
Is this the "polaric" way to do it? I'm not a big fan of lambdas, but if the only other solution is to write a function just for this, I guess I'll stick to lambdas.
Update: Polars 1.35.0
.name.replace() was added in pull/17942; it allows regex replacements on column names.
>>> df.select(pl.all().name.replace(r"^\s+|\s+$", ""))
# shape: (3, 5)
# ┌─────┬─────┬────────┬─────┬─────┐
# │ A   ┆ B   ┆ C      ┆ D   ┆ E   │
# │ --- ┆ --- ┆ ---    ┆ --- ┆ --- │
# │ str ┆ str ┆ str    ┆ i64 ┆ i64 │
# ╞═════╪═════╪════════╪═════╪═════╡
# │ CD  ┆ E   ┆ 300 0  ┆ 0   ┆ 0   │
# │ CD  ┆ E   ┆ 1071 0 ┆ 0   ┆ 0   │
# │ K   ┆ E   ┆ 390 0  ┆ 0   ┆ 0   │
# └─────┴─────┴────────┴─────┴─────┘
If we check the column names before and after reassigning df:
df.columns
# ['A ', 'B ', 'C ', 'D ', ' E ']
df = df.select(pl.all().name.replace(r"^\s+|\s+$", ""))
df.columns
# ['A', 'B', 'C', 'D', 'E']
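To see what the pattern `^\s+|\s+$` matches without needing Polars, here is a minimal sketch using Python's `re` module as a stand-in. This is an assumption: Polars uses its own regex engine, but on a pattern this simple the two should agree. The names in `raw_names` are copied from the `df.columns` output above.

```python
import re

# Matches whitespace at the start of the string OR at the end of the string.
PATTERN = r"^\s+|\s+$"

# Column names as reported by df.columns before the rename.
raw_names = ["A ", "B ", "C ", "D ", " E "]

# re.sub replaces every match, so both ends are trimmed in one pass.
cleaned = [re.sub(PATTERN, "", name) for name in raw_names]
print(cleaned)  # ['A', 'B', 'C', 'D', 'E']

# For plain leading/trailing whitespace this gives the same result as str.strip().
print(cleaned == [name.strip() for name in raw_names])  # True
```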
Original answer
The solution is to use a function as you have shown.
However, in the case of .strip() without arguments it can be simplified slightly.
Another way to write the strip is to call the method on the class itself, str.strip(), passing the string as the argument:
>>> " A ".strip()
# 'A'
>>> str.strip(" A ")
# 'A'
str.strip and the lambda in this case do the same thing:
one = lambda column: column.strip()
two = str.strip
>>> one(" A ")
# 'A'
>>> two(" A ")
# 'A'
df.rename() calls the given function on each column name at the Python level, so we can pass str.strip directly:
import polars as pl
csv = b"""
 A \t B \t C
1\t2\t3
4\t5\t6
"""
df = pl.read_csv(csv, separator="\t")
>>> df.columns
# [' A ', ' B ', ' C']
>>> df.rename(str.strip).columns
# ['A', 'B', 'C']
>>> df.rename(str.lower).columns
# [' a ', ' b ', ' c']
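Conceptually, passing a function to df.rename() is similar to building an old-name to new-name dictionary yourself and passing that instead. The sketch below illustrates the idea in plain Python; `build_rename_mapping` is a hypothetical helper, not part of Polars, and this is not a description of how Polars implements it internally.

```python
from typing import Callable, Dict, Iterable


def build_rename_mapping(
    columns: Iterable[str], function: Callable[[str], str]
) -> Dict[str, str]:
    """Hypothetical helper: map each old column name to function(old name)."""
    return {name: function(name) for name in columns}


# Column names taken from the df.columns output above.
mapping = build_rename_mapping([" A ", " B ", " C"], str.strip)
print(mapping)  # {' A ': 'A', ' B ': 'B', ' C': 'C'}
```

A dictionary like this can also be passed to df.rename(); handing over str.strip directly just skips building it by hand.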
This shortcut only works when the function needs no arguments beyond the column name itself.
For anything more complex, you'll need to use a lambda (or def).
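As a small sketch of the def route, suppose you want to both strip and lowercase each name, which a single method reference can't express. `clean_column_name` is a hypothetical name chosen for illustration, and the example runs on a plain list so it doesn't depend on Polars.

```python
def clean_column_name(name: str) -> str:
    """Hypothetical helper: strip surrounding whitespace, then lowercase."""
    return name.strip().lower()


# Column names taken from the df.columns output above.
raw_names = [" A ", " B ", " C"]
cleaned = [clean_column_name(name) for name in raw_names]
print(cleaned)  # ['a', 'b', 'c']
```

With a DataFrame, the same function could be passed as df.rename(clean_column_name), which avoids the lambda while keeping the logic in one named place.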