Tags: python, dataframe, rename, python-polars

Remove all whitespace from the headers of a Polars DataFrame


I'm reading some CSV files whose column headers are pretty annoying: they contain spaces, tabs, etc.

import polars as pl

csv_file = b'''
   A   \tB    \tC   \tD \t  E   
CD\tE\t300 0\t0\t0
CD\tE\t1071 0\t0\t0
K\tE\t390 0\t0\t0
'''.strip()

I want to read the file, then remove all whitespace (spaces and/or tabs) from the column names. Currently I do:

df = pl.read_csv(csv_file, comment_prefix='#', separator='\t')
df = df.rename(lambda column_name: column_name.strip())

Is this the "polaric" way to do it? I'm not a big fan of lambdas, but if the only other solution is to write a function just for this, I guess I'll stick to lambdas.


Solution

  • Update: Polars 1.35.0

    .name.replace() was added in pull/17942; it allows regex replacements on column names.

    >>> df.select(pl.all().name.replace(r"^\s+|\s+$", ""))
    # shape: (3, 5)
    # ┌─────┬─────┬────────┬─────┬─────┐
    # │ A   ┆ B   ┆ C      ┆ D   ┆ E   │
    # │ --- ┆ --- ┆ ---    ┆ --- ┆ --- │
    # │ str ┆ str ┆ str    ┆ i64 ┆ i64 │
    # ╞═════╪═════╪════════╪═════╪═════╡
    # │ CD  ┆ E   ┆ 300 0  ┆ 0   ┆ 0   │
    # │ CD  ┆ E   ┆ 1071 0 ┆ 0   ┆ 0   │
    # │ K   ┆ E   ┆ 390 0  ┆ 0   ┆ 0   │
    # └─────┴─────┴────────┴─────┴─────┘
    

    If we check the column names before and after reassigning df:

    df.columns
    # ['A   ', 'B    ', 'C   ', 'D ', '  E   ']
    
    df = df.select(pl.all().name.replace(r"^\s+|\s+$", ""))
    
    df.columns
    # ['A', 'B', 'C', 'D', 'E']
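
To make the pattern itself easier to read, here is a small sketch that applies the same regex to plain strings with Python's re module. Polars uses its own regex engine, so re is only a stand-in here (an assumption that holds for a pattern this simple), and the extra " first name\t" header is a made-up example. It also shows that the pattern only trims the edges; removing whitespace inside a name as well needs a pattern such as \s+.

```python
import re

# Same pattern as in the answer: leading whitespace OR trailing whitespace.
EDGE_WHITESPACE = re.compile(r"^\s+|\s+$")

# The headers from the question, plus one hypothetical header with inner whitespace.
headers = ["A   ", "B    ", "C   ", "D ", "  E   ", " first name\t"]

# Trimming only the edges keeps the space inside "first name".
trimmed = [EDGE_WHITESPACE.sub("", h) for h in headers]
print(trimmed)    # ['A', 'B', 'C', 'D', 'E', 'first name']

# Matching every run of whitespace removes the inner space too.
collapsed = [re.sub(r"\s+", "", h) for h in headers]
print(collapsed)  # ['A', 'B', 'C', 'D', 'E', 'firstname']
```

If the second behaviour is what you want, the same \s+ pattern could presumably be passed to .name.replace() instead.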
    

    Original answer

    The solution is to use a function as you have shown.

    However, when .strip() is called without arguments, it can be simplified slightly.

    A method call such as " A ".strip() can also be written as a call to str.strip with the string as its argument:

    >>> " A ".strip()
    # 'A'
    >>> str.strip(" A ")
    # 'A'
    

    str.strip and the lambda in this case do the same thing:

    one = lambda column: column.strip()
    two = str.strip
    
    >>> one(" A ")
    # 'A'
    >>> two(" A ")
    # 'A'
    

    df.rename() calls the given function on each column name at the Python level, meaning we can pass str.strip directly.

    import polars as pl
    
    csv = b"""
    A \t B \t C
    1\t2\t3
    4\t5\t6
    """
    
    df = pl.read_csv(csv, separator="\t")
    
    >>> df.columns 
    # [' A ', ' B ', ' C']
    >>> df.rename(str.strip).columns
    # ['A', 'B', 'C']
    >>> df.rename(str.lower).columns
    # [' a ', ' b ', ' c']
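
As a rough mental model (not Polars' actual implementation), passing a function to df.rename() behaves like building an old-to-new mapping by calling the function on every column name. The helper name build_rename_mapping below is hypothetical and only there for illustration.

```python
# Headers as they come out of the tab-separated example above.
columns = [" A ", " B ", " C"]


def build_rename_mapping(names, func):
    """Call func on every name and return an old -> new mapping."""
    return {name: func(name) for name in names}


mapping = build_rename_mapping(columns, str.strip)
print(mapping)  # {' A ': 'A', ' B ': 'B', ' C': 'C'}

# Stripping could make two different headers collide (e.g. "A" and " A "),
# so it can be worth checking that the new names are still unique.
new_names = list(mapping.values())
has_duplicates = len(set(new_names)) != len(new_names)
print(has_duplicates)  # False
```

Since df.rename() also accepts a mapping, df.rename(mapping) should give the same columns as df.rename(str.strip) here.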
    

    Passing a method directly like this only works when it needs no additional arguments.

    For anything more complex, you'll need to use a lambda (or def).
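
A minimal sketch of what "more complex" might look like, using only the standard library. The function name normalize and the sample headers are made up for illustration. It also shows operator.methodcaller, a standard-library alternative that builds a callable with arguments baked in if you would rather avoid writing a lambda.

```python
from operator import methodcaller


def normalize(name: str) -> str:
    """Trim the edges, join inner whitespace runs with "_", and lowercase."""
    # str.split() without arguments splits on any whitespace run
    # and ignores leading/trailing whitespace.
    return "_".join(name.split()).lower()


# Equivalent to: lambda name: name.strip("_ ")
strip_underscores = methodcaller("strip", "_ ")

names = ["  First Name\t", "_C_", " D "]
print([normalize(n) for n in names])          # ['first_name', '_c_', 'd']
print([strip_underscores(n) for n in names])  # ['First Name\t', 'C', 'D']
```

Either callable takes a single string and returns a string, so it could be passed to df.rename() in the same way as str.strip.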