pythonpandasregexdataframe

Remove special character and units form Pandas column name with Python


I'm working on a script to convert a data file from one format to another. I need to remove the special characters from the column headers.

I am using Pandas to read a CSV file with the below structure. I'm looking for a tidy way to remove the [units] form the column name.

Data File:

Date        ,Time        ,app1_sum [Ml] ,app1_q [l/s]  ,app1_h [m] ,app1_a [m²] ,app1_v [m/s] ,app1_t_water [°C]
02.10.2024  ,19:05:00    ,57293.336     ,620.78        ,0.436      ,0.586       ,1.059        ,18.2

My goal is to have column names stripped down to their simplest form: Date,Time,app1_sum,app1_q,app1_h,app1_a,app1_v,app1_t_water

My current approach is to remove the brackers first, then remove the remining units one at a time.

df.columns = df.columns.str.replace('[', '')
df.columns = df.columns.str.replace(']', '')
df.columns = df.columns.str.replace(' Ml', '') 

I also tried to use regex to remove one unit at a time, this works but doesn't feel right.

df.rename(columns=lambda x: re.sub(r'\[Ml\]', '', x), inplace=True)
df.rename(columns=lambda x: re.sub(r'\[l/s\]', '', x), inplace=True)
df.rename(columns=lambda x: re.sub(r'\[m\]', '', x), inplace=True)
df.rename(columns=lambda x: re.sub(r'\[m²\]', '', x), inplace=True)
df.rename(columns=lambda x: re.sub(r'\[m/s\]', '', x), inplace=True)
df.rename(columns=lambda x: re.sub(r'\[°C\]', '', x), inplace=True)

Solution

  • You can use the following code to replace all brackets and their content in the columns names:

    df.rename(columns=lambda x: re.sub(r'\[[^\]]+\]', '', x), inplace=True)
    

    \[ will match the opening bracket [

    [^\]]+ will match an abtritrary number of every character except ]

    \] will match the closing bracket

    The \ are needed to escape the brackets since they are part of regex syntax