pythonpandas

python pandas remove text in column names after specific characters


I have a Dataframe with hundreds of columns and very long column-names. I want to remove any text in the column-names after the ":" or "." This is basically splitting on multiple delimiters.

I tried to extract column-names in a list and use the split method at ":" and "." and then keep only the portion of the text before ":" or "." but the split did not work as I wanted. I do not know why. any idea how to fix it and achieve my goal.

data = {'Name of the injured. Bla bla bla': ['Bill', 'John'],
           'Age of the injured: bla bla': [50,40],
          }
df_data = pd.DataFrame.from_dict(data)
print(df_data)
cols = df_data.columns.values
new_cols = [( x.split(':') or x.split('.') ) for x in cols]
print(new_cols)

This is the outcome that I need:

enter image description here

Thanks, GR


Solution

  • You could use regular expression re in order to split the column name strings, and then reassing them to the dataframe :

    import pandas as pd
    import re
    data = {
        'Name of the injured. Bla bla bla': ['Bill', 'John'],
        'Age of the injured: bla bla': [50, 40]
    }
    
    df_data = pd.DataFrame.from_dict(data)
    

    then, following your intuition, you could do:

    cols = df_data.columns.values
    new_cols = [re.split(r'[.:]', item)[0] for item in cols]
    df_data.columns = new_cols
    

    If you have some other charcaters to split your text, just add it between the brackets, and here we choose to keep only the first part, hence the [0].

    The result for df_data is as expected:

      Name of the injured  Age of the injured
    0                Bill                  50
    1                John                  40