pandasdataframedrop

Keep first instance of column and drop others, using partial text in column name


A df has column ABC_81, ABC_y, ABC_aa, ZXC_aa, ZXC_5, ZXC_hi:

data = {
    'ABC_81': [1, 2, 3],
    'ABC_y': [4, 5, 6],
    'ABC_aa': [7, 8, 9],
    'ZXC_aa': [10, 11, 12],
    'ZXC_5': [13, 14, 15],
    'ZXC_hi': [16, 17, 18],
    'PPP_tr' : [88,55,99],
    'QWE_gf' : [78,98,14]
    }

Going from left to right in the df, I want to keep first instance of column name starting with ABC and ZXC, drop other columns starting with ABC and ZXC.

Keep ABC_81,
Drop ABC_y, ABC_aa

Similarly

Keep ZXC_aa,
Drop ZXC_5, ZXC_hi

and so on

Result should be:

data = {
    'ABC_81': [1, 2, 3],
    'ZXC_aa': [10, 11, 12],
    'PPP_tr' : [88,55,99],
    'QWE_gf' : [78,98,14]
    }

Additionally, can this be done without knowing the column names?


Solution

  • Assuming the fixed part come before the underscore, you can use a regex to extract the prefix, then identify the first value with duplicated and finally use boolean indexing to select the first column for each prefix:

    m = ~df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()
    
    out = df.loc[:, m]
    

    As a one-liner:

    out = df.loc[:, ~df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()]
    

    Output:

       ABC_81  ZXC_aa  PPP_tr  QWE_gf
    0       1      10      88      78
    1       2      11      55      98
    2       3      12      99      14
    

    Intermediates:

    df.columns.str.extract(r'^([^_]+)', expand=False)
    # Index(['ABC', 'ABC', 'ABC', 'ZXC', 'ZXC', 'ZXC'], dtype='object')
    
    df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()
    # array([False,  True,  True, False,  True,  True, False, False])
    
    ~df.columns.str.extract(r'^([^_]+)', expand=False).duplicated()
    # array([ True, False, False,  True, False, False,  True,  True])
    

    regex demo

    more complex filter

    You can add more conditions if needed:

    # extract prefix
    prefix = df.columns.str.extract(r'^([^_]+)', expand=False)
    # identify duplicated columns that match ABC/ZXC
    m = prefix.isin(['ABC', 'ZXC']) & prefix.duplicated()
    
    # select all but duplicated ABC/ZXC
    out = df.loc[:, ~m]