pythonpandasdataframe

How to select column range based on partial column names in Pandas?


I have pandas dataframe and I am trying to select multiple columns (column range starting from Test to Bio Ref). Selection has to start from column Test to any column whose name starts with Bio. Below is the sample dataframe.

In reality it can contain:

  1. any number of columns before Test column,
  2. any number of columns between Test & Bio Ref like 2,3,4,5 etc.
  3. any number of columns after Bio Ref.
  4. Bio Ref column can contain suffix in it but Bio Ref will be there as start of column name always.
df_chunk = pd.DataFrame({
    'Waste':[None,None],
    'Test':['something', 'something'],
    '2':[None,None],
    '3':[None,None],
    'Bio Ref':['2-50','15-100'],
    'None':[None,None]})

df_chunk
   Waste         Test      2       3  Bio Ref  None
0   None    something   None    None     2-50  None
1   None    something   None    None   15-100  None

I have tried below codes that work:

df_chunk.columns.str.startswith('Bio')

df_chunk[df_chunk.columns[pd.Series(df_chunk.columns).str.startswith('Bio')==1]]

Issue: But when I try to use them for multiple column Selection then it doesn't work:

df_chunk.loc[:, 'Test':df_chunk.columns.str.startswith('Bio')]

Solution

  • You can creates masks for boolean indexing:

    m1 = np.maximum.accumulate(df_chunk.columns=='Test')
    # array([False,  True,  True,  True,  True,  True])
    m2 = np.maximum.accumulate(df_chunk.columns.str.startswith('Bio')[::-1])[::-1]
    # array([ True,  True,  True,  True,  True, False])
    
    # m1 & m2
    # array([False,  True,  True,  True,  True, False])
    
    out = df_chunk.loc[:, (m1&m2)]
    

    Or identify the correct names to build a slice:

    start = 'Test'
    end = next(iter(df_chunk.columns[df_chunk.columns.str.startswith('Bio')]), None)
    
    out = df_chunk.loc[:, slice(start, end)]
    

    Output:

            Test     2     3 Bio Ref
    0  something  None  None    2-50
    1  something  None  None  15-100