pythonpandasdataframefiltercontains

How to filter dataframe column names containing 2 specified substrings?


I need the column names from the dataframe that contain both the term software and packages. I'm able to filter out columns containing one string.. for eg: software_cols = df.filter(regex='Software|software|SOFTWARE').columns

How do I achieve the same by mentioning 'Packages/packages/PACKAGES' as well.

Eligible column names should be like 'Local Software Packages', 'Software XYZ Packages', 'Software Package'


Solution

  • Keep things simple as you don't need a regex here, just use two boolean masks and a case independent comparison:

    # does the column name contain "software"?
    m1 = df.columns.str.contains('software', case=False)
    # does it contain "package"?
    m2 = df.columns.str.contains('package', case=False)
    
    # if both conditions are met, keep the column
    out = df.loc[:, m1&m2]
    

    Example input:

    df = pd.DataFrame(columns=['Local Software Packages', 'Software XYZ Packages', 'Software Package', 'Other', 'Software only'], index=[0])
    

    Output:

      Local Software Packages Software XYZ Packages Software Package
    0                     NaN                   NaN              NaN
    

    If you just want the names:

    df.columns[m1&m2]
    # Index(['Local Software Packages', 'Software XYZ Packages', 'Software Package'], dtype='object')