python pandas dataframe

Convert string of multiple pandas conditions to conditions usable for filtering df


I have an Excel sheet that contains a lot of unnecessary information, so I would like to keep only those rows of the dataframe that contain specific values. However, the dataframe has no column names, and I need to apply the same filter to multiple dataframes that may have different numbers of columns. I would therefore like to select rows based on their values, without knowing the column name.

This is an example (here selecting the rows which contain the value '00 bis 01 Uhr' or the value '01 bis 02 Uhr'):

df[(df == '00 bis 01 Uhr').any(axis=1) | (df == '01 bis 02 Uhr').any(axis=1)]

However, I will need to select rows based on many more values, so I created a string with all the conditions I need to apply. This is the string:

'(df == "00 bis 01 Uhr").any(axis = 1) | (df == "01 bis 02 Uhr").any(axis = 1) | (df == "02 bis 03 Uhr").any(axis = 1) | (df == "03 bis 04 Uhr").any(axis = 1) | (df == "04 bis 05 Uhr").any(axis = 1) | (df == "05 bis 06 Uhr").any(axis = 1) | (df == "06 bis 07 Uhr").any(axis = 1) | (df == "07 bis 08 Uhr").any(axis = 1) | (df == "08 bis 09 Uhr").any(axis = 1) | (df == "09 bis 10 Uhr").any(axis = 1) | (df == "10 bis 11 Uhr").any(axis = 1) | (df == "11 bis 12 Uhr").any(axis = 1) | (df == "12 bis 13 Uhr").any(axis = 1) | (df == "13 bis 14 Uhr").any(axis = 1) | (df == "14 bis 15 Uhr").any(axis = 1) | (df == "15 bis 16 Uhr").any(axis = 1) | (df == "16 bis 17 Uhr").any(axis = 1) | (df == "17 bis 18 Uhr").any(axis = 1) | (df == "18 bis 19 Uhr").any(axis = 1) | (df == "19 bis 20 Uhr").any(axis = 1) | (df == "20 bis 21 Uhr").any(axis = 1) | (df == "21 bis 22 Uhr").any(axis = 1) | (df == "22 bis 23 Uhr").any(axis = 1) | (df == "23 bis 24 Uhr").any(axis = 1)'

How can I use this string to actually select the relevant rows?


Solution

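  • Instead of evaluating the string, keep the values in a list and build the row mask from that list. Note that `str.contains` does substring matching, whereas the `==` comparison in the question only matches whole cell values.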

    time_ranges = ["00 bis 01 Uhr", "01 bis 02 Uhr", "02 bis 03 Uhr", "03 bis 04 Uhr", 
                   "04 bis 05 Uhr", "05 bis 06 Uhr", "06 bis 07 Uhr", "07 bis 08 Uhr", 
                   "08 bis 09 Uhr", "09 bis 10 Uhr", "10 bis 11 Uhr", "11 bis 12 Uhr", 
                   "12 bis 13 Uhr", "13 bis 14 Uhr", "14 bis 15 Uhr", "15 bis 16 Uhr", 
                   "16 bis 17 Uhr", "17 bis 18 Uhr", "18 bis 19 Uhr", "19 bis 20 Uhr", 
                   "20 bis 21 Uhr", "21 bis 22 Uhr", "22 bis 23 Uhr", "23 bis 24 Uhr"]
    
    # Create a regex pattern that matches any of the time ranges
    pattern = '|'.join(time_ranges)
    
    # Select rows where any cell contains the pattern.
    # astype(str) lets the .str accessor handle numeric and empty cells.
    matching_rows = df[df.apply(lambda row: row.astype(str).str.contains(pattern).any(), axis=1)]
    
    # Output the matching rows
    print(matching_rows)
    

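    Alternatively, you can use a single regular expression instead of listing every value. Note that it matches any two-digit "NN bis NN Uhr" range, not only the 24 values listed above: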

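    # Regular expression pattern to match any "NN bis NN Uhr" time range
    # (no capture group, which would make str.contains emit a UserWarning)
    pattern = r'\d{2} bis \d{2} Uhr'
    
    # Select rows where any cell contains a match for the regex pattern.
    # astype(str) lets the .str accessor handle numeric and empty cells.
    matching_rows = df[df.apply(lambda row: row.astype(str).str.contains(pattern, regex=True).any(), axis=1)]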
    
    # Output the matching rows
    print(matching_rows)
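
If whole-cell equality is what you want (as in the `df == '...'` comparison from the question), a regex may not be needed at all. The following is a minimal sketch using `DataFrame.isin`; the helper name `rows_with_any_value` and the sample data are made up for illustration and are not from the question.

```python
import pandas as pd


def rows_with_any_value(df, values):
    """Return the rows of df in which at least one cell equals one of values."""
    # isin() compares whole cell values, so no regex or string conversion is needed
    mask = df.isin(values).any(axis=1)
    return df[mask]


# Hypothetical sample data standing in for the Excel sheet (no column names)
df = pd.DataFrame([
    ["Header", None, None],
    ["00 bis 01 Uhr", 12, 3.5],
    ["Summe", 40, 9.0],
    [7, "01 bis 02 Uhr", None],
])

# Build the 24 labels "00 bis 01 Uhr" ... "23 bis 24 Uhr" instead of typing them out
time_ranges = [f"{h:02d} bis {h + 1:02d} Uhr" for h in range(24)]

filtered = rows_with_any_value(df, time_ranges)
print(filtered)
```

Because the mask is built from a plain list, the same helper can be reused on dataframes with different numbers of columns, and there is no need to assemble or evaluate a string of conditions.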