python, pandas, openpyxl

Optimum selection mechanism when choosing relevant rows from a dataframe


I have a large Excel spreadsheet. I'm only interested in certain columns. Furthermore, I'm only interested in rows where specific columns meet certain criteria.

The following works:

import pandas as pd
import warnings

# this suppresses the openpyxl warning that we're seeing
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")

# These are the columns we're interested in
COLUMNS = [
    "A",
    "B",
    "C"
]

# the source file
XL = "source.xlsx"
# sheet name in the source file
SHEET = "Sheet1"
# the output file
OUTPUT = "target.xlsx"
# the sheet name to be used in the output file
OUTSHEET = "Sheet1"

# This loads the selected columns of the sheet into a pandas dataframe and drops incomplete rows
df = pd.read_excel(XL, sheet_name=SHEET, usecols=COLUMNS).dropna()
# this replaces the original dataframe with rows where A contains "FOO"
df = df[df["A"].str.contains(r"\bFOO\b", regex=True)]
# now isolate those rows where column B contains "BAR"
df = df[df["B"].str.contains(r"\bBAR\b", regex=True)]
# output to the new spreadsheet
df.to_excel(OUTPUT, sheet_name=OUTSHEET, index=False)

This works. However, I can't help thinking there might be a better way to manage the selection criteria, especially if/when they get more complex.

Or is it simply a case where "step-by-step" is good enough?


Solution

  • You can certainly chain all your commands to avoid intermediate variables, and combine all the filters into a single expression (for example, defining the conditions in a col:regex dictionary and using loc with numpy.logical_and.reduce):

    import numpy as np

    conditions = {'A': r'\bFOO\b', 'B': r'\bBAR\b'}
    
    (pd.read_excel(XL, sheet_name=SHEET, usecols=COLUMNS).dropna()
       .loc[lambda x: np.logical_and.reduce([x[col].str.contains(cond, regex=True)
                                             for col, cond in conditions.items()])]
       .to_excel(OUTPUT, sheet_name=OUTSHEET, index=False)
    )
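
    A nice property of this layout is that adding a criterion only means adding a dictionary entry; the pipeline itself stays the same. For instance (the pattern on C below is purely hypothetical, not from the question):

    conditions = {'A': r'\bFOO\b', 'B': r'\bBAR\b', 'C': r'\bBAZ\b'}  # hypothetical extra criterion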
    

    Alternative with a custom filtering function:

    def cust_filter(df):
        m1 = df['A'].str.contains(r'\bFOO\b', regex=True)
        m2 = df['B'].str.contains(r'\bBAR\b', regex=True)
        return df[m1 & m2]
    
    (pd.read_excel(XL, sheet_name=SHEET, usecols=COLUMNS).dropna()
       .pipe(cust_filter)
       .to_excel(OUTPUT, sheet_name=OUTSHEET, index=False)
    )
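
    If you would rather not pull in numpy at all, a minimal sketch of the same idea in pure pandas (reusing the hypothetical conditions dictionary and the XL/SHEET/COLUMNS/OUTPUT/OUTSHEET constants from the question) could combine the per-column masks with concat and all:

    import pandas as pd

    conditions = {'A': r'\bFOO\b', 'B': r'\bBAR\b'}

    def match_all(df, conditions):
        # one boolean Series per column/pattern pair; keep rows where every mask is True
        masks = pd.concat(
            [df[col].str.contains(pat, regex=True) for col, pat in conditions.items()],
            axis=1,
        )
        return df[masks.all(axis=1)]

    (pd.read_excel(XL, sheet_name=SHEET, usecols=COLUMNS).dropna()
       .pipe(match_all, conditions)
       .to_excel(OUTPUT, sheet_name=OUTSHEET, index=False)
    )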
    

    Example input:

         A    B      C
    0  ABC  GHI  other
    1  FOO  BAR  other
    2  FOO  JKL  other
    3  DEF  BAR  other
    

    Example output:

         A    B      C
    1  FOO  BAR  other
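
    Finally, if the criteria later grow beyond whole-word regex matches, the same pattern extends naturally by storing callables instead of regex strings. This is only a hypothetical sketch (the length check on C is an assumed example, not part of the question):

    import numpy as np

    # hypothetical: each value is a callable mapping a column to a boolean mask,
    # so criteria are no longer limited to regex matches
    conditions = {
        'A': lambda s: s.str.contains(r'\bFOO\b', regex=True),
        'C': lambda s: s.str.len() > 3,   # assumed extra criterion on column C
    }

    def apply_filters(df, conditions):
        masks = [func(df[col]) for col, func in conditions.items()]
        return df[np.logical_and.reduce(masks)]

    The pipeline itself is unchanged: .pipe(apply_filters, conditions) replaces .pipe(cust_filter).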