I have a large Excel spreadsheet. I'm only interested in certain columns. Furthermore, I'm only interested in rows where specific columns meet certain criteria.
The following works:
import pandas as pd
import warnings
# this suppresses the openpyxl warning that we're seeing
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")
# These are the columns we're interested in
COLUMNS = [
"A",
"B",
"C"
]
# the source file
XL = "source.xlsx"
# sheet name in the source file
SHEET = "Sheet1"
# the output file
OUTPUT = "target.xlsx"
# the sheet name to be used in the output file
OUTSHEET = "Sheet1"
# This loads the entire spreadsheet into a pandas dataframe
df = pd.read_excel(XL, sheet_name=SHEET, usecols=COLUMNS).dropna()
# this replaces the original dataframe with rows where A contains "FOO"
df = df[df["A"].str.contains(r"\bFOO\b", regex=True)]
# now isolate those rows where the B contains "BAR"
df = df[df["B"].str.contains(r"\bBAR\b", regex=True)]
# output to the new spreadsheet
df.to_excel(OUTPUT, sheet_name=OUTSHEET, index=False)
This works. However, I can't help thinking that there might be a better way to manage the selection criteria especially if / when they get more complex.
Or is it a case of "step-by-step" is good?
You can certainly chain all your commands to avoid using intermediate variables, and combine all filters into a single expression (for example defining the condition in a col:regex dictionary and using loc
with numpy.logical_and
.reduce
):
conditions = {'A': r'\bFOO\b', 'B': r'\bBAR\b'}
(pd.read_excel(XL, sheet_name=SHEET, usecols=COLUMNS).dropna()
.loc[lambda x: np.logical_and.reduce([x[col].str.contains(cond, regex=True)
for col, cond in conditions.items()])]
.to_excel(OUTPUT, sheet_name=OUTSHEET, index=False)
)
Alternative with a custom filtering function:
def cust_filter(df):
m1 = df['A'].str.contains(r'\bFOO\b', regex=True)
m2 = df['B'].str.contains(r'\bBAR\b', regex=True)
return df[m1 & m2]
(pd.read_excel(XL, sheet_name=SHEET, usecols=COLUMNS).dropna()
.pipe(cust_filter)
.to_excel(OUTPUT, sheet_name=OUTSHEET, index=False)
)
Example input:
A B C
0 ABC GHI other
1 FOO BAR other
2 FOO JKL other
3 DEF BAR other
Example output:
A B C
1 FOO BAR other