I'd like to know how one might rewrite the following function, foo, within the functional programming paradigm. I can't figure out how to apply filter_df() to the columns in kwargs and store the output without changing the value of the original DataFrame object, df.
def foo(df, **kwargs):
    for column, values in kwargs.items():
        df = filter_df(df, column, values)
    return df

def filter_df(df, column, values):
    return df.loc[df[column].isin(values)].reset_index(drop=True)
An obvious solution to me might be to assign the output of filter_df to a new variable, df_new, e.g.
def foo(df, **kwargs):
    for column, values in kwargs.items():
        df_new = filter_df(df, column, values)
    return df_new
However, this is not particularly memory efficient, as df could be quite large. Also, I'm not sure if this option would be classed as purely functional, because df_new is affected on each loop iteration.
It's not totally clear what you mean by
"re-write the following function, foo, within the functional programming paradigm."
and by
"However, this is not particularly memory efficient as df could be quite large. Also, I'm not sure if this option would be classed as purely functional because df_new is affected on each loop iteration."
Note that your second definition of foo doesn't produce the same output as the first: it returns only the rows that satisfy the last condition and ignores the remaining conditions.
In each iteration, filter_df produces a new object (the rows of the DataFrame that satisfy df[column].isin(values)), since reset_index doesn't act in place. df_new is not "affected on each loop iteration"; the name df_new is simply rebound to (i.e. made to point to) a new object in each iteration. Because each condition is applied to the unfiltered df separately, only the DataFrame resulting from the last one is returned.
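The difference between the two versions, and the fact that neither touches the caller's DataFrame, can be checked with a small sketch. The names foo_loop and foo_last_only and the sample data are made up for illustration; they stand in for the two definitions of foo from the question.

```python
import pandas as pd


def filter_df(df, column, values):
    return df.loc[df[column].isin(values)].reset_index(drop=True)


def foo_loop(df, **kwargs):
    # First version: the local name `df` is rebound to each filtered result.
    # The caller's object is never modified.
    for column, values in kwargs.items():
        df = filter_df(df, column, values)
    return df


def foo_last_only(df, **kwargs):
    # Second version: every filter starts from the unfiltered `df`,
    # so only the last condition is reflected in the result.
    for column, values in kwargs.items():
        df_new = filter_df(df, column, values)
    return df_new


# Hypothetical sample data, for illustration only.
original = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "x", "y"]})
snapshot = original.copy()

chained = foo_loop(original, a=[1, 2], b=["x"])
last_only = foo_last_only(original, a=[1, 2], b=["x"])

print(original.equals(snapshot))  # the caller's DataFrame is unchanged
print(chained)    # rows satisfying both conditions
print(last_only)  # rows satisfying only the condition on "b"
```

With this data, the chained version keeps the single row where a is 1, while the second version also keeps the row where a is 3, because the condition on a was discarded.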
Solution
In this particular case, foo can be simplified using DataFrame.query. All conditions are combined into a single expression, so you don't create an unnecessary intermediate DataFrame for each condition.
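def foo(df, **kwargs):
    # Build one expression such as "a in [1, 2] and b in ['x']"
    query = ' and '.join(f'{col} in {vals}' for col, vals in kwargs.items())
    # reset_index mirrors the original foo, which returns a fresh 0..n-1 index
    return df.query(query).reset_index(drop=True)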
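If the goal is specifically a version without a loop that rebinds a name, another option (not part of the answer above, offered only as a sketch) is to fold over kwargs with functools.reduce. The names foo_reduce and foo_mask are illustrative. foo_reduce chains filter_df exactly as the original loop does; foo_mask instead combines all conditions into one boolean mask and indexes df once, which avoids the intermediate DataFrames.

```python
from functools import reduce

import pandas as pd


def filter_df(df, column, values):
    return df.loc[df[column].isin(values)].reset_index(drop=True)


def foo_reduce(df, **kwargs):
    # Fold filter_df over the (column, values) pairs, starting from df.
    # Equivalent to the original loop, but without rebinding a name.
    return reduce(
        lambda acc, item: filter_df(acc, item[0], item[1]),
        kwargs.items(),
        df,
    )


def foo_mask(df, **kwargs):
    # Combine every condition into a single boolean mask, then index once.
    # Only boolean Series are created along the way, not filtered DataFrames.
    mask = reduce(
        lambda acc, item: acc & df[item[0]].isin(item[1]),
        kwargs.items(),
        pd.Series(True, index=df.index),
    )
    return df.loc[mask].reset_index(drop=True)


# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "x", "y"]})

by_reduce = foo_reduce(df, a=[1, 3, 4], b=["x"])
by_mask = foo_mask(df, a=[1, 3, 4], b=["x"])

print(by_reduce.equals(by_mask))  # both approaches select the same rows
print(by_mask)
```

Unlike the query-string approach, both functions also work when called with no keyword arguments and with column names that are not valid Python identifiers, since no expression string has to be parsed.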