
Functional programming in Python with Pandas


I'm interested to know how one might rewrite the following function, foo, within the functional programming paradigm. I can't figure out how to apply filter_df() to the columns in kwargs and store the output without changing the value of the original DataFrame object, df.


def foo(df, **kwargs):
    for column, values in kwargs.items():
        df = filter_df(df, column, values)
    return df

def filter_df(df, column, values):
    return df.loc[df[column].isin(values)].reset_index(drop=True)

An obvious solution to me might be to assign the output of filter_df to a new variable, df_new, e.g.

def foo(df, **kwargs):
    for column, values in kwargs.items():
        df_new = filter_df(df, column, values)
    return df_new

However, this is not particularly memory efficient, as df could be quite large. Also, I'm not sure whether this option would be classed as purely functional, because df_new is affected on each loop iteration.


Solution

  • It's not totally clear what you mean by

    re-write the following function, foo, within the functional programming paradigm.

    and by

    However, this is not particularly memory efficient, as df could be quite large. Also, I'm not sure whether this option would be classed as purely functional, because df_new is affected on each loop iteration.

    Note that your second definition of foo doesn't produce the same output as the first: it returns only the rows that satisfy the last condition and ignores all the other conditions.
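To make the difference concrete, here is a minimal sketch on a made-up DataFrame. The names foo_v1 and foo_v2 and the sample data are assumptions introduced only for illustration; they stand for the first and second definitions of foo from the question.

```python
import pandas as pd

def filter_df(df, column, values):
    return df.loc[df[column].isin(values)].reset_index(drop=True)

def foo_v1(df, **kwargs):
    # First definition: each filter is applied to the result of the previous one.
    for column, values in kwargs.items():
        df = filter_df(df, column, values)
    return df

def foo_v2(df, **kwargs):
    # Second definition: every filter is applied to the original df,
    # so only the result of the last filter is returned.
    for column, values in kwargs.items():
        df_new = filter_df(df, column, values)
    return df_new

# Hypothetical sample data.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "x", "y"]})

both = foo_v1(df, a=[1, 2], b=["x"])       # a in [1, 2] AND b in ["x"]
last_only = foo_v2(df, a=[1, 2], b=["x"])  # only b in ["x"] is applied
```

Here both keeps the single row with a == 1, while last_only keeps the rows with a == 1 and a == 3, because the condition on a was discarded.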

    In each iteration, filter_df produces a new object (the rows of the DataFrame that satisfy df[column].isin(values)), since neither the .loc selection nor reset_index acts in place. df_new is not "affected on each loop iteration"; the name df_new is simply rebound to (i.e. points to) a new object in each iteration. Because each condition is applied to the original df separately, only the DataFrame resulting from the last one is returned.
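The rebinding behaviour can be checked directly. The following is a small sketch with hypothetical sample data (the names original and snapshot are assumptions for illustration): after the call, the local name points to a new DataFrame and the object passed in is untouched.

```python
import pandas as pd

def filter_df(df, column, values):
    return df.loc[df[column].isin(values)].reset_index(drop=True)

original = pd.DataFrame({"a": [1, 2, 3, 4]})
snapshot = original.copy()          # kept only to compare against later

df = original                       # two names, one object
df = filter_df(df, "a", [1, 2])     # df is rebound to a new DataFrame

same_object = df is original            # the names now refer to different objects
unchanged = original.equals(snapshot)   # the original data was not modified
```

This is why the first definition of foo in the question already leaves the caller's DataFrame intact: the assignment inside the loop only changes what the local name df refers to.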

    Solution

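    In this particular case, foo can be simplified using DataFrame.query, which evaluates all the conditions as a single expression, so no intermediate DataFrame is created per condition. Unlike filter_df, query keeps the original index labels, so append .reset_index(drop=True) if the reset index matters.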

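    def foo(df, **kwargs):
        # Builds e.g. "a in [1, 2] and b in ['x']"; relies on each column name being a
        # valid Python identifier and on repr(vals) being a valid list literal.
        query = ' and '.join(f'{col} in {vals}' for col, vals in kwargs.items())
        return df.query(query)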
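If the goal is specifically a loop-free, functional-style formulation, one possible sketch uses functools.reduce. The names foo_reduce and foo_mask are hypothetical and not part of the original answer. foo_reduce folds filter_df over the (column, values) pairs, which mirrors the first definition of foo; foo_mask instead combines all conditions into one boolean mask and indexes once, which avoids the per-condition intermediate DataFrames.

```python
from functools import reduce

import pandas as pd

def filter_df(df, column, values):
    return df.loc[df[column].isin(values)].reset_index(drop=True)

def foo_reduce(df, **kwargs):
    # Fold filter_df over the (column, values) pairs, starting from df.
    return reduce(lambda acc, item: filter_df(acc, *item), kwargs.items(), df)

def foo_mask(df, **kwargs):
    # AND all per-column conditions into one boolean mask, then index once.
    masks = (df[col].isin(vals) for col, vals in kwargs.items())
    all_true = pd.Series(True, index=df.index)
    mask = reduce(lambda left, right: left & right, masks, all_true)
    return df.loc[mask].reset_index(drop=True)

# Hypothetical sample data.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "y", "x", "y"]})

via_reduce = foo_reduce(df, a=[1, 2], b=["x"])
via_mask = foo_mask(df, a=[1, 2], b=["x"])
```

Both calls select the single row with a == 1 and b == "x", and neither modifies df. Note that with no keyword arguments foo_reduce returns the very same object it was given, whereas foo_mask returns a new DataFrame.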