pythonpandasdataframe

pandas dataframe update with filter_func


I have two dataframes with identical shape and want to update df1 with df2 if some conditions are met

import pandas as pd
from typing import Any


df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(df1, "\n")
df2 = pd.DataFrame({"A": [7, 8, 9], "B": [10, 3, 12]})
print(df2, "\n")

# Define a condition function
def condition(x: Any) -> bool:
    """condition function to update only cells matching the conditions"""
    return True if x in [2, 7, 9] else False

df1.update(df2)
print(df1)

but if I use filter_func df1.update(df2,filter_func=condition) it fails with ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Unfortunately the docu is not very verbose.

How to update a dataframe with conditions correctly?


Solution

  • Your function will receive a 1D numpy array (per column), it should be vectorized and return a boolean 1D array (callable(1d-array) -> bool 1d-array).

    Use numpy.isin to test membership:

    def condition(x):
        """condition function to update only cells matching the conditions"""
        return np.isin(x, [2, 7, 9])
    
    df1.update(df2, filter_func=condition)
    

    with a lambda:

    df1.update(df2, filter_func=lambda x: np.isin(x, [2, 7, 9]))
    

    Alternatively, if you really can't vectorize with pure numpy functions (this should not be done here!), decorate with numpy.vectorize:

    @np.vectorize
    def condition(x: Any) -> bool:
        """condition function to update only cells matching the conditions"""
        return True if x in [2, 7, 9] else False
    
    df1.update(df2, filter_func=condition)
    

    Updated df1:

       A  B
    0  1  4
    1  8  5
    2  3  6