pythonpandasvectorizationcorrelated-subquery

Vectorized solution for getting a value from another row in the same dataframe


I have a dataframe that looks like this:

Category Date Value PrevDate1 PrevDate2
A 2022-07-15 4 2022-07-01 2022-07-02
B 2022-07-16 2 2022-07-08 2022-07-09
B 2022-07-17 6 2022-07-05 2022-07-16
A 2022-07-18 7 2022-07-10 2022-07-15
A 2022-07-19 9 2022-07-18 2022-07-15

I want to add a column that selects Value from another part of the dataframe. If PrevDate1 is more recent than PrevDate2, it should return Value where PrevDate1 is equal to Date. Or equal to PrevDate2 if that is more recent. Each date is unique, so I should just be able to match the previous date with the date column and get the associated value without referring to the category.

The output would look like this:

Category Date Value PrevDate1 PrevDate2 PrevVal
A 2022-07-15 4 2022-07-01 2022-07-02 X
B 2022-07-16 2 2022-07-08 2022-07-09 Y
B 2022-07-17 6 2022-07-05 2022-07-16 2
A 2022-07-18 7 2022-07-10 2022-07-15 4
A 2022-07-19 9 2022-07-18 2022-07-15 7

I can do this with a loop, but this is slow and I keep seeing that there should be a vectorized solution for everything in pandas. I've tried variations of:

df['PrevVal'] = np.where(df['PrevDate1'] > df['PrevDate2'], df['Value'][df['Date'] == df['PrevDate1'], df['Value'][df['Date'] == df['PrevDate2'])

But this doesn't work, and I can see why it wouldn't. I've been looking for the python equivalent of a correlated subquery in SQL, but haven't been able to find anything.

The SQL code looks something like this:

update table set PrevVal = (select case when PrevDate1 > PrevDate2
then (select Value from table where Date = tb.PrevDate1)) 
from table tb

This tells it to search Date in the whole table for PrevDate1 in a given row.


Solution

  • Example code

    data = {'Category': {0: 'A', 1: 'B', 2: 'B', 3: 'A', 4: 'A'},
            'Date': {0: '2022-07-15', 1: '2022-07-16', 2: '2022-07-17', 3: '2022-07-18', 4: '2022-07-19'},
            'Value': {0: 4, 1: 2, 2: 6, 3: 7, 4: 9},
            'PrevDate1': {0: '2022-07-01', 1: '2022-07-08', 2: '2022-07-05', 3: '2022-07-10', 4: '2022-07-18'},
            'PrevDate2': {0: '2022-07-02', 1: '2022-07-09', 2: '2022-07-16', 3: '2022-07-15', 4: '2022-07-15'}}
    df = pd.DataFrame(data)
    

    Process

    make mapper(it is series) for mapping

    mapper = df['Value'].set_axis(df['Date'])
    

    mapper

    Date
    2022-07-15    4
    2022-07-16    2
    2022-07-17    6
    2022-07-18    7
    2022-07-19    9
    Name: Value, dtype: int64
    

    mapping by mapper to maximum(most recently) prev date

    df[['PrevDate1', 'PrevDate2']].max(axis=1).map(mapper)
    

    result:

    0   NaN
    1   NaN
    2   2.0
    3   4.0
    4   7.0
    Name: PrevDate2, dtype: float64
    

    make result to PrevVal column



    Full Code and Output

    mapper = df['Value'].set_axis(df['Date'])
    df.assign(PrevVal=df[['PrevDate1', 'PrevDate2']].max(axis=1).map(mapper))
    

    output:

        Category    Date        Value   PrevDate1   PrevDate2   PrevVal
    0   A           2022-07-15  4       2022-07-01  2022-07-02  NaN
    1   B           2022-07-16  2       2022-07-08  2022-07-09  NaN
    2   B           2022-07-17  6       2022-07-05  2022-07-16  2.0
    3   A           2022-07-18  7       2022-07-10  2022-07-15  4.0
    4   A           2022-07-19  9       2022-07-18  2022-07-15  7.0
    

    If you need to be different for each category, use merge instead map