pythonpandasnumpy

Filter rows in data frame based on previous row condition


import pandas as pd


data = {
    'ITEM': [1]*22,
    'WAREHOUSE': [1000]*22,
    'DAILY_DATE': ['9/20/2024', '9/20/2024', '9/21/2024', '9/21/2024', '9/22/2024', 
                   '9/22/2024', '9/23/2024', '9/23/2024', '9/24/2024', '9/24/2024', 
                   '9/25/2024', '9/25/2024', '9/26/2024', '9/26/2024', '9/27/2024', 
                   '9/27/2024', '9/28/2024', '9/28/2024', '9/29/2024', '9/29/2024', '9/30/2024', 
                   '9/30/2024'],
    'LOAD_ARRIVAL_DATE': ['9/26/2024', '9/26/2024', '', '', '', '', '9/29/2024', '9/29/2024', 
                        '9/30/2024', '9/30/2024', '', '', '', '', '', '', '', '', '', '', '', ''],
    'FCST_CREATE_DATE': ['9/14/2024', '9/7/2024', '9/14/2024', '9/7/2024', '9/21/2024', 
                                '9/14/2024', '9/21/2024', '9/14/2024', '9/21/2024', '9/14/2024', 
                                '9/21/2024', '9/14/2024', '9/21/2024', '9/14/2024', '9/21/2024', 
                                '9/14/2024', '9/21/2024', '9/14/2024', '9/21/2024','9/28/2024', '9/28/2024', 
                                '9/21/2024'],
    'FCST_QTY': [17.06, 10.4624, 12.1876, 16.2801, 8.8821, 8.3799, 11.1043, 10.7751, 
                              8.8821, 5.9846, 5.5555, 8.3799, 11.1043, 16.7597, 11.1043, 9.5809, 
                              11.1043, 8.3799, 130.02, 157.2252, 48.6241, 235.9502]
}


df = pd.DataFrame(data)

I have a data frame, df, that gives daily forecast data for an item at a warehouse. Each date is prescribed two forecast quantities, one representing the forecast made 1 week prior, and another 2 weeks prior, and this corresponds to FCST_CREATE_DT. LOAD_ARRIVAL_DATE describes if a future shipment order was placed that day.

This is a bit hard for me to verbally describe, but I essentially want to filter my data frame so that I only keep the most recent forecast for when an order was placed. Basically, putting myself back in time, and only using the forecasts that were available at that time.

Visually, the table looks as such:

ITEM WAREHOUSE DAILY_DATE LOAD_ARRIVAL_DATE FCST_CREATE_DATE FCST_QTY
1 1000 9/20/2024 9/26/2024 9/14/2024 17.06
1 1000 9/20/2024 9/26/2024 9/7/2024 10.4624
1 1000 9/21/2024 9/14/2024 12.1876
1 1000 9/21/2024 9/7/2024 16.2801
1 1000 9/22/2024 9/21/2024 8.8821
1 1000 9/22/2024 9/14/2024 8.3799
1 1000 9/23/2024 9/29/2024 9/21/2024 11.1043
1 1000 9/23/2024 9/29/2024 9/14/2024 10.7751
1 1000 9/24/2024 9/30/2024 9/21/2024 8.8821
1 1000 9/24/2024 9/30/2024 9/14/2024 5.9846
1 1000 9/25/2024 9/21/2024 5.5555
1 1000 9/25/2024 9/14/2024 8.3799
1 1000 9/26/2024 9/21/2024 11.1043
1 1000 9/26/2024 9/14/2024 16.7597
1 1000 9/27/2024 9/21/2024 11.1043
1 1000 9/27/2024 9/14/2024 9.5809
1 1000 9/28/2024 9/21/2024 11.1043
1 1000 9/28/2024 9/14/2024 8.3799
1 1000 9/29/2024 9/21/2024 130.02
1 1000 9/29/2024 9/28/2024 157.2252
1 1000 9/30/2024 9/28/2024 48.6241
1 1000 9/30/2024 9/21/2024 235.9502

After, removing the unnecessary rows, I want it to look like:

ITEM WAREHOUSE DAILY_DATE LOAD_ARRIVAL_DATE FCST_CREATE_DATE FCST_QTY
1 1000 9/20/2024 9/26/2024 9/14/2024 17.06
1 1000 9/20/2024 9/26/2024 9/7/2024 10.4624
1 1000 9/21/2024 9/14/2024 12.1876
1 1000 9/21/2024 9/7/2024 16.2801
1 1000 9/22/2024 9/21/2024 8.8821
1 1000 9/22/2024 9/14/2024 8.3799
1 1000 9/23/2024 9/29/2024 9/21/2024 11.1043
1 1000 9/23/2024 9/29/2024 9/14/2024 10.7751
1 1000 9/24/2024 9/30/2024 9/21/2024 8.8821
1 1000 9/24/2024 9/30/2024 9/14/2024 5.9846
1 1000 9/25/2024 9/21/2024 5.5555
1 1000 9/25/2024 9/14/2024 8.3799
1 1000 9/26/2024 9/14/2024 16.7597
1 1000 9/27/2024 9/14/2024 9.5809
1 1000 9/28/2024 9/14/2024 8.3799
1 1000 9/29/2024 9/21/2024 130.02
11 1000 9/30/2024 9/21/2024 235.9502

Any rows where DAILY_DATE is before the first instance of LOAD_ARRIVAL_DT can be ignored. In this case, since the first instance of LOAD_ARRIVAL_DT is 9/26/2024, all rows where DAILY_DATE is less than that are ignored. From the date range of the first instance of LOAD_ARRIVAL_DT to the next instance of LOAD_ARRIVAL_DT, I want to consider only the most recent forecast that was retroactively available. So the date range would be 9/26/24 - 9/29/24. (Not inclusive of the end range). Since the order was made on 9/20/24, only forecasts that were created prior to that date should be considered, and I want the most recent one of this subset.

Then, I want to continue this for each subsequent order. An order was placed to be arrived on 9/29/24, created on 9/23/24, so forI've tried what seems like everything but I can't get this to work. Any help is appreciated. the actual day of 9/29/24, I only want to consider the most recent forecast less than 9/23/24. In this case 9/21/24.

This should continue indefinitely, past the last order.

I've tried what seems like everything but I can't get this to work properly.


Solution

  • Code

    Filter the DataFrame by identifying rows that occur before the first LOAD_ARRIVAL_DATE or do not represent the most recent forecast.

    First, it converts the DAILY_DATE and LOAD_ARRIVAL_DATE columns to datetime. Then, it checks to see if DAILY_DATE are earlier than the first arrival date. Then, it ranks FCST_CREATE_DATE within each DAILY_DATE group and excludes rows that are not the most recent forecast.

    dates = pd.to_datetime(df['DAILY_DATE'], errors='coerce')
    dates_arr = pd.to_datetime(df['LOAD_ARRIVAL_DATE'], errors='coerce')
    cond1 = dates < dates_arr.min()
    cond2 = df.groupby(['DAILY_DATE'])['FCST_CREATE_DATE'].rank(method='min').eq(1)
    out = df[cond1 | cond2]