import pandas as pd
data = {
'ITEM': [1]*22,
'WAREHOUSE': [1000]*22,
'DAILY_DATE': ['9/20/2024', '9/20/2024', '9/21/2024', '9/21/2024', '9/22/2024',
'9/22/2024', '9/23/2024', '9/23/2024', '9/24/2024', '9/24/2024',
'9/25/2024', '9/25/2024', '9/26/2024', '9/26/2024', '9/27/2024',
'9/27/2024', '9/28/2024', '9/28/2024', '9/29/2024', '9/29/2024', '9/30/2024',
'9/30/2024'],
'LOAD_ARRIVAL_DATE': ['9/26/2024', '9/26/2024', '', '', '', '', '9/29/2024', '9/29/2024',
'9/30/2024', '9/30/2024', '', '', '', '', '', '', '', '', '', '', '', ''],
'FCST_CREATE_DATE': ['9/14/2024', '9/7/2024', '9/14/2024', '9/7/2024', '9/21/2024',
'9/14/2024', '9/21/2024', '9/14/2024', '9/21/2024', '9/14/2024',
'9/21/2024', '9/14/2024', '9/21/2024', '9/14/2024', '9/21/2024',
'9/14/2024', '9/21/2024', '9/14/2024', '9/21/2024','9/28/2024', '9/28/2024',
'9/21/2024'],
'FCST_QTY': [17.06, 10.4624, 12.1876, 16.2801, 8.8821, 8.3799, 11.1043, 10.7751,
8.8821, 5.9846, 5.5555, 8.3799, 11.1043, 16.7597, 11.1043, 9.5809,
11.1043, 8.3799, 130.02, 157.2252, 48.6241, 235.9502]
}
df = pd.DataFrame(data)
I have a data frame, df, that gives daily forecast data for an item at a warehouse. Each date is prescribed two forecast quantities, one representing the forecast made 1 week prior, and another 2 weeks prior, and this corresponds to FCST_CREATE_DT. LOAD_ARRIVAL_DATE describes if a future shipment order was placed that day.
This is a bit hard for me to verbally describe, but I essentially want to filter my data frame so that I only keep the most recent forecast for when an order was placed. Basically, putting myself back in time, and only using the forecasts that were available at that time.
Visually, the table looks as such:
ITEM | WAREHOUSE | DAILY_DATE | LOAD_ARRIVAL_DATE | FCST_CREATE_DATE | FCST_QTY |
---|---|---|---|---|---|
1 | 1000 | 9/20/2024 | 9/26/2024 | 9/14/2024 | 17.06 |
1 | 1000 | 9/20/2024 | 9/26/2024 | 9/7/2024 | 10.4624 |
1 | 1000 | 9/21/2024 | 9/14/2024 | 12.1876 | |
1 | 1000 | 9/21/2024 | 9/7/2024 | 16.2801 | |
1 | 1000 | 9/22/2024 | 9/21/2024 | 8.8821 | |
1 | 1000 | 9/22/2024 | 9/14/2024 | 8.3799 | |
1 | 1000 | 9/23/2024 | 9/29/2024 | 9/21/2024 | 11.1043 |
1 | 1000 | 9/23/2024 | 9/29/2024 | 9/14/2024 | 10.7751 |
1 | 1000 | 9/24/2024 | 9/30/2024 | 9/21/2024 | 8.8821 |
1 | 1000 | 9/24/2024 | 9/30/2024 | 9/14/2024 | 5.9846 |
1 | 1000 | 9/25/2024 | 9/21/2024 | 5.5555 | |
1 | 1000 | 9/25/2024 | 9/14/2024 | 8.3799 | |
1 | 1000 | 9/26/2024 | 9/21/2024 | 11.1043 | |
1 | 1000 | 9/26/2024 | 9/14/2024 | 16.7597 | |
1 | 1000 | 9/27/2024 | 9/21/2024 | 11.1043 | |
1 | 1000 | 9/27/2024 | 9/14/2024 | 9.5809 | |
1 | 1000 | 9/28/2024 | 9/21/2024 | 11.1043 | |
1 | 1000 | 9/28/2024 | 9/14/2024 | 8.3799 | |
1 | 1000 | 9/29/2024 | 9/21/2024 | 130.02 | |
1 | 1000 | 9/29/2024 | 9/28/2024 | 157.2252 | |
1 | 1000 | 9/30/2024 | 9/28/2024 | 48.6241 | |
1 | 1000 | 9/30/2024 | 9/21/2024 | 235.9502 |
After, removing the unnecessary rows, I want it to look like:
ITEM | WAREHOUSE | DAILY_DATE | LOAD_ARRIVAL_DATE | FCST_CREATE_DATE | FCST_QTY |
---|---|---|---|---|---|
1 | 1000 | 9/20/2024 | 9/26/2024 | 9/14/2024 | 17.06 |
1 | 1000 | 9/20/2024 | 9/26/2024 | 9/7/2024 | 10.4624 |
1 | 1000 | 9/21/2024 | 9/14/2024 | 12.1876 | |
1 | 1000 | 9/21/2024 | 9/7/2024 | 16.2801 | |
1 | 1000 | 9/22/2024 | 9/21/2024 | 8.8821 | |
1 | 1000 | 9/22/2024 | 9/14/2024 | 8.3799 | |
1 | 1000 | 9/23/2024 | 9/29/2024 | 9/21/2024 | 11.1043 |
1 | 1000 | 9/23/2024 | 9/29/2024 | 9/14/2024 | 10.7751 |
1 | 1000 | 9/24/2024 | 9/30/2024 | 9/21/2024 | 8.8821 |
1 | 1000 | 9/24/2024 | 9/30/2024 | 9/14/2024 | 5.9846 |
1 | 1000 | 9/25/2024 | 9/21/2024 | 5.5555 | |
1 | 1000 | 9/25/2024 | 9/14/2024 | 8.3799 | |
1 | 1000 | 9/26/2024 | 9/14/2024 | 16.7597 | |
1 | 1000 | 9/27/2024 | 9/14/2024 | 9.5809 | |
1 | 1000 | 9/28/2024 | 9/14/2024 | 8.3799 | |
1 | 1000 | 9/29/2024 | 9/21/2024 | 130.02 | |
11 | 1000 | 9/30/2024 | 9/21/2024 | 235.9502 |
Any rows where DAILY_DATE is before the first instance of LOAD_ARRIVAL_DT can be ignored. In this case, since the first instance of LOAD_ARRIVAL_DT is 9/26/2024, all rows where DAILY_DATE is less than that are ignored. From the date range of the first instance of LOAD_ARRIVAL_DT to the next instance of LOAD_ARRIVAL_DT, I want to consider only the most recent forecast that was retroactively available. So the date range would be 9/26/24 - 9/29/24. (Not inclusive of the end range). Since the order was made on 9/20/24, only forecasts that were created prior to that date should be considered, and I want the most recent one of this subset.
Then, I want to continue this for each subsequent order. An order was placed to be arrived on 9/29/24, created on 9/23/24, so forI've tried what seems like everything but I can't get this to work. Any help is appreciated. the actual day of 9/29/24, I only want to consider the most recent forecast less than 9/23/24. In this case 9/21/24.
This should continue indefinitely, past the last order.
I've tried what seems like everything but I can't get this to work properly.
Code
Filter the DataFrame by identifying rows that occur before the first LOAD_ARRIVAL_DATE or do not represent the most recent forecast.
First, it converts the DAILY_DATE
and LOAD_ARRIVAL_DATE
columns to datetime. Then, it checks to see if DAILY_DATE
are earlier than the first arrival date. Then, it ranks FCST_CREATE_DATE
within each DAILY_DATE
group and excludes rows that are not the most recent forecast.
dates = pd.to_datetime(df['DAILY_DATE'], errors='coerce')
dates_arr = pd.to_datetime(df['LOAD_ARRIVAL_DATE'], errors='coerce')
cond1 = dates < dates_arr.min()
cond2 = df.groupby(['DAILY_DATE'])['FCST_CREATE_DATE'].rank(method='min').eq(1)
out = df[cond1 | cond2]