I have a table in the general structure below:
example code:
import pandas as pd
data = {
'ID': [117, 117, 117, 117, 117, 117],
'Date': ['2023-11-14', '2024-01-25', '2024-02-01', '2024-02-04', '2024-02-11', '2024-03-04']
}
df = pd.DataFrame(data)
What I would like to do is, compare the first row to the next row, until the difference between the dates meets some threshold, say 30 days. Then, once that row meets the threshold, I'd like to then test the next row against the subsequent row. It would look like this:
Result, using threshold of 30:
So to reiterate, its comparing the FIRST row to subsequent rows until some threshold is met. Then the count starts over at the first rep after that within the group to subsequent rows within the group.
I've tried this but so far I haven't been able to get the right rowwise comparison to even implement the thresholding bit.
thres = 30
i = 0
control = True
for record in df['Date']:
if record > thres:
print(i, 'in position!', i)
control = False
i += 1
Update
I want to do the above with groups based on ID.
example code:
import pandas as pd
data = {
"ID": [117, 117, 117, 117, 117, 117, 118, 118, 118, 118, 118, 118],
"Date": ["2023-11-14", "2024-01-25", "2024-02-01", "2024-02-04", "2024-02-11", "2024-03-04",
"2024-01-02", "2024-01-28", "2024-02-04", "2024-02-18", "2024-03-11", "2024-06-05"]
}
df = pd.DataFrame(data)
desired output:
Code
maybe need loop
# example code
import pandas as pd
data = {'ID': [117, 117, 117, 117, 117, 117], 'Date': ['2023-11-14', '2024-01-25', '2024-02-01', '2024-02-04', '2024-02-11', '2024-03-04']}
df = pd.DataFrame(data)
# make variable dates (convert dtype to datetime)
dates = pd.to_datetime(df['Date'])
# chk 30 days (you can change your target by changing variable thresh)
thresh = 30
# date is compared as variable ref
ref = dates.iloc[0]
# variable result is list to hold the flags. First element is 1
result = [1]
# append flag(0 or 1) to result variable based on 30 days (loop)
for date in dates.iloc[1:]:
if (date - ref).days >= thresh:
result.append(1) # append 1 as a flag if over 30 days
ref = date # change refrence date
else:
result.append(0) # append 0 as a flag unless over 30 days
# make flag column using result variable
out = df.assign(flag=result)
out
ID Date flag
0 117 2023-11-14 1
1 117 2024-01-25 1
2 117 2024-02-01 0
3 117 2024-02-04 0
4 117 2024-02-11 0
5 117 2024-03-04 1
chk with new sample
df = (pd.DataFrame(pd.date_range('2024-01-01', '2024-04-01'), columns=['Date'])
.sample(10, random_state=3).sort_values('Date').reset_index(drop=True)
)
If apply the above code to the new sample(df), you get the following result
out:
Date Date flag
0 2024-01-10 0 2024-01-10 1
1 2024-02-15 1 2024-02-15 1
2 2024-02-22 2 2024-02-22 0
3 2024-02-25 3 2024-02-25 0
4 2024-02-28 -> 4 2024-02-28 0
5 2024-03-09 5 2024-03-09 0
6 2024-03-11 6 2024-03-11 0
7 2024-03-25 7 2024-03-25 1
8 2024-03-26 8 2024-03-26 0
9 2024-03-31 9 2024-03-31 0
Answer of updated question
Replace the above with a custom function and apply the custom function with groupby
+ apply
.
# example code
import pandas as pd
data = {"ID": [117, 117, 117, 117, 117, 117, 118, 118, 118, 118, 118, 118], "Date": ["2023-11-14", "2024-01-25", "2024-02-01", "2024-02-04", "2024-02-11", "2024-03-04","2024-01-02", "2024-01-28", "2024-02-04", "2024-02-18", "2024-03-11", "2024-06-05"]}
df = pd.DataFrame(data)
# make custom function
def get_flag(d, thresh=30):
dates = pd.to_datetime(d['Date'])
ref = dates.iloc[0]
result = [1]
for date in dates.iloc[1:]:
if (date - ref).days >= thresh:
result.append(1)
ref = date
else:
result.append(0)
return d.assign(flag=result)
# groupby + apply + custom function
out = df.groupby('ID', group_keys=False).apply(get_flag)
out:
ID Date flag
0 117 2023-11-14 1
1 117 2024-01-25 1
2 117 2024-02-01 0
3 117 2024-02-04 0
4 117 2024-02-11 0
5 117 2024-03-04 1
6 118 2024-01-02 1
7 118 2024-01-28 0
8 118 2024-02-04 1
9 118 2024-02-18 0
10 118 2024-03-11 1
11 118 2024-06-05 1