pythonpandasdataframe

In Python, how to compare the value to each subsequent value in a row until a condition is met


I have a table in the general structure below:

enter image description here

example code:

import pandas as pd
data = {
    'ID': [117, 117, 117, 117, 117, 117],
    'Date': ['2023-11-14', '2024-01-25', '2024-02-01', '2024-02-04', '2024-02-11', '2024-03-04']
}
df = pd.DataFrame(data)

What I would like to do is, compare the first row to the next row, until the difference between the dates meets some threshold, say 30 days. Then, once that row meets the threshold, I'd like to then test the next row against the subsequent row. It would look like this:

Result, using threshold of 30:

enter image description here

So to reiterate, its comparing the FIRST row to subsequent rows until some threshold is met. Then the count starts over at the first rep after that within the group to subsequent rows within the group.

I've tried this but so far I haven't been able to get the right rowwise comparison to even implement the thresholding bit.

thres = 30

i = 0
control = True
for record in df['Date']:
    if record > thres:
            print(i, 'in position!', i)
        control = False
    i += 1

Update

I want to do the above with groups based on ID.

enter image description here

example code:

import pandas as pd
data = {
    "ID": [117, 117, 117, 117, 117, 117, 118, 118, 118, 118, 118, 118],
    "Date": ["2023-11-14", "2024-01-25", "2024-02-01", "2024-02-04", "2024-02-11", "2024-03-04",
             "2024-01-02", "2024-01-28", "2024-02-04", "2024-02-18", "2024-03-11", "2024-06-05"]
}

df = pd.DataFrame(data)

desired output:

enter image description here


Solution

  • Code

    maybe need loop

    # example code
    import pandas as pd
    data = {'ID': [117, 117, 117, 117, 117, 117], 'Date': ['2023-11-14', '2024-01-25', '2024-02-01', '2024-02-04', '2024-02-11', '2024-03-04']}
    df = pd.DataFrame(data)
    
    # make variable dates (convert dtype to datetime)
    dates = pd.to_datetime(df['Date'])
    
    # chk 30 days (you can change your target by changing variable thresh)
    thresh = 30
    
    # date is compared as variable ref
    ref = dates.iloc[0]
    
    # variable result is list to hold the flags. First element is 1
    result = [1]
    
    # append flag(0 or 1) to result variable based on 30 days (loop)
    for date in dates.iloc[1:]:
        if (date - ref).days >= thresh:
            result.append(1) # append 1 as a flag if over 30 days
            ref = date # change refrence date
        else:
            result.append(0) # append 0 as a flag unless over 30 days
    
    # make flag column using result variable 
    out = df.assign(flag=result)
    

    out

        ID  Date        flag
    0   117 2023-11-14  1
    1   117 2024-01-25  1
    2   117 2024-02-01  0
    3   117 2024-02-04  0
    4   117 2024-02-11  0
    5   117 2024-03-04  1
    

    chk with new sample

    df = (pd.DataFrame(pd.date_range('2024-01-01', '2024-04-01'), columns=['Date'])
            .sample(10, random_state=3).sort_values('Date').reset_index(drop=True)
    )
    

    If apply the above code to the new sample(df), you get the following result

    out:

        Date                 Date       flag
    0   2024-01-10      0   2024-01-10  1
    1   2024-02-15      1   2024-02-15  1
    2   2024-02-22      2   2024-02-22  0
    3   2024-02-25      3   2024-02-25  0
    4   2024-02-28  ->  4   2024-02-28  0
    5   2024-03-09      5   2024-03-09  0
    6   2024-03-11      6   2024-03-11  0
    7   2024-03-25      7   2024-03-25  1
    8   2024-03-26      8   2024-03-26  0
    9   2024-03-31      9   2024-03-31  0
    

    Answer of updated question

    Replace the above with a custom function and apply the custom function with groupby + apply.

    # example code
    import pandas as pd
    data = {"ID": [117, 117, 117, 117, 117, 117, 118, 118, 118, 118, 118, 118], "Date": ["2023-11-14", "2024-01-25", "2024-02-01", "2024-02-04", "2024-02-11", "2024-03-04","2024-01-02", "2024-01-28", "2024-02-04", "2024-02-18", "2024-03-11", "2024-06-05"]}
    df = pd.DataFrame(data)
    
    # make custom function
    def get_flag(d, thresh=30):
        dates = pd.to_datetime(d['Date'])
        ref = dates.iloc[0]
        result = [1]
    
        for date in dates.iloc[1:]:
            if (date - ref).days >= thresh:
                result.append(1)
                ref = date 
            else:
                result.append(0)
        return d.assign(flag=result)
        
    # groupby + apply + custom function    
    out = df.groupby('ID', group_keys=False).apply(get_flag)
    

    out:

         ID        Date  flag
    0   117  2023-11-14     1
    1   117  2024-01-25     1
    2   117  2024-02-01     0
    3   117  2024-02-04     0
    4   117  2024-02-11     0
    5   117  2024-03-04     1
    6   118  2024-01-02     1
    7   118  2024-01-28     0
    8   118  2024-02-04     1
    9   118  2024-02-18     0
    10  118  2024-03-11     1
    11  118  2024-06-05     1