pythonpandasrolling-computationpandas-rolling

Count category duplicates within a time window in new column in python (similar to rolling with value_counts)


I have been trying to solve an exercise for some time and I haven’t been able to do it, I have a dataset containing a list of calls with the topic of the call (in this sample dataset I decided to use ice cream flavors as topics), In the call center they consider that a topic was solved in the first time if the topic was no mentioned in another call using a time window of 72 hours. I need to create a new column in the data frame that counts the number of times the Ice cream flavor in the row was mentioned within a window of 72 hours (count the number of occurrences of an event within a time window).

I saw a solution using get_dummies but this would be inefficient for me since I have more than 300 Ice cream flavors:

pandas `value_counts` on a rolling time window

The following is a sample of the data I have:

2014-01-01 07:21:51 Apple
2014-01-01 10:00:47 Orange
2014-01-01 13:24:58 Banana
2014-01-01 15:05:22 Strawberry
2014-01-01 23:26:55 Lemon
2014-01-02 10:07:15 Orange
2014-01-02 10:57:23 Banana
2014-01-03 06:32:11 Peach
2014-01-03 11:29:02 Orange
2014-01-03 19:07:37 Coconut
2014-01-03 19:39:53 Mango
2014-01-04 00:02:36 Grape
2014-01-04 06:51:53 Cherry
2014-01-04 07:53:01 Strawberry
2014-01-04 08:57:48 Coconut

And this is the expected result:

2014-01-01 07:21:51 Apple   1
2014-01-01 10:00:47 Orange  1
2014-01-01 13:24:58 Banana  1
2014-01-01 15:05:22 Strawberry  1
2014-01-01 23:26:55 Lemon   1
2014-01-02 10:07:15 Orange  2
2014-01-02 10:57:23 Banana  2
2014-01-03 06:32:11 Peach   1
2014-01-03 11:29:02 Orange  3
2014-01-03 19:07:37 Coconut 1
2014-01-03 19:39:53 Mango   1
2014-01-04 00:02:36 Grape   1
2014-01-04 06:51:53 Cherry  1
2014-01-04 07:53:01 Strawberry  2
2014-01-04 08:57:48 Coconut 2

I have found some similar questions, but not quite solving my need:

group by time and other column in pandas

Rolling count pandas for categorical variables using time

In pandas how to calculate 'Countif' on a moving window basis?


Solution

  • The added column count serves as a temporary helper so we can sum over it.

    Setup:

    df = pd.read_csv("data.csv")
    df["date"] = pd.to_datetime(df["date"])
    df.set_index("date", inplace=True)
    df["count"] = 1 
    

    Usage:

    result = df.groupby("flavor").rolling("72H").sum().reset_index()
    df = df.merge(result, on=["flavor", "date"], suffixes=("_old", ""))
    del df["count_old"]
    df.to_markdown()
    

    Outputs:

    |    | flavor     | date                |   count |
    |---:|:-----------|:--------------------|--------:|
    |  0 | Apple      | 2014-01-01 07:21:51 |       1 |
    |  1 | Orange     | 2014-01-01 10:00:47 |       1 |
    |  2 | Banana     | 2014-01-01 13:24:58 |       1 |
    |  3 | Strawberry | 2014-01-01 15:05:22 |       1 |
    |  4 | Lemon      | 2014-01-01 23:26:55 |       1 |
    |  5 | Orange     | 2014-01-02 10:07:15 |       2 |
    |  6 | Banana     | 2014-01-02 10:57:23 |       2 |
    |  7 | Peach      | 2014-01-03 06:32:11 |       1 |
    |  8 | Orange     | 2014-01-03 11:29:02 |       3 |
    |  9 | Coconut    | 2014-01-03 19:07:37 |       1 |
    | 10 | Mango      | 2014-01-03 19:39:53 |       1 |
    | 11 | Grape      | 2014-01-04 00:02:36 |       1 |
    | 12 | Cherry     | 2014-01-04 06:51:53 |       1 |
    | 13 | Strawberry | 2014-01-04 07:53:01 |       2 |
    | 14 | Coconut    | 2014-01-04 08:57:48 |       2 |