python, pandas, datetime, resample

Resample pandas df with multiple groupbys so each condition has the same number of total days of data


I have been going round in circles with this and haven't been able to figure it out.

Suppose I have the following dataframe:

df = pd.DataFrame({
    "person_id": ["1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3"],
    "event": ["Alert1", "Alert1", "Alert1", "Alert2", "Alert1", "Alert1", "Alert1", "Alert2", "Alert2", "Alert2", "Alert2", "Alert2"],
    "mode": ["Manual", "Manual", "Auto", "Manual", "Auto", "Auto", "Auto", "Manual", "Manual", "Manual", "Auto", "Manual"],
    "date": ["2020-01-01", "2020-01-01", "2020-01-03", "2020-01-03", "2020-01-03", "2020-01-03", "2020-01-04", "2020-01-04", "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-05"]
}
)

df
index person_id event mode date
0 1 Alert1 Manual 2020-01-01
1 1 Alert1 Manual 2020-01-01
2 1 Alert1 Auto 2020-01-03
3 1 Alert2 Manual 2020-01-03
4 2 Alert1 Auto 2020-01-03
5 2 Alert1 Auto 2020-01-03
6 2 Alert1 Auto 2020-01-04
7 2 Alert2 Manual 2020-01-04
8 3 Alert2 Manual 2020-01-04
9 3 Alert2 Manual 2020-01-04
10 3 Alert2 Auto 2020-01-05
11 3 Alert2 Manual 2020-01-05

What I want is the count of every possible combination of person_id, event and mode for every day in the range covered by the data. The range starts at the first date appearing in the dataset (here 2020-01-01) and ends at the last date appearing in the dataset (here 2020-01-05). A combination that does not occur on a given day should get a count of 0. For the df above, the output would look like this:

index person_id event mode date count
0 1 Alert1 Manual 2020-01-01 2
1 1 Alert1 Auto 2020-01-01 0
2 1 Alert2 Manual 2020-01-01 0
3 1 Alert2 Auto 2020-01-01 0
4 1 Alert1 Manual 2020-01-02 0
5 1 Alert1 Auto 2020-01-02 0
6 1 Alert2 Manual 2020-01-02 0
7 1 Alert2 Auto 2020-01-02 0
8 1 Alert1 Manual 2020-01-03 0
9 1 Alert1 Auto 2020-01-03 1
10 1 Alert2 Manual 2020-01-03 1
11 1 Alert2 Auto 2020-01-03 0
12 1 Alert1 Manual 2020-01-04 0
13 1 Alert1 Auto 2020-01-04 0
14 1 Alert2 Manual 2020-01-04 0
15 1 Alert2 Auto 2020-01-04 0
16 1 Alert1 Manual 2020-01-05 0
17 1 Alert1 Auto 2020-01-05 0
18 1 Alert2 Manual 2020-01-05 0
19 1 Alert2 Auto 2020-01-05 0
20 2 Alert1 Manual 2020-01-01 0
21 2 Alert1 Auto 2020-01-01 0
22 2 Alert2 Manual 2020-01-01 0
23 2 Alert2 Auto 2020-01-01 0
24 2 Alert1 Manual 2020-01-02 0
25 2 Alert1 Auto 2020-01-02 0
26 2 Alert2 Manual 2020-01-02 0
27 2 Alert2 Auto 2020-01-02 0
28 2 Alert1 Manual 2020-01-03 0
29 2 Alert1 Auto 2020-01-03 2
30 2 Alert2 Manual 2020-01-03 0
31 2 Alert2 Auto 2020-01-03 0
32 2 Alert1 Manual 2020-01-04 0
33 2 Alert1 Auto 2020-01-04 1
34 2 Alert2 Manual 2020-01-04 1
35 2 Alert2 Auto 2020-01-04 0
36 2 Alert1 Manual 2020-01-05 0
37 2 Alert1 Auto 2020-01-05 0
38 2 Alert2 Manual 2020-01-05 0
39 2 Alert2 Auto 2020-01-05 0
40 3 Alert1 Manual 2020-01-01 0
41 3 Alert1 Auto 2020-01-01 0
42 3 Alert2 Manual 2020-01-01 0
43 3 Alert2 Auto 2020-01-01 0
44 3 Alert1 Manual 2020-01-02 0
45 3 Alert1 Auto 2020-01-02 0
46 3 Alert2 Manual 2020-01-02 0
47 3 Alert2 Auto 2020-01-02 0
48 3 Alert1 Manual 2020-01-03 0
49 3 Alert1 Auto 2020-01-03 0
50 3 Alert2 Manual 2020-01-03 0
51 3 Alert2 Auto 2020-01-03 0
52 3 Alert1 Manual 2020-01-04 0
53 3 Alert1 Auto 2020-01-04 0
54 3 Alert2 Manual 2020-01-04 2
55 3 Alert2 Auto 2020-01-04 0
56 3 Alert1 Manual 2020-01-05 0
57 3 Alert1 Auto 2020-01-05 0
58 3 Alert2 Manual 2020-01-05 1
59 3 Alert2 Auto 2020-01-05 1
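
As a sanity check on the size of the table above, here is a rough sketch of how many rows the full grid should have. The value lists are copied by hand from the example data, and the names `grid` and `expected_rows` are only illustrative.

```python
import pandas as pd

# Values taken by hand from the example df above
person_ids = ["1", "2", "3"]
events = ["Alert1", "Alert2"]
modes = ["Manual", "Auto"]

# Every calendar day between the first and last date in the data
dates = pd.date_range("2020-01-01", "2020-01-05", freq="D")

# Full grid: one entry per (person_id, event, mode, date) combination
grid = pd.MultiIndex.from_product(
    [person_ids, events, modes, dates],
    names=["person_id", "event", "mode", "date"],
)

expected_rows = len(grid)
print(expected_rows)  # 3 * 2 * 2 * 5 = 60, matching index 0..59 above
```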

Importantly, each combination should have the exact same number of unique datetimes at the end, so if I run the following line of code:

df_summarized.groupby(['person_id', 'event', 'mode'])['date'].nunique().reset_index()

The result should clearly show that each combination has 5 unique days of data.
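
To make that check explicit, it could be wrapped in a small helper along these lines. This is only a sketch: `check_equal_day_coverage` is a hypothetical name, and the `toy` frame is a made-up stand-in for `df_summarized`, not the real data.

```python
import pandas as pd


def check_equal_day_coverage(df_summarized: pd.DataFrame, expected_days: int) -> bool:
    """Return True if every (person_id, event, mode) group has expected_days unique dates."""
    days_per_group = df_summarized.groupby(["person_id", "event", "mode"])["date"].nunique()
    return bool((days_per_group == expected_days).all())


# Made-up stand-in for df_summarized: one combination covering two days
toy = pd.DataFrame({
    "person_id": ["1", "1"],
    "event": ["Alert1", "Alert1"],
    "mode": ["Manual", "Manual"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "count": [2, 0],
})

print(check_equal_day_coverage(toy, expected_days=2))
```

For the real data the call would use `expected_days=5`.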

How could I achieve this?

Thanks in advance


Solution

  • IIUC, what you need to do is first create the set of all possible combinations of person_id, event, mode and date, and then count their occurrences, filling in 0 where a combination never appears.

    import pandas as pd
    import numpy as np
    from itertools import product
    
    # Create the original DataFrame
    df = pd.DataFrame({
        "person_id": ["1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3"],
        "event": ["Alert1", "Alert1", "Alert1", "Alert2", "Alert1", "Alert1", "Alert1", "Alert2", "Alert2", "Alert2", "Alert2", "Alert2"],
        "mode": ["Manual", "Manual", "Auto", "Manual", "Auto", "Auto", "Auto", "Manual", "Manual", "Manual", "Auto", "Manual"],
        "date": ["2020-01-01", "2020-01-01", "2020-01-03", "2020-01-03", "2020-01-03", "2020-01-03", "2020-01-04", "2020-01-04", "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-05"]
    })
    
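    # Parse the date strings so the full daily range can be built
    df['date'] = pd.to_datetime(df['date'])
    
    # Unique values of each grouping column, plus every calendar day from the first to the last date
    person_ids = df['person_id'].unique()
    events = df['event'].unique()
    modes = df['mode'].unique()
    dates = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
    
    # Cartesian product: one row per (person_id, event, mode, date) combination
    all_combinations = pd.DataFrame(list(product(person_ids, events, modes, dates)), columns=['person_id', 'event', 'mode', 'date'])
    
    # Observed counts for the combinations that actually occur in the data
    count_df = df.groupby(['person_id', 'event', 'mode', 'date']).size().reset_index(name='count')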
    
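    # Left merge keeps every combination; missing counts become 0 and stay integers
    result = all_combinations.merge(count_df, on=['person_id', 'event', 'mode', 'date'], how='left')
    result['count'] = result['count'].fillna(0).astype(int)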
    result.reset_index(drop=True, inplace