I have been going round in circles with this and haven't been able to figure it out.
Suppose I have the following dataframe:
df = pd.DataFrame({
"person_id": ["1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3"],
"event": ["Alert1", "Alert1", "Alert1", "Alert2", "Alert1", "Alert1", "Alert1", "Alert2", "Alert2", "Alert2", "Alert2", "Alert2"],
"mode": ["Manual", "Manual", "Auto", "Manual", "Auto", "Auto", "Auto", "Manual", "Manual", "Manual", "Auto", "Manual"],
"date": ["2020-01-01", "2020-01-01", "2020-01-03", "2020-01-03", "2020-01-03", "2020-01-03", "2020-01-04", "2020-01-04", "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-05"]
}
)
df
index | person_id | event | mode | date |
---|---|---|---|---|
0 | 1 | Alert1 | Manual | 2020-01-01 |
1 | 1 | Alert1 | Manual | 2020-01-01 |
2 | 1 | Alert1 | Auto | 2020-01-03 |
3 | 1 | Alert2 | Manual | 2020-01-03 |
4 | 2 | Alert1 | Auto | 2020-01-03 |
5 | 2 | Alert1 | Auto | 2020-01-03 |
6 | 2 | Alert1 | Auto | 2020-01-04 |
7 | 2 | Alert2 | Manual | 2020-01-04 |
8 | 3 | Alert2 | Manual | 2020-01-04 |
9 | 3 | Alert2 | Manual | 2020-01-04 |
10 | 3 | Alert2 | Auto | 2020-01-05 |
11 | 3 | Alert2 | Manual | 2020-01-05 |
What I want is the count of each possible (person_id, event, mode) combination for every day in the dataset's date range, where the minimum date is the earliest date appearing in the data (here 2020-01-01) and the maximum date is the latest (here 2020-01-05), with a count of 0 for combinations that never occur on a given day. For example, in the case of the df above, the output would look like this:
index | person_id | event | mode | date | count |
---|---|---|---|---|---|
0 | 1 | Alert1 | Manual | 2020-01-01 | 2 |
1 | 1 | Alert1 | Auto | 2020-01-01 | 0 |
2 | 1 | Alert2 | Manual | 2020-01-01 | 0 |
3 | 1 | Alert2 | Auto | 2020-01-01 | 0 |
4 | 1 | Alert1 | Manual | 2020-01-02 | 0 |
5 | 1 | Alert1 | Auto | 2020-01-02 | 0 |
6 | 1 | Alert2 | Manual | 2020-01-02 | 0 |
7 | 1 | Alert2 | Auto | 2020-01-02 | 0 |
8 | 1 | Alert1 | Manual | 2020-01-03 | 0 |
9 | 1 | Alert1 | Auto | 2020-01-03 | 1 |
10 | 1 | Alert2 | Manual | 2020-01-03 | 1 |
11 | 1 | Alert2 | Auto | 2020-01-03 | 0 |
12 | 1 | Alert1 | Manual | 2020-01-04 | 0 |
13 | 1 | Alert1 | Auto | 2020-01-04 | 0 |
14 | 1 | Alert2 | Manual | 2020-01-04 | 0 |
15 | 1 | Alert2 | Auto | 2020-01-04 | 0 |
16 | 1 | Alert1 | Manual | 2020-01-05 | 0 |
17 | 1 | Alert1 | Auto | 2020-01-05 | 0 |
18 | 1 | Alert2 | Manual | 2020-01-05 | 0 |
19 | 1 | Alert2 | Auto | 2020-01-05 | 0 |
20 | 2 | Alert1 | Manual | 2020-01-01 | 0 |
21 | 2 | Alert1 | Auto | 2020-01-01 | 0 |
22 | 2 | Alert2 | Manual | 2020-01-01 | 0 |
23 | 2 | Alert2 | Auto | 2020-01-01 | 0 |
24 | 2 | Alert1 | Manual | 2020-01-02 | 0 |
25 | 2 | Alert1 | Auto | 2020-01-02 | 0 |
26 | 2 | Alert2 | Manual | 2020-01-02 | 0 |
27 | 2 | Alert2 | Auto | 2020-01-02 | 0 |
28 | 2 | Alert1 | Manual | 2020-01-03 | 0 |
29 | 2 | Alert1 | Auto | 2020-01-03 | 2 |
30 | 2 | Alert2 | Manual | 2020-01-03 | 0 |
31 | 2 | Alert2 | Auto | 2020-01-03 | 0 |
32 | 2 | Alert1 | Manual | 2020-01-04 | 0 |
33 | 2 | Alert1 | Auto | 2020-01-04 | 1 |
34 | 2 | Alert2 | Manual | 2020-01-04 | 1 |
35 | 2 | Alert2 | Auto | 2020-01-04 | 0 |
36 | 2 | Alert1 | Manual | 2020-01-05 | 0 |
37 | 2 | Alert1 | Auto | 2020-01-05 | 0 |
38 | 2 | Alert2 | Manual | 2020-01-05 | 0 |
39 | 2 | Alert2 | Auto | 2020-01-05 | 0 |
40 | 3 | Alert1 | Manual | 2020-01-01 | 0 |
41 | 3 | Alert1 | Auto | 2020-01-01 | 0 |
42 | 3 | Alert2 | Manual | 2020-01-01 | 0 |
43 | 3 | Alert2 | Auto | 2020-01-01 | 0 |
44 | 3 | Alert1 | Manual | 2020-01-02 | 0 |
45 | 3 | Alert1 | Auto | 2020-01-02 | 0 |
46 | 3 | Alert2 | Manual | 2020-01-02 | 0 |
47 | 3 | Alert2 | Auto | 2020-01-02 | 0 |
48 | 3 | Alert1 | Manual | 2020-01-03 | 0 |
49 | 3 | Alert1 | Auto | 2020-01-03 | 0 |
50 | 3 | Alert2 | Manual | 2020-01-03 | 0 |
51 | 3 | Alert2 | Auto | 2020-01-03 | 0 |
52 | 3 | Alert1 | Manual | 2020-01-04 | 0 |
53 | 3 | Alert1 | Auto | 2020-01-04 | 0 |
54 | 3 | Alert2 | Manual | 2020-01-04 | 2 |
55 | 3 | Alert2 | Auto | 2020-01-04 | 0 |
56 | 3 | Alert1 | Manual | 2020-01-05 | 0 |
57 | 3 | Alert1 | Auto | 2020-01-05 | 0 |
58 | 3 | Alert2 | Manual | 2020-01-05 | 1 |
59 | 3 | Alert2 | Auto | 2020-01-05 | 1 |
Importantly, each combination should end up with the exact same number of unique dates, so if I run the following line of code:
df_summarized.groupby(['person_id', 'event', 'mode'])['date'].nunique().reset_index()
The result should clearly show that each combination has 5 unique days of data.
How could I achieve this?
Thanks in advance
IIUC, what you need to do is first build the full set of all possible combinations, and then count their occurrences.
import pandas as pd
from itertools import product
# Create the original DataFrame
df = pd.DataFrame({
"person_id": ["1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", "3"],
"event": ["Alert1", "Alert1", "Alert1", "Alert2", "Alert1", "Alert1", "Alert1", "Alert2", "Alert2", "Alert2", "Alert2", "Alert2"],
"mode": ["Manual", "Manual", "Auto", "Manual", "Auto", "Auto", "Auto", "Manual", "Manual", "Manual", "Auto", "Manual"],
"date": ["2020-01-01", "2020-01-01", "2020-01-03", "2020-01-03", "2020-01-03", "2020-01-03", "2020-01-04", "2020-01-04", "2020-01-04", "2020-01-04", "2020-01-05", "2020-01-05"]
})
# Convert the string dates so a complete daily range can be built
df['date'] = pd.to_datetime(df['date'])
# Distinct values of each key column, plus every day between the min and max date
person_ids = df['person_id'].unique()
events = df['event'].unique()
modes = df['mode'].unique()
dates = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
# Cartesian product: one row for every possible combination on every day
all_combinations = pd.DataFrame(list(product(person_ids, events, modes, dates)), columns=['person_id', 'event', 'mode', 'date'])
# Count how often each combination actually occurs in the data
count_df = df.groupby(['person_id', 'event', 'mode', 'date']).size().reset_index(name='count')
# Left merge keeps every combination; ones that never occur get NaN, filled with 0
result = all_combinations.merge(count_df, on=['person_id', 'event', 'mode', 'date'], how='left').fillna(0)
result['count'] = result['count'].astype(int)  # fillna(0) produced floats; restore integer counts
result.reset_index(drop=True, inplace=True)
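You can confirm with your own check (the result frame is named result here rather than df_summarized) that every combination now has 5 unique days of data:
result.groupby(['person_id', 'event', 'mode'])['date'].nunique().reset_index()
Each of the 12 (person_id, event, mode) combinations reports 5 unique dates.
As an aside, here is a more compact sketch of the same idea, reusing the variables defined above: build the full index with MultiIndex.from_product and reindex the grouped counts onto it, which skips the explicit merge.
# Same technique in fewer steps: reindex the counts onto the full cartesian index
full_index = pd.MultiIndex.from_product(
    [person_ids, events, modes, dates],
    names=['person_id', 'event', 'mode', 'date']
)
result_alt = (
    df.groupby(['person_id', 'event', 'mode', 'date'])
      .size()
      .reindex(full_index, fill_value=0)  # combinations that never occur become 0
      .reset_index(name='count')
)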