data = [
{'DateTime': pd.to_datetime('2023-09-26 09:36:47'), 'Action': 'OUT', 'LicenseCode': 'APP01', 'User': 'USER01'},
{'DateTime': pd.to_datetime('2023-09-26 11:01:55'), 'Action': 'IN', 'LicenseCode': 'APP01', 'User': 'USER01'},
{'DateTime': pd.to_datetime('2023-09-27 10:01:20'), 'Action': 'OUT', 'LicenseCode': 'APP02', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 14:46:09'), 'Action': 'IN', 'LicenseCode': 'APP02', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 13:27:57'), 'Action': 'OUT', 'LicenseCode': 'APP02', 'User': 'USER01'},
{'DateTime': pd.to_datetime('2023-09-27 18:03:09'), 'Action': 'IN', 'LicenseCode': 'APP02', 'User': 'USER01'}
]
df = pd.DataFrame(data)
df.groupby([pd.Grouper(key='DateTime', freq='D'),'User', 'LicenseCode'])
Hi guys, I'd like to create a new elapsed time column (in hours) for licenses from the sample data of license checkout. This is just a snippet of 500k+ lines.
Conditions for new column:
elapsed time will be grouped by day, user & license code as only elapsed time for each day will be tabulated for each unique user & licensecode
if df['Action'] == OUT, elapsed time = 0
if (df['Action'] == OUT) & (df['Action'].shift() == OUT), elapsed time = df[DateTime].diff()
if 'Action' == IN, elapsed time = calculate cumulative time diff of previous 'OUT' line (they work in pairs OUT then IN)
there are instances of OUT,OUT,IN,IN (order of 'Action' rows), most ideal will be OUT,IN,OUT,IN
I've hit a wall when I try to do a groupby agg with a custom function as I can't reference the columns by e.g. df['Action'].
Example of Desired Output :
DateTime | Action | LicenseCode | User | Elapsed_Time |
---|---|---|---|---|
2023-09-26 09:36:47 | OUT | APP01 | USER01 | 0.00 |
2023-09-26 11:36:47 | IN | APP01 | USER01 | 2.00 |
2023-09-27 09:30:00 | OUT | APP02 | USER02 | 0.00 |
2023-09-27 14:30:00 | IN | APP02 | USER02 | 5.00 |
2023-09-27 15:30:00 | OUT | APP02 | USER02 | 5.00 |
2023-09-27 15:45:00 | IN | APP02 | USER02 | 5.25 |
2023-09-27 16:10:00 | OUT | APP01 | USER02 | 0.00 |
2023-09-27 16:40:00 | IN | APP01 | USER02 | 0.50 |
2023-09-27 17:00:00 | OUT | APP01 | USER02 | 0.50 |
2023-09-27 17:12:00 | OUT | APP01 | USER02 | 0.70 |
2023-09-27 17:42:00 | IN | APP01 | USER02 | 1.20 |
2023-09-27 17:52:00 | IN | APP01 | USER02 | 1.37 |
Any thoughts will be greatly appreciated!
Thanks to Panda Kim, I've managed to retrieve the desired output. Here is the code modification from @Panda Kim's input.
Code:
data = [
{'DateTime': pd.to_datetime('2023-09-26 09:36:47'), 'Action': 'OUT', 'LicenseCode': 'APP01', 'User': 'USER01'},
{'DateTime': pd.to_datetime('2023-09-26 11:36:47'), 'Action': 'IN', 'LicenseCode': 'APP01', 'User': 'USER01'},
{'DateTime': pd.to_datetime('2023-09-27 09:30:00'), 'Action': 'OUT', 'LicenseCode': 'APP02', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 14:30:00'), 'Action': 'IN', 'LicenseCode': 'APP02', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 15:30:00'), 'Action': 'OUT', 'LicenseCode': 'APP02', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 15:45:00'), 'Action': 'IN', 'LicenseCode': 'APP02', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 16:10:00'), 'Action': 'OUT', 'LicenseCode': 'APP01', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 16:40:00'), 'Action': 'IN', 'LicenseCode': 'APP01', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 17:00:00'), 'Action': 'OUT', 'LicenseCode': 'APP01', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 17:12:00'), 'Action': 'OUT', 'LicenseCode': 'APP01', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 17:42:00'), 'Action': 'IN', 'LicenseCode': 'APP01', 'User': 'USER02'},
{'DateTime': pd.to_datetime('2023-09-27 17:52:00'), 'Action': 'IN', 'LicenseCode': 'APP01', 'User': 'USER02'}
]
df = pd.DataFrame(data)
cond = (df['Action'].eq('OUT')) & (df['Action'].shift(+1).eq('IN'))
grp = cond.groupby([df['LicenseCode'],df['User']]).cumsum()
df['elapsed_time'] = df.groupby([pd.Grouper(key='DateTime', freq='D'), grp, 'User', 'LicenseCode'])['DateTime'].diff()
df['elapsed_time'] = df['elapsed_time'].fillna(pd.Timedelta(seconds=0))
df['elapsed_time'] = (df.groupby([pd.Grouper(key='DateTime', freq='D'),'User', 'LicenseCode']).elapsed_time.cumsum() \
.dt.total_seconds()/3600) \
.round(2)
print(df)