I have a dataframe that looks like below
Date | Name | In/Out | Time |
---|---|---|---|
2024-01-01 | Homer | IN | 07:10 |
2024-01-01 | Homer | OUT | 09:30 |
2024-01-01 | Homer | IN | 10:00 |
2024-01-01 | Homer | OUT | 16:00 |
2024-01-01 | Marge | In | 07:15 |
2024-01-01 | Marge | Out | 16:10 |
2024-01-01 | Bart | In | 07:14 |
2024-01-01 | Bart | Out | 10:00 |
2024-01-01 | Bart | In | 10:15 |
2024-01-01 | Bart | Out | 12:00 |
2024-01-01 | Bart | In | 12:30 |
2024-01-01 | Bart | Out | 17:00 |
My end goal is to calculate the total timedelta (total_seconds()) of intermediate IN/OUT
times only, as highlighted. Expected output below.
Date | Name | TimeDelta |
---|---|---|
2024-01-01 | Homer | 1800 |
2024-01-01 | Bart | 2700 |
I haven't been to sure where to start with this one and cannot find any examples anywhere else?
example dataframe:
example_df = pd.DataFrame([
['2024-01-01', 'Homer', 'in', '07:30'],
['2024-01-01', 'Homer', 'out' ,'09:00'],
['2024-01-01', 'Homer', 'in' ,'09:30'],
['2024-01-01', 'Homer', 'out' ,'16:00'],
['2024-01-01', 'Marge', 'in' , '06:20'],
['2024-01-01', 'Marge', 'out' ,'16:00'],
['2024-01-01', 'Bart', 'in' ,'07:10'],
['2024-01-01', 'Bart', 'out' ,'08:00'],
['2024-01-01', 'Bart', 'in' ,'08:20'],
['2024-01-01', 'Bart', 'out' ,'17:00'],
['2024-01-01', 'Barney', 'in' ,'08:10'],
['2024-01-01', 'Lisa', 'in' ,'08:05'],
['2024-01-01', 'Lisa', 'out' ,'14:00'],
['2024-01-01', 'Lisa', 'in' ,'14:15'],
['2024-01-01', 'Lisa', 'out' ,'18:10'],
['2024-01-01', 'Millhouse', 'out' ,'19:10'],
['2024-02-01', 'Homer', 'in', '07:30'],
['2024-02-01', 'Homer', 'out' ,'09:00'],
['2024-02-01', 'Marge', 'in' , '06:30'],
['2024-02-01', 'Marge', 'out' ,'09:10'],
['2024-02-01', 'Marge', 'in' ,'10:10'],
['2024-02-01', 'Marge', 'out' ,'16:10'],
['2024-02-01', 'Bart', 'in' ,'07:10'],
['2024-02-01', 'Bart', 'out' ,'15:00'],
['2024-02-01', 'Barney', 'in' ,'08:10'],
['2024-02-01', 'Lisa', 'in' ,'08:05'],
['2024-02-01', 'Lisa', 'out' ,'16:00'],
['2024-02-01', 'Millhouse', 'in' ,'08:10'],
['2024-02-01', 'Millhouse', 'in' ,'08:10'],
['2024-02-01', 'Millhouse', 'in' ,'16:15']],
columns=['Date', 'Name', 'In/Out', 'Time'])
Assuming Time is sorted within a group, that the first In/Out is always In, and that In/Out are always alternating.
You could convert the times to_datetime
, then use groupby.apply
to compute the diff
, ignore the first/last value (with iloc
) and sum
the "IN" timedeltas before converting to total_seconds
:
# cleanup IN/OUT format
df['In/Out'] = df['In/Out'].str.upper()
out = (df
.assign(dt=pd.to_datetime(df['Time'], format='%H:%M'))
.groupby(['Date', 'Name'])
.apply(lambda g:
g['dt'].diff().iloc[1:-1]
[g['In/Out'].eq('IN')]
.sum().total_seconds())
.reset_index(name='TimeDelta')
.query('TimeDelta>0') # optional: remove rows with null TimeDelta
)
Output:
Date Name TimeDelta
1 2024-01-01 J Bloggs 1800.0
2 2024-01-01 M Simpson 2700.0
NB. If any of the initial assumptions is incorrect, you just need to pre-process the data to sort it and remove invalid rows.
Example:
# cleanup IN/OUT format
df['In/Out'] = df['In/Out'].str.upper()
m1 = df.sort_values(by='Time').groupby(['Date', 'Name'])['In/Out'].shift(-1).ne(df['In/Out'])
out = (df[m1]
.assign(dt=pd.to_datetime(df.loc[m1, 'Time'], format='%H:%M'))
.groupby(['Date', 'Name'])
.apply(lambda g:
g['dt'].diff().iloc[1:-1]
[g['In/Out'].eq('IN')]
.sum().total_seconds())
.reset_index(name='TimeDelta')
.query('TimeDelta>0') # optional: remove rows with null TimeDelta
)
Output:
Date Name TimeDelta
1 2024-01-01 Bart 1200.0
2 2024-01-01 Homer 1800.0
3 2024-01-01 Lisa 900.0
10 2024-02-01 Marge 3600.0