I have a pandas dataframe (I have simplified table with one date showing in example), I want to calculate the average daily amount of time between two news posted by a unique Source
Input
source date time
Investing.com 2017-05-11 08:00:00
Investing.com 2017-05-11 12:00:00
Investing.com 2017-05-11 16:00:00
yahoo.com 2017-05-11 09:00:00
yahoo.com 2017-05-11 12:00:00
yahoo.com 2017-05-11 15:00:00
yahoo.com 2017-05-12 06:00:00
yahoo.com 2017-05-12 12:00:00
yahoo.com 2017-05-12 18:00:00
Desired_output
source date Average_Daily_time
Investing.com 2017-05-11 04:00:00
yahoo.com 2017-05-11 03:00:00
yahoo.com 2017-05-12 06:00:00
My Attempt
I merged the datetime in one timestamp and called it datetime
df.sort_values('datetime').groupby('source')['datetime'].apply(lambda x: x.diff().dt.seconds.mean()/60)
Issue
It calculates average time for all dates combined, not separate dates. How to show average time for separate dates?
Convert the time
column to timedelta
, then group
the dataframe by source
and date
and aggregate time
using a lambda function to calculate the mean
of diff
between rows
df['time'] = pd.to_timedelta(df['time'])
(
df.groupby(['source', 'date'])['time']
.agg(lambda d: d.diff().mean()).reset_index(name='avg')
)
source date avg
0 Investing.com 2017-05-11 0 days 04:00:00
1 yahoo.com 2017-05-11 0 days 03:00:00
2 yahoo.com 2017-05-12 0 days 06:00:00