Suppose you have this dataframe
d = {'date':['2019-08-25', '2019-09-01', '2019-09-08'],
'data':[31, 31, 31]}
df_sample = pd.DataFrame(data=d)
df_sample.head()
and you want to measure how much new data comes in on average each week. For example, we had 31 new rows on 8/25 and then on 9/1 we got an additional 31 rows so thats like a 100% increase. What I want to know is on average from one week to the next how much new data comes in?
I know there is diff() and pct_change() but since this will just be 0 in these 3 samples I am wondering what would be the better approach here.
If your data is cumulative, you need a cumsum
before pct_change
:
df_sample['change'] = df_sample['data'].cumsum().pct_change().mul(100)
Output:
date data change
0 2019-08-25 31 NaN
1 2019-09-01 31 100.0
2 2019-09-08 31 50.0
Intermediate:
date data cumsum change
0 2019-08-25 31 31 NaN
1 2019-09-01 31 62 100.0
2 2019-09-08 31 93 50.0
If you want a week to week change, go with shift
:
df_sample['change'] = df_sample['data'].div(df_sample['data'].shift()).mul(100)
Output:
date data change
0 2019-08-25 31 NaN
1 2019-09-01 31 100.0
2 2019-09-08 31 100.0