My question is not about how to calculate IRR (internal rate of return), but rather, given a data set similar to the one below, how best to calculate IRR without waiting months for the result when the sample size increases dramatically.
I am using the np.irr function.
import pandas as pd
import numpy as np
date_list =['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04','2018-01-05', '2018-01-06', '2018-01-07', '2018-01-14','2018-01-21', '2018-01-31','2018-02-08', '2018-02-28']
ids_list = [1,1,1,1,2,2,2,2,3,3,3,3]
flows_list = [ -10, 2, 2, 10, -50, 25, 20, 20, -100, 0, 3, 150]
df = pd.DataFrame(list(zip(date_list,ids_list,flows_list)), columns=['Date','ID','Flow'])
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
This yields the dataset below:
In [144]: df
Out[144]:
         Date  ID  Flow
0  2018-01-01   1   -10
1  2018-01-02   1     2
2  2018-01-03   1     2
3  2018-01-04   1    10
4  2018-01-05   2   -50
5  2018-01-06   2    25
6  2018-01-07   2    20
7  2018-01-14   2    20
8  2018-01-21   3  -100
9  2018-01-31   3     0
10 2018-02-08   3     3
11 2018-02-28   3   150
Data Explanation
Date is the day the cash flowed in or out.
ID is essentially the unique id of each investment made.
Flow is the cash flow of that ID (investment).
I need to use a daily frequency as my input for np.irr.
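For context on why the frequency matters (my addition, not part of the original question): np.irr treats every array element as one equally spaced period, so inserting zero flows changes the answer. Below is a minimal sketch of the roots-based approach np.irr uses internally (np.irr itself was removed in NumPy 1.20 and lives on as numpy_financial.irr):

```python
import numpy as np

def irr(cfs):
    # np.irr's approach: find roots of the cash-flow polynomial in x = 1/(1+r),
    # keep real positive roots, and return the rate closest to zero
    roots = np.roots(np.asarray(cfs)[::-1])
    real = roots[(abs(roots.imag) < 1e-9) & (roots.real > 0)].real
    rates = 1.0 / real - 1
    return rates[np.argmin(abs(rates))]

irr([-100, 110])        # one period: 0.1
irr([-100, 0, 0, 110])  # same flows spread over three periods: ~0.0323
```

The zeros push the final inflow further out in time, so the per-period rate drops even though the cash amounts are identical.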
If I do a simple pandas.groupby:
In [145]: df.groupby(['ID'])['Flow'].agg(np.irr)
Out[145]:
ID
1    0.141962
2    0.150155
3    0.153450
Name: Flow, dtype: float64
So for ID 1, the np.irr returned makes sense, as my frequency is consistent.
However, for the rest you'll see the dates are not equally spaced by day.
Example of 'manually' calculating np.irr for ID 3:
d3 = df.loc[df.ID == 3, 'Date']
(d3 - d3.min()).dt.days
8      0
9     10
10    18
11    38
Name: Date, dtype: int64
You can see above that the cash flows occur at the beginning, then on the 10th, the 18th, and finally the 38th day.
cfs = np.zeros(39)
cfs[[0,10,18,38]] = df.loc[df.ID ==3]['Flow'].values
np.irr(cfs)
This yields the actual np.irr for ID 3:
Out[155]: 0.011386397119650837
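For reference, the manual steps above can be wrapped in a helper and applied per ID with groupby.apply. This is only a sketch of a baseline (my addition): it still visits each group in Python, and it uses a roots-based stand-in for np.irr, since newer NumPy versions moved irr to numpy_financial:

```python
import numpy as np
import pandas as pd

date_list = ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
             '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-14',
             '2018-01-21', '2018-01-31', '2018-02-08', '2018-02-28']
ids_list = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
flows_list = [-10, 2, 2, 10, -50, 25, 20, 20, -100, 0, 3, 150]
df = pd.DataFrame({'Date': pd.to_datetime(date_list), 'ID': ids_list,
                   'Flow': flows_list})

def irr(cfs):
    # roots-based IRR, the approach np.irr used internally; x = 1/(1+r)
    roots = np.roots(np.asarray(cfs)[::-1])
    real = roots[(abs(roots.imag) < 1e-9) & (roots.real > 0)].real
    rates = 1.0 / real - 1
    return rates[np.argmin(abs(rates))]

def irr_from_dated_flows(g):
    # place each flow at its day offset from the group's first date
    days = (g['Date'] - g['Date'].min()).dt.days.to_numpy()
    cfs = np.zeros(days.max() + 1)
    np.add.at(cfs, days, g['Flow'].to_numpy())  # sums flows on duplicate days
    return irr(cfs)

result = df.groupby('ID')[['Date', 'Flow']].apply(irr_from_dated_flows)
```

This reproduces the manual result for ID 3 (0.011386) and leaves the evenly spaced ID 1 unchanged (0.141962), but the per-group Python loop is exactly the overhead the question is trying to avoid at scale.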
How can I calculate np.irr across a pandas.DataFrame in an optimal manner when the cash flow frequencies are inconsistent?
This seems to be the most efficient and accurate way I could find, avoiding for loops!
Load Example Data
import pandas as pd
import numpy as np
date_list =['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04','2018-01-05', '2018-01-06', '2018-01-07', '2018-01-14','2018-01-21', '2018-01-31','2018-02-08', '2018-02-28']
ids_list = [1,1,1,1,2,2,2,2,3,3,3,3]
flows_list = [ -10, 2, 2, 10, -50, 25, 20, 20, -100, 0, 3, 150]
df = pd.DataFrame(list(zip(date_list,ids_list,flows_list)), columns=['Date','ID','Flow'])
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
Re-index the data frame to include zero cash flows:
def reindex_by_date_and_fill(df, groupby_column='ID', value_column='Flow'):
    # build the full daily date range spanned by this group
    dates = pd.date_range(df.index.min(), df.index.max())
    # forward-fill the ID over the inserted rows; fill missing flows with 0
    return pd.concat([df.reindex(dates)[groupby_column].ffill(),
                      df.reindex(dates, fill_value=0)[value_column]],
                     axis=1)

df_test = (df.set_index('Date')
             .groupby('ID', as_index=False)
             .apply(reindex_by_date_and_fill)
             .reset_index(0, drop=True))
I got this idea from the below two posts:
Add missing dates to pandas dataframe
Pandas reindex dates in Groupby
Basically, it fills in the missing days and populates their cash flows with zeros. This gives you a daily frequency across each investment while still preserving the periods in which the cash flows were paid back.
In [54]: df_test.head(10)
Out[54]:
             ID  Flow
2018-01-01  1.0   -10
2018-01-02  1.0     2
2018-01-03  1.0     2
2018-01-04  1.0    10
2018-01-05  2.0   -50
2018-01-06  2.0    25
2018-01-07  2.0    20
2018-01-08  2.0     0
2018-01-09  2.0     0
2018-01-10  2.0     0
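As a sanity check (my addition, not part of the original answer), the filled flows for ID 3 should reproduce the manually padded cfs array from the question:

```python
import pandas as pd
import numpy as np

date_list = ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
             '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-14',
             '2018-01-21', '2018-01-31', '2018-02-08', '2018-02-28']
ids_list = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
flows_list = [-10, 2, 2, 10, -50, 25, 20, 20, -100, 0, 3, 150]
df = pd.DataFrame({'Date': pd.to_datetime(date_list), 'ID': ids_list,
                   'Flow': flows_list})

def reindex_by_date_and_fill(df, groupby_column='ID', value_column='Flow'):
    dates = pd.date_range(df.index.min(), df.index.max())
    return pd.concat([df.reindex(dates)[groupby_column].ffill(),
                      df.reindex(dates, fill_value=0)[value_column]], axis=1)

df_test = (df.set_index('Date')
             .groupby('ID', as_index=False)
             .apply(reindex_by_date_and_fill)
             .reset_index(0, drop=True))

# manual zero-padding for ID 3, as in the question
cfs = np.zeros(39)
cfs[[0, 10, 18, 38]] = [-100, 0, 3, 150]
match = np.array_equal(df_test.loc[df_test.ID == 3, 'Flow'].to_numpy(), cfs)
```

If match is True, the reindexed frame carries exactly the same daily series the manual calculation used, so the groupby aggregation below must give the same IRRs.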
This then allows you to use groupby:
In [60]: df_test.groupby(['ID'])['Flow'].agg(np.irr)
Out[60]:
ID
1.0    0.141962
2.0    0.082212
3.0    0.011386
Name: Flow, dtype: float64
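One caveat on speed (my addition): np.irr works by finding polynomial roots, and the zero-padding makes the polynomial degree equal to the number of days between the first and last flow, so long-lived investments get expensive fast. Under the assumption that each investment has a single sign change (hence one IRR), a plain bisection on the NPV function can use the day offsets directly and skip the padding entirely; a sketch:

```python
import numpy as np

def irr_sparse(days, flows, lo=-0.99, hi=10.0, iters=100):
    # NPV as a function of the *daily* rate, using day offsets directly,
    # so cost scales with the number of flows, not the number of days
    npv = lambda r: np.sum(flows / (1.0 + r) ** days)
    f_lo = npv(lo)
    for _ in range(iters):  # plain bisection; assumes one sign change in [lo, hi]
        mid = 0.5 * (lo + hi)
        if (npv(mid) > 0) == (f_lo > 0):
            lo, f_lo = mid, npv(mid)
        else:
            hi = mid
    return 0.5 * (lo + hi)

# ID 3 from above: flows on days 0, 10, 18 and 38
rate = irr_sparse(np.array([0, 10, 18, 38]),
                  np.array([-100.0, 0.0, 3.0, 150.0]))
```

For ID 3 this converges to the same daily rate as the padded np.irr call (about 0.011386), and the bracket endpoints and iteration count here are assumptions you would tune to your data.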