How can I get prediction intervals / prediction bands with a SciPy's curve fitting function?
More specifically, how do I get these prediction bands for the hyperbolic curve typically used in decline curve analysis?
Any help would be appreciated.
import pandas as pd
import numpy as np
from datetime import timedelta
from scipy.optimize import curve_fit
def hyperbolic_equation(t, qi, b, di):
return qi/((1.0+b*di*t)**(1.0/b))
df1 = pd.DataFrame({ 'cumsum_days': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
'prod': [800, 900, 1200, 700, 600,
550, 500, 650, 625, 600,
550, 525, 500, 400, 350]})
qi = max(df1['prod'])
#Hyperbolic curve fit the data to get best fit equation
popt_hyp, pcov_hyp = curve_fit(hyperbolic_equation, df1['cumsum_days'], df1['prod'],bounds=(0, [qi,1,20]))
#Passing t to estimate the coefficients:
def fitted_hyperbolic_equation(t):
return popt_hyp[0]/((1.0+popt_hyp[1]*popt_hyp[2]*t)**(1.0/popt_hyp[1]))
#Creating future time to predict on:
df2 = pd.DataFrame({ 'future_days': [16,17,18,19,20]})
fitted_hyperbolic_equation(df2.future_days)
16 388.259631
17 368.389649
18 349.754534
19 332.264306
20 315.836485
I have my future values, but how do I generate confidence / prediction bands (95%) with SciPy? Any help would be appreciated.
I'm not sure I fully understand, but I think you are asking for uncertainties in the predicted values from a curve fitting model.
I would recommend using lmfit
for this (disclaimer: I am an author) as it provides methods to do such calculations. I'm afraid your model and data do not match very well so that the uncertainties are very large
Using lmfit
and using plain numpy
array instead of pandas
dataframes (those can be used, but they're a distraction here - the fits really need numpy
arrays), your analysis might look like this:
import numpy as np
from lmfit import Model
import matplotlib.pyplot as plt
cumsum_days = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
prod = np.array([800, 900, 1200, 700, 600, 550, 500, 650, 625, 600, 550,
525, 500, 400, 350])
# plot data
plt.plot(cumsum_days, prod, 'bo', label='data')
def hyperbolic_equation(t, qi, b, di):
return qi/((1.0+b*di*t)**(1.0/max(b, 1.e-50)))
# build Model
hmodel = Model(hyperbolic_equation)
# create lmfit Parameters, named from the arguments of `hyperbolic_equation`
# note that you really must provide initial values.
params = hmodel.make_params(qi=1000, b=0.5, di=0.1)
# set bounds on parameters
params['qi'].min=0
params['b'].min=0
params['di'].min=0
# do fit, print resulting parameters
result = hmodel.fit(prod, params, t=cumsum_days)
print(result.fit_report())
# plot best fit: not that great of fit, really
plt.plot(cumsum_days, result.best_fit, 'r--', label='fit')
# calculate the (1 sigma) uncertainty in the predicted model
# and plot that as a confidence band
dprod = result.eval_uncertainty(result.params, sigma=1)
plt.fill_between(cumsum_days,
result.best_fit-dprod,
result.best_fit+dprod,
color="#AB8888",
label='uncertainty band of fit')
# now evaluate the model for other values, predicting future values
future_days = np.array([16,17,18,19,20])
future_prod = result.eval(t=future_days)
plt.plot(future_days, future_prod, 'k--', label='prediction')
# ...and calculate the 1-sigma uncertainty in the future prediction
# for 95% confidence level, you'd want to use `sigma=2` here:
future_dprod = result.eval_uncertainty(t=future_days, sigma=1)
print("### Prediction\n# Day Prod Uncertainty")
for day, prod, eps in zip(future_days, future_prod, future_dprod):
print(" {:.1f} {:.1f} +/- {:.1f}".format(day, prod, eps))
plt.fill_between(future_days,
future_prod-future_dprod,
future_prod+future_dprod,
color="#ABABAB",
label='uncertainty band of prediction')
plt.legend(loc='lower left')
plt.show()
This will print out resulting fit statistics and parameter values of
[[Model]]
Model(hyperbolic_equation)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 21
# data points = 15
# variables = 3
chi-square = 238946.482
reduced chi-square = 19912.2068
Akaike info crit = 151.139170
Bayesian info crit = 153.263321
[[Variables]]
qi: 993.608482 +/- 163.710950 (16.48%) (init = 1000)
b: 0.22855837 +/- 2.07615175 (908.37%) (init = 0.5)
di: 0.06551315 +/- 0.06250023 (95.40%) (init = 0.1)
[[Correlations]] (unreported correlations are < 0.100)
C(b, di) = 0.963
C(qi, di) = 0.888
C(qi, b) = 0.771
### Prediction
# Day Prod Uncertainty
16.0 388.258 +/- 1080.106
17.0 368.387 +/- 1106.336
18.0 349.752 +/- 1130.091
19.0 332.261 +/- 1151.634
20.0 315.833 +/- 1171.196
and give a plot like this:
In your question, you did not inspect the quality of the fit either statistically or graphically. Really, you will want to do this.
You also used curve_fit
without providing initial values. Although no underlying fitting routine will ever support that and all require explicit initial values, curve_fit
permits this without warning or justification and asserts that all starting values will be 1.0
. Really, you have to provide initial values.