I have historical data for crop yield, annual temperature and annual precipitation for a given region. My goal is to estimate the following linear model:
Here y is the annual crop yield, t is time (year), tmp is the annual average temperature, and p is the annual total precipitation. The squared terms are meant to capture the influence of extreme values.
My code is:
import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv('https://raw.githubusercontent.com/kevinkuranyi/data/main/crop_yield.csv')
model = smf.ols(formula = 'y_banana ~ year+year2+tmp+tmp2+pre+pre2+tmp_pre+tmp2_pre2',
data=df, missing='drop').fit(cov_type='HAC', cov_kwds={'maxlags': 2})
model.summary()
Running this gives the following warning:
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:1888: ValueWarning: covariance of constraints does not have full rank. The number of constraints is 8, but rank is 5
warnings.warn('covariance of constraints does not have full '
I suspected multicollinearity, but no matter which variable I omit, as long as I include more than 4 variables (even without the interaction terms or the squared values, which could be linear combinations), I get this warning. I included several combinations as examples in this Colab notebook.
What could be the problem?
You are using polynomials of badly scaled data.
Calendar year and calendar year squared are badly scaled. For a trend or similar, use e.g. year - year0. Judging by its very large standard error, tmp has a similar problem.
Plot the polynomial functions and check that the values are approximately in the same range. For best behavior the data should be rescaled to a small range, e.g. interval [0,1] or largest value below 10.
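To illustrate the scaling issue, here is a minimal sketch that compares the condition number of the trend part of the design matrix (constant, year, year squared) under three scalings. The year range 1961 to 2020 is an assumption made for illustration and is not taken from the question's CSV; the helper name design is also hypothetical.

```python
import numpy as np

# Assumed stand-in for the question's data: 60 consecutive calendar years.
year = np.arange(1961, 2021, dtype=float)

def design(x):
    """Design matrix with columns: constant, x, x**2."""
    return np.column_stack([np.ones_like(x), x, x**2])

raw = design(year)                                   # calendar year as-is
shifted = design(year - year.min())                  # year - year0
unit = design((year - year.min()) / (year.max() - year.min()))  # rescaled to [0, 1]

cond_raw = np.linalg.cond(raw)
cond_shifted = np.linalg.cond(shifted)
cond_unit = np.linalg.cond(unit)

# A larger condition number means a numerically more fragile least-squares problem.
print(f"raw: {cond_raw:.3g}, shifted: {cond_shifted:.3g}, unit: {cond_unit:.3g}")
```

The same idea would apply to tmp and pre: center or rescale them before forming squares and interactions, then refit the model with the transformed columns.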
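NumPy's polynomial module can rescale the base variable automatically: as far as I can tell this happens in the fit method of its polynomial classes (e.g. Polynomial.fit, which maps the data range to the window [-1, 1] by default), while the vander functions build the design matrix from whatever variable they are given.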
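As a rough sketch of that automatic rescaling, the example below fits a quadratic trend with numpy.polynomial.Polynomial.fit, which maps the data range to the window [-1, 1] before fitting. The years and the simulated yield series are made up for illustration and do not come from the question's data.

```python
import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial.polynomial import polyvander

# Assumed data: 60 calendar years and a simulated yield with a quadratic trend.
year = np.arange(1961, 2021, dtype=float)
rng = np.random.default_rng(0)
t = year - year[0]
y = 10.0 + 0.3 * t - 0.002 * t**2 + rng.normal(scale=0.5, size=year.size)

# Polynomial.fit rescales `year` from its data range (domain) to the window.
fit = Polynomial.fit(year, y, deg=2)
print(fit.domain)  # the data range of `year`
print(fit.window)  # the default window, [-1, 1]

# mapparms() returns the offset and scale of the linear map domain -> window.
off, scl = fit.mapparms()
year_scaled = off + scl * year

# Vandermonde (design) matrix in the rescaled variable: columns 1, x, x**2.
V = polyvander(year_scaled, 2)
print(np.linalg.cond(V))
```

The rescaled column year_scaled, or the columns of V, could then be passed to the regression in place of year and year2.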
A related blog post that I wrote a long time ago. https://jpktd.blogspot.com/2012/03/numerical-accuracy-in-linear-least.html