I want to very accurately regress a target that depends non-linearly on a single variable x. In my scikit-learn pipeline, I use:
pipe = Pipeline([('poly', PolynomialFeatures(3, include_bias=False)),
                 ('regr', ElasticNet(random_state=0))])
This appears to give results similar to np.polyfit(x, y, 3) in terms of accuracy. However, I can get essentially to machine precision by using cubic splines. See the figure below, where I show the data and the various fits, along with the residual errors. [Note: the example below has 50 samples; I have 2000 samples in reality.]
I have two questions:

1. Why aren't polyfit or polyfeat + ElasticNet able to reach the same level of accuracy?
2. Is there a way to reach the same accuracy with scikit-learn?
Here is the code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
%matplotlib inline
data = pd.read_csv('example.txt') # added to this post below
# fit a single global cubic polynomial
p = np.polyfit(data['x'], data['y'], 3)
data['polyfit'] = np.poly1d(p)(data['x'])
# interpolate with piecewise cubic splines
f = interp1d(data['x'], data['y'], kind='cubic')
data['spline'] = f(data['x'])
fig, axes = plt.subplots(nrows=3, sharex=True)
axes[0].plot(data['x'], data['polyfit'],'.', label='polyfit')
axes[0].plot(data['x'], data['spline'],'.', label='spline')
axes[0].plot(data['x'], data['y'],'.', label='true')
axes[0].legend()
axes[1].plot(data['x'], data['polyfit']-data['y'],'.', label='error polyfit')
axes[1].legend()
axes[2].plot(data['x'], data['spline']-data['y'],'.', label='error spline')
axes[2].legend()
plt.show()
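Since the figure itself is not reproduced here, the same comparison can be read off numerically; this short check (using the DataFrame built above) prints the worst-case residual of each fit:

# worst-case absolute residuals; per the figure, the spline error
# should sit near machine precision while the polyfit error does not
print('max |error| polyfit:', (data['polyfit'] - data['y']).abs().max())
print('max |error| spline :', (data['spline'] - data['y']).abs().max())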
Here are the data:
example.txt:
,x,y
257,6.26028462060192,-1233.748349982897
944,4.557099191827032,928.1430280794456
1560,6.765081341690966,-1807.9090703632864
504,4.0015671921214775,1683.311523022658
1499,3.0496689401255783,3055.291788377236
1247,5.608726443314061,-441.9226126757499
1856,4.6124942196224845,845.129184983355
1495,1.273838638033053,5479.078773760113
1052,5.353775782315625,-115.14032709875217
247,2.6495185259267076,3656.7467318648232
1841,9.73337795053495,-4884.806993807511
1574,1.1772247845133335,5544.080005636716
1116,5.698561786140379,-555.3435567718
1489,4.184371293153768,1427.6922357286753
603,1.568868565047676,5179.156099377357
358,4.534081088923849,960.3983442022857
774,9.304809492028289,-4468.215701489676
1525,9.17423541311121,-4340.565494266174
1159,6.705834877066449,-1750.189447626367
1959,3.0431599461645207,3065.358649171256
1086,1.3861557136230234,5378.274828554064
81,4.728366950632029,682.7245723055514
1791,6.954198834068505,-2027.0414501796324
234,2.8672306789699844,3330.7282514295102
1850,2.0086469278742363,4603.0931759401155
1531,9.843164998128215,-4973.735518791005
903,1.534448692052103,5220.331847067942
1258,7.243723209152924,-2354.629822080041
645,2.3302780902754514,4128.077572586273
1425,3.295574067849755,2694.766296765896
311,2.3225198086033756,4152.206609684557
219,8.479436097125713,-3665.2515034579396
1917,7.1524135031820135,-2253.3455629418195
1412,6.79800860136838,-1861.3756670478142
705,1.9001265482939966,4756.283634364785
663,3.441268690856777,2489.7632239249424
1871,6.473544271091015,-1480.6593600880415
1897,8.217615163361007,-3386.5427698021977
558,6.609652057181176,-1634.1672307700298
553,5.679571371137544,-524.352981663938
1847,6.487178186324092,-1500.1891501936236
752,9.377368455681758,-4548.188126821915
1469,8.586759667609758,-3771.691600599668
1794,6.649801445466815,-1674.4870918398076
968,1.6226439291315056,5117.8804886837
108,3.0077346937655647,3118.0786841570025
96,6.278616413290749,-1245.4758811316083
994,7.631678455127069,-2767.3224262153176
871,2.6696610777085863,3630.02481913033
1405,9.209358577104299,-4368.622350004463
The former two methods each fit a single cubic polynomial to your data, but (as the name implies) interp1d interpolates the data with cubic splines: that is, a separate cubic curve is fitted between each consecutive pair of points, so you are guaranteed a perfect fit at the sample points (up to computational precision).
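To make this concrete, here is a minimal sketch on synthetic data (the sine target, sample count, and seed are illustrative, not taken from the question) that counts the coefficients each method actually has at its disposal:

import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 50))
y = 1000 * np.sin(x)  # any smooth, non-cubic target

# Global fit: one cubic, i.e. 4 coefficients shared by all 50 points.
coeffs = np.polyfit(x, y, 3)
print(coeffs.shape)                            # (4,)
print(np.abs(np.poly1d(coeffs)(x) - y).max())  # large, structured residual

# Spline interpolation: one cubic per interval, i.e. 4 * 49 coefficients,
# each piece constrained to pass through its two endpoints exactly.
cs = CubicSpline(x, y)
print(cs.c.shape)                              # (4, 49)
print(np.abs(cs(x) - y).max())                 # ~ machine precision

With 4 parameters against 50 (or 2000) samples, the global cubic is heavily over-determined and can only minimise the residual, while the spline has enough local freedom to drive the residual to zero at every knot.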