import os
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
csv_path = os.path.join('', 'graph.csv')
graph = pd.read_csv(csv_path)
y = graph['y'].copy()
x = graph.drop('y', axis=1)
pipeline = Pipeline([('pf', PolynomialFeatures(2)), ('clf', LinearRegression())])
pipeline.fit(x, y)
predict = [[16], [20], [30]]
plt.plot(x, y, '.', color='blue')
plt.plot(x, pipeline.predict(x), '-', color='black')
plt.plot(predict, pipeline.predict(predict), 'o', color='red')
plt.show()
My graph.csv:
x,y
1,1
2,2
3,3
4,4
5,5
6,5.5
7,6
8,6.25
9,6.4
10,6.6
11,6.8
The result produced:
It is clearly producing wrong predictions: y should increase with each x.
What am I missing? I tried changing the degree, but it doesn't get much better. With a degree of 4, for example, y increases very rapidly.
@iacob provided a very good answer, which I will only extend.
If you are certain that "with each x, y should increase", then perhaps your data points follow a logarithmic pattern. Adapting your code for that yields this curve:
Here is the code snippet if that corresponds to what you are looking for:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
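# Load the data; graph.csv is expected in the current working directory
csv_path = os.path.join('', 'graph.csv')
graph = pd.read_csv(csv_path)
y = graph['y'].copy()
x = graph.drop('y', axis=1)
# Fit on log(x) instead of x
x_log = np.log(x)
# PolynomialFeatures(1) keeps the model linear in log(x)
pipeline = Pipeline([('pf', PolynomialFeatures(1)), ('clf', LinearRegression())])
pipeline.fit(x_log, y)
# New x values must be log-transformed the same way before predicting
predict = np.log([[16], [20], [30]])
# np.exp undoes the log so the plot uses the original x scale
plt.plot(np.exp(x_log), y, '.', color='blue')
plt.plot(np.exp(x_log), pipeline.predict(x_log), '-', color='black')
plt.plot(np.exp(predict), pipeline.predict(predict), 'o', color='red')
plt.show()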
Note that this is still polynomial regression (here degree 1, i.e. plain linear regression, is sufficient), only applied to the logarithm of the x values (x_log).
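To see why the two models extrapolate so differently, here is a minimal sketch that compares them numerically. It is not the pipeline from the code above: it uses np.polyfit as a stand-in for PolynomialFeatures + LinearRegression (both are ordinary least-squares polynomial fits), and it hard-codes the values from the graph.csv listing so it runs without the file. The variable names are my own.

```python
import numpy as np

# Data copied from the graph.csv listing in the question.
x = np.arange(1, 12, dtype=float)
y = np.array([1, 2, 3, 4, 5, 5.5, 6, 6.25, 6.4, 6.6, 6.8])
x_new = np.array([16.0, 20.0, 30.0])

# Degree-2 least-squares fit on raw x (coefficients are highest power first).
quad = np.polyfit(x, y, 2)
quad_pred = np.polyval(quad, x_new)
# x position of the parabola's turning point: -b / (2a).
vertex_x = -quad[1] / (2 * quad[0])

# Degree-1 least-squares fit on log(x).
slope, intercept = np.polyfit(np.log(x), y, 1)
log_pred = slope * np.log(x_new) + intercept

print("quadratic leading coefficient:", quad[0])
print("parabola turning point at x =", vertex_x)
print("quadratic predictions:", quad_pred)
print("log-fit slope:", slope)
print("log-fit predictions:", log_pred)
```

Because the data flattens out, the fitted x² coefficient should come out negative, so the parabola bends back down past its turning point and the predictions at 16, 20 and 30 fall. The log fit has a single positive slope, so its predictions keep rising, only more slowly. The same reasoning suggests why a higher degree does not help: far outside the data, the highest power dominates.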