I am figuring out how to use the np.polyfit function and the documentation confuses me. In particular, I am trying to perform linear regression and print related statistics like the sum of squared errors (SSE). Can someone provide a clear and concise explanation, possibly with a minimal working example?
np.polyfit returns an array containing the coefficients of the best-fitting polynomial of degree deg, ordered from the highest power to the lowest. To fit a line, use deg = 1, which gives the slope followed by the intercept. You can also get the residual (the sum of squared errors) by passing full = True as an argument to polyfit. Note that with this argument, polyfit returns a tuple: the coefficients, the residual, and some other information about the fit, which we can just discard.
Altogether, then, we might have something like
import matplotlib.pyplot as plt
import numpy as np
# Generate some toy data.
x = np.random.rand(25)
y = 2 * x + 0.5 + np.random.normal(scale=0.05, size=x.size)
# Fit the trend line.
(m, b), (SSE,), *_ = np.polyfit(x, y, deg=1, full=True)
# Plot the original data.
plt.scatter(x, y, color='k')
# Plot the trend line.
line_x = np.linspace(0, 1, 200)
plt.plot(line_x, m * line_x + b, color='r')
plt.title(f'slope = {round(m, 3)}, int = {round(b, 3)}, SSE = {round(SSE, 3)}')
plt.show()
The *_ notation in the call to polyfit just tells Python to discard however many additional values are returned by the function. The documentation can tell you about these extra values if you're interested. We have to unpack the SSE as a one-element tuple (SSE,) because polyfit returns it as a singleton array rather than a plain number. This code produces something like this plot.
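If you want to convince yourself that the second returned value really is the SSE, a minimal sketch is to name all the returned values and recompute the sum of squared errors by hand. The seed and the variable names here are my own choices for illustration, not anything polyfit requires.

```python
import numpy as np

# Toy data with a fixed seed so the run is repeatable (the seed is arbitrary).
rng = np.random.default_rng(0)
x = rng.random(25)
y = 2 * x + 0.5 + rng.normal(scale=0.05, size=x.size)

# With full=True, polyfit returns five items instead of just the coefficients.
coeffs, residuals, rank, singular_values, rcond = np.polyfit(x, y, deg=1, full=True)
m, b = coeffs

# Recompute the sum of squared errors directly from the fitted line.
sse_manual = np.sum((y - (m * x + b)) ** 2)

print(residuals.shape)                        # a one-element array for 1-D y
print(np.isclose(residuals[0], sse_manual))   # the two values should agree
```

One caveat: the residuals array can come back empty, for example when there are no more data points than coefficients, in which case the (SSE,) unpacking would fail.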
You might also like to know about np.polyval, which takes a sequence of polynomial coefficients (highest power first, the same order polyfit returns) and evaluates the corresponding polynomial at the input points.
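As a small sketch of polyval, here it is with hand-picked coefficients for the line 2x + 0.5; the values are chosen only for illustration. In practice you would pass in the coefficients that polyfit gave you.

```python
import numpy as np

# Coefficients for 2*x + 0.5, highest power first (the order polyfit uses).
coeffs = [2.0, 0.5]

# Evaluate the polynomial at a few input points.
xs = np.array([0.0, 0.5, 1.0])
ys = np.polyval(coeffs, xs)

print(ys)  # [0.5 1.5 2.5]
```

In the plotting code above, this means m * line_x + b could equally be written as np.polyval((m, b), line_x), which is handy once the degree is greater than 1.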