numpy, best-fit

How to perform linear regression with numpy.polyfit and print error statistics?


I am trying to figure out how to use the np.polyfit function, and the documentation confuses me. In particular, I want to perform linear regression and print related statistics such as the sum of squared errors (SSE). Can someone provide a clear, concise explanation, ideally with a minimal working example?


Solution

  • np.polyfit returns an array of coefficients (highest-degree term first) parametrizing the best-fitting polynomial of degree deg. To fit a line, use deg = 1. If you also pass full = True, polyfit will additionally return the residual (the sum of squared errors), along with some other diagnostic information about the fit, which we can simply discard.

    Altogether, then, we might have something like

    import matplotlib.pyplot as plt
    import numpy as np
    
    # Generate some toy data.
    x = np.random.rand(25)
    y = 2 * x + 0.5 + np.random.normal(scale=0.05, size=x.size)
    
    # Fit the trend line.
    (m, b), (SSE,), *_ = np.polyfit(x, y, deg=1, full=True)
    
    # Plot the original data.
    plt.scatter(x, y, color='k')
    
    # Plot the trend line.
    line_x = np.linspace(0, 1, 200)
    plt.plot(line_x, m * line_x + b, color='r')
    
    plt.title(f'slope = {round(m, 3)}, int = {round(b, 3)}, SSE = {round(SSE, 3)}')
    plt.show()
    

    The *_ in the unpacking just tells Python to collect and discard however many additional values polyfit returns; the documentation describes these extra values if you're interested. We have to unpack the SSE as (SSE,) because polyfit returns the residuals as a length-one array. This code produces a scatter plot of the data with the red trend line overlaid and the fit statistics shown in the title.
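
    As a quick sanity check on what that residual value means, here is a minimal sketch (reusing x, y, m, b, and SSE from the code above) that recomputes the sum of squared errors by hand and compares it to what polyfit reported:

    # Recompute the SSE directly from the fitted line.
    sse_manual = np.sum((y - (m * x + b)) ** 2)
    
    # The two values should agree up to floating-point rounding.
    print(SSE, sse_manual)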

    You might also like to know about np.polyval, which takes a sequence of polynomial coefficients (highest-degree term first) and evaluates the corresponding polynomial at the input points you give it.
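
    For instance, a minimal sketch of polyval (reusing the m and b fitted above) might look like the following; the points 0, 0.5, and 1 are just arbitrary example inputs:

    # Coefficients go highest degree first, matching polyfit's output order.
    coeffs = (m, b)
    
    # Evaluate the fitted line y = m*x + b at a few sample points.
    print(np.polyval(coeffs, [0.0, 0.5, 1.0]))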