python machine-learning linear-regression

Calculate residual values from the training set or the test set


I want to perform residual analysis, and I know that residuals equal the observed values minus the predicted ones. But I don't know whether I should calculate the residuals from the training set or the test set.

Should I use this:

import statsmodels.api as sm

# Fit the model on the training set
lm = sm.OLS(y_train, X_train).fit()

# Predict on the training set and compute the in-sample residuals
y_pred = lm.predict(X_train)
resid = y_train - y_pred.to_frame('price')

OR this:

import statsmodels.api as sm

# Fit the model on the training set
lm = sm.OLS(y_train, X_train).fit()

# Predict on the test set and compute the out-of-sample residuals
y_pred = lm.predict(X_test)
resid = y_test - y_pred.to_frame('price')

Solution

  • The residuals should be computed from the actual values of the test set, y_test, and the values the fitted model predicts for X_test. The model is fitted to the training set, and its accuracy is then checked on the test set. That is how I see it intuitively: it is the main reason the two datasets are called train (for training) and test (for testing) in the first place.

    Specifically, use the second case

    resid = y_test - y_pred.to_frame('price')