pythonmachine-learninglinear-regressionstatsmodels

Residual Analysis for simple linear regression model


I'm trying to conduct the residual analysis for simple linear regression. I need to prove that the residuals follow an approximate Normal Distribution.

The csv file I'm using has values for Percentage of marks in Grade 10 and the Salary the student makes.
Once I run the below code, my plot looks like this:

Plot of residuals from code

The plot in the book looks like this:

Plot of residuals from the book

I was expecting my plot to show up like the book as the data is the same. I have double-checked to make sure I'm not missing any data etc. I have split the data set into training and test as per the book as well.

Data is as follows:

Percentage Salary
62  270000
76.33   200000
72  240000
60  250000
61  180000
55  300000
70  260000
68  235000
82.8    425000
59  240000
58  250000
60  180000
66  428000
83  450000
68  300000
37.33   240000
79  252000
68.4    280000
70  231000
59  224000
63  120000
50  260000
69  300000
52  120000
49  120000
64.6    250000
50  180000
74  218000
58  360000
67  150000
75  250000
60  200000
55  300000
78  330000
50.08   265000
56  340000
68  177600
52  236000
54  265000
52  200000
76  393000
64.8    360000
74.4    300000
74.5    250000
73.5    360000
57.58   180000
68  180000
69  270000
66  240000
60.8    300000

The code is below:

# Importing all required libraries for building the regression model
import pandas as pd import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load the dataset into dataframe
mba_salary_df = pd.read_csv( 'MBA Salary.csv' )

# Add constant term of 1 to the dataset
X = sm.add_constant( mba_salary_df[‘Percentage in Grade 10’] )
Y = mba_salary_df['Salary']

# Split dataset into train and test set into 80:20 respectively
train_X, test_X, train_y, test_y = train_test_split( X, Y, train_size = 0.8,random_state = 100 )

# Fit the regression model
mba_salary_lm = sm.OLS( train_y, train_X ).fit()

mba_salary_resid = mba_salary_lm.resid 

probplot = sm.ProbPlot(mba_salary_resid) 

plt.figure( figsize = (8, 6) ) 

probplot.ppplot(line='45') 

plt.title("Normal P-P Plot of Regression Standardized Residuals") 

plt.show()

Solution

  • So, if I understand correctly, you are trying to get the residual part of a linear regression (so the error) on your training dataset, and check if the distribution of that residual part follows a normal law.

    But ppplot or qqplot need to know which law you want to compare your dataset against.

    As you probably understand, what it does is, for each sample data s, plot a point whose x coordinate is the theoretical CDF of a distribution, and y coordinate is the experimental CDF (so which proportion of s in your dataset are lower that this s).

    If you don't specify a distribution function, then, a centered and reduced normal law (μ=0, σ=1) is used by default. But your residual data have a scale way bigger than a standard deviation of 1. So, in practice, all your residuals are very very negative, or very very positive (from a standpoint of a N(0,1) law). I mean by that that either s is so negative than ℙ(X<s) is practically 0, or s is so positive that ℙ(X<s) is practically 1. (for X~N(0,1) that happens for any s lower than -3, or greater than +3, roughly. As you know ℙ(X<2.58)=99%... And 2.58 or 3 is very small compared to your values).

    So, in short, you need to say against which law you want to test your residual. If you don't, the default is a N(0,1) law that is obviously not similar at all to the distribution of your residuals (in other words: it works! and the pp-plot being bad indicates exactly what it is supposed to indicate: that, no, your residual does not follow at all a N(0,1) law).

    If you have no idea of that law (well, you already said you wanted to test against a normal law), maybe you want to fit one. Either by centering/reducing your data before (to that they are indeed supposed to follow approx a N(0,1)). Or by computing mean and stdev of your residual data and pass them as loc and scale argument to ProbPlot. Or create a normal law yourself (sta.norm(mean, stdev)) and pass that law to ProbPlot

    Or, even simpler, ask ProbPlot to fit the parameters for you (in which case, it fits a normal law. You can't choose another kind of distribution, like a weibull or Cauchy or... ; but I understand you don't want to anyway)

    So, long story short, if I understand correctly what you want to do

    probplot = sm.ProbPlot(mba_salary_resid, fit=True) 
    

    is probably what your want enter image description here