I'm trying to conduct the residual analysis for simple linear regression. I need to prove that the residuals follow an approximate Normal Distribution.
The csv file I'm using has values for Percentage of marks in Grade 10 and the Salary the student makes.
Once I run the below code, my plot looks like this:
The plot in the book looks like this:
I was expecting my plot to show up like the book as the data is the same. I have double-checked to make sure I'm not missing any data etc. I have split the data set into training and test as per the book as well.
Data is as follows:
Percentage Salary
62 270000
76.33 200000
72 240000
60 250000
61 180000
55 300000
70 260000
68 235000
82.8 425000
59 240000
58 250000
60 180000
66 428000
83 450000
68 300000
37.33 240000
79 252000
68.4 280000
70 231000
59 224000
63 120000
50 260000
69 300000
52 120000
49 120000
64.6 250000
50 180000
74 218000
58 360000
67 150000
75 250000
60 200000
55 300000
78 330000
50.08 265000
56 340000
68 177600
52 236000
54 265000
52 200000
76 393000
64.8 360000
74.4 300000
74.5 250000
73.5 360000
57.58 180000
68 180000
69 270000
66 240000
60.8 300000
The code is below:
# Importing all required libraries for building the regression model
import pandas as pd import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Load the dataset into dataframe
mba_salary_df = pd.read_csv( 'MBA Salary.csv' )
# Add constant term of 1 to the dataset
X = sm.add_constant( mba_salary_df[‘Percentage in Grade 10’] )
Y = mba_salary_df['Salary']
# Split dataset into train and test set into 80:20 respectively
train_X, test_X, train_y, test_y = train_test_split( X, Y, train_size = 0.8,random_state = 100 )
# Fit the regression model
mba_salary_lm = sm.OLS( train_y, train_X ).fit()
mba_salary_resid = mba_salary_lm.resid
probplot = sm.ProbPlot(mba_salary_resid)
plt.figure( figsize = (8, 6) )
probplot.ppplot(line='45')
plt.title("Normal P-P Plot of Regression Standardized Residuals")
plt.show()
So, if I understand correctly, you are trying to get the residual part of a linear regression (so the error) on your training dataset, and check if the distribution of that residual part follows a normal law.
But ppplot
or qqplot
need to know which law you want to compare your dataset against.
As you probably understand, what it does is, for each sample data s
, plot a point whose x
coordinate is the theoretical CDF of a distribution, and y
coordinate is the experimental CDF (so which proportion of s
in your dataset are lower that this s
).
If you don't specify a distribution function, then, a centered and reduced normal law (μ=0, σ=1) is used by default. But your residual data have a scale way bigger than a standard deviation of 1. So, in practice, all your residuals are very very negative, or very very positive (from a standpoint of a N(0,1) law). I mean by that that either s is so negative than ℙ(X<s) is practically 0, or s is so positive that ℙ(X<s) is practically 1. (for X~N(0,1) that happens for any s lower than -3, or greater than +3, roughly. As you know ℙ(X<2.58)=99%... And 2.58 or 3 is very small compared to your values).
So, in short, you need to say against which law you want to test your residual. If you don't, the default is a N(0,1) law that is obviously not similar at all to the distribution of your residuals (in other words: it works! and the pp-plot being bad indicates exactly what it is supposed to indicate: that, no, your residual does not follow at all a N(0,1) law).
If you have no idea of that law (well, you already said you wanted to test against a normal law), maybe you want to fit one. Either by centering/reducing your data before (to that they are indeed supposed to follow approx a N(0,1)).
Or by computing mean and stdev of your residual data and pass them as loc and scale argument to ProbPlot
.
Or create a normal law yourself (sta.norm(mean, stdev)
) and pass that law to ProbPlot
Or, even simpler, ask ProbPlot
to fit the parameters for you (in which case, it fits a normal law. You can't choose another kind of distribution, like a weibull or Cauchy or... ; but I understand you don't want to anyway)
So, long story short, if I understand correctly what you want to do
probplot = sm.ProbPlot(mba_salary_resid, fit=True)