Tags: machine-learning, linear-regression, gradient-descent, unsupervised-learning, calculus

Implementing linear regression from scratch in Python


I'm trying to implement linear regression in Python using the following gradient descent update formulas for the slope and the y-intercept (note that these are the forms obtained after taking the partial derivatives):

w = w - a * (1/m) * sum((y_hat_i - y_i) * x_i)
b = b - a * (1/m) * sum(y_hat_i - y_i)
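For context, these updates come from the squared-error cost (the same one calculate_error_cost computes below); its partial derivatives with respect to w and b are:

$$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2, \qquad \hat{y}_i = w x_i + b$$

$$\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)x_i, \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)$$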

but the code keeps giving me weird results. I think (I'm not sure) that the error is in the gradient_descent function.

import numpy as np


class LinearRegression:
    def __init__(self , x:np.ndarray ,y:np.ndarray):
        self.x = x
        self.m = len(x)
        self.y = y


    def calculate_predictions(self ,slope:int , y_intercept:int) -> np.ndarray: # Calculate y hat.
        predictions = []

        for x in self.x:
            predictions.append(slope * x + y_intercept)

        return predictions

    def calculate_error_cost(self , y_hat:np.ndarray) -> int:
        error_valuse = []
        for i in range(self.m):
            error_valuse.append((y_hat[i] - self.y[i] )** 2)

        error = (1/(2*self.m)) * sum(error_valuse)
    
        return error
    

    def gradient_descent(self):
        costs = []

        # initialization values        
        temp_w = 0
        temp_b = 0
        
        a = 0.001 # Learning rate

        while True:
            y_hat = self.calculate_predictions(slope=temp_w , y_intercept= temp_b)
            
            sum_w = 0
            sum_b = 0

            for i in range(len(self.x)):
                sum_w += (y_hat[i] - self.y[i] ) * self.x[i]
                sum_b += (y_hat[i] - self.y[i] )

            w = temp_w - a * ((1/self.m) *sum_w)
            b = temp_b - a * ((1/self.m) *sum_b)
            temp_w = w
            temp_b = b


            costs.append(self.calculate_error_cost(y_hat))

            try:
                if costs[-1] > costs[-2]: # If global minimum reached
                    return [w,b]
            except IndexError:
                pass

I used this dataset: https://www.kaggle.com/datasets/tanuprabhu/linear-regression-dataset?resource=download

After downloading it, I loaded it like this:

import pandas

p = pandas.read_csv('linear_regression_dataset.csv') 

l = LinearRegression(x= p['X'] , y= p['Y'])
print(l.gradient_descent())

But it's giving me [-568.1905905426412, -2.833321633515304], which is definitely not accurate.

I want to implement the algorithm without external modules like scikit-learn, for learning purposes.

I tested the calculate_error_cost function and it worked as expected, and I don't think there is an error in the calculate_predictions function either.
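For example, a quick hand check with tiny made-up inputs (using the class above) returns the cost I expect:

import numpy as np

# y = [1, 2] and predictions [2, 2] give squared errors [1, 0],
# so the cost should be (1 / (2 * 2)) * 1 = 0.25
lr = LinearRegression(x=np.array([0.0, 1.0]), y=np.array([1.0, 2.0]))
print(lr.calculate_error_cost([2.0, 2.0]))  # prints 0.25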


Solution

  • One small problem you have is that you are returning the last values of w and b, when you should be returning the second-to-last parameters (because they yield a lower cost). This should not really matter that much... unless your learning rate is too high and you are immediately getting a higher value for the cost function on the second iteration. This I believe is your real problem, judging from the dataset you shared.

    The algorithm does work on the dataset, but you need to lower the learning rate. I ran it in the example below and it gave the result shown in the image. One caveat is that I added a limit on the number of iterations, to keep the algorithm from running for a long time while only marginally improving the result. A quick sanity check is also sketched after the plot below.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    
    class LinearRegression:
        def __init__(self , x:np.ndarray ,y:np.ndarray):
            self.x = x
            self.m = len(x)
            self.y = y
    
        def calculate_predictions(self ,slope:int , y_intercept:int) -> np.ndarray: # Calculate y hat.
            predictions = []
    
            for x in self.x:
                predictions.append(slope * x + y_intercept)
    
            return predictions
    
        def calculate_error_cost(self , y_hat:np.ndarray) -> int:
            error_valuse = []
            for i in range(self.m):
                error_valuse.append((y_hat[i] - self.y[i] )** 2)
    
            error = (1/(2*self.m)) * sum(error_valuse)
        
            return error
        
        def gradient_descent(self):
            costs = []
    
            # initialization values        
            temp_w = 0
            temp_b = 0
            iteration = 0
            
            a = 0.00001 # Learning rate
    
            while iteration < 1000:
                y_hat = self.calculate_predictions(slope=temp_w , y_intercept= temp_b)
                
                sum_w = 0
                sum_b = 0
    
                for i in range(len(self.x)):
                    sum_w += (y_hat[i] - self.y[i] ) * self.x[i]
                    sum_b += (y_hat[i] - self.y[i] )
    
                w = temp_w - a * ((1/self.m) *sum_w)
                b = temp_b - a * ((1/self.m) *sum_b)
    
                costs.append(self.calculate_error_cost(y_hat))
    
                try:
                    if costs[-1] > costs[-2]: # Cost increased, so the previous parameters are the best found
                        print(costs)
                        return [temp_w,temp_b]
                except IndexError:
                    pass
    
                temp_w = w
                temp_b = b
                iteration += 1
                print(iteration)
    
            return [temp_w,temp_b]
    
    p = pd.read_csv('linear_regression_dataset.csv')
    
    x_data = p['X']
    y_data = p['Y']
    lin_reg = LinearRegression(x_data, y_data)
    y_hat = lin_reg.calculate_predictions(*lin_reg.gradient_descent())
    
    fig = plt.figure()
    plt.plot(x_data, y_data, 'r.', label='Data')
    plt.plot(x_data, y_hat, 'b-', label='Linear Regression')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()
    plt.show()
    

    [Image: Dataset and regression result]
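
    As a quick sanity check (just to verify the numbers, not as a replacement for your own implementation), you can compare the learned parameters against NumPy's closed-form least-squares fit: np.polyfit with degree 1 returns the slope and intercept directly, and the tuned gradient-descent result should land close to these values. The variable names below are only illustrative; something along these lines:

    import numpy as np
    import pandas as pd

    p = pd.read_csv('linear_regression_dataset.csv')

    # Closed-form least-squares fit of a straight line (degree-1 polynomial).
    # np.polyfit returns the coefficients highest power first: [slope, intercept].
    slope_ref, intercept_ref = np.polyfit(p['X'], p['Y'], 1)
    print(f"reference slope: {slope_ref:.4f}, intercept: {intercept_ref:.4f}")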