machine-learning, gradient-descent, derivative, multivariate-testing

Multivariate Linear Regression using gradient descent


I am learning multivariate linear regression using gradient descent, and I have written the Python code below:

    import pandas as pd
    import numpy as np
    
    x1 = np.array([1,2,3,4,5,6,7,8,9,10],dtype='float64')  
    x2 = np.array([5,10,20,40,80,160,320,640,1280,2560],dtype='float64')
    y = np.array([350,700,1300,2400,4500,8600,16700,32800,64900,129000],dtype='float64')
    
    def multivar_gradient_descent(x1,x2,y):
        w1=w2=w0=0
        iteration=500
        n=len(x1)
        learning_rate=0.02
        
        for i in range(iteration):
            y_predicted = w1 * x1 + w2 * x2 +w0 
            cost = (1*(2/n))*float(sum((y_predicted-y)**2))  # cost function
            
            x1d = sum(x1*(y_predicted-y))/n  # derivative for feature x1
            x2d = sum(x2*(y_predicted-y))/n   # derivative for feature x2
            cd =  sum(1*(y-y_predicted))/n # derivative for bias
    
            w1 = w1 - learning_rate * x1d
            w2 = w2 - learning_rate * x2d
            w0 = w0 - learning_rate * cd
            print(f"Iteration {i}: a= {w1}, b = {w2}, c = {w0}, cost = {cost} ")
    
        return w1,w2, w0
    
    w1,w2,w0 = multivar_gradient_descent(x1,x2,y)
    w1,w2,w0

However, the cost kept getting higher and higher with every iteration until it became inf (output shown below). I have spent hours checking the formulas for the derivatives and the cost function, but I couldn't identify where the mistake is. I feel quite frustrated and hope someone can help me with this. Thank you.

Iteration 0: a= 4685.5, b = 883029.5, c = -522.5, cost = 4462002500.0 
Iteration 1: a= -81383008.375, b = -15430704757.735, c = 9032851.74, cost = 1.3626144151911089e+18 
Iteration 2: a= 1422228350500.3176, b = 269662832866446.66, c = -157855848816.2755, cost = 4.161440004246925e+26 
Iteration 3: a= -2.4854478828631716e+16, b = -4.712554891970221e+18, c = 2758646212375989.0, cost = 1.2709085355243152e+35 
Iteration 4: a= 4.343501644116814e+20, b = 8.235533749226551e+22, c = -4.820935671838988e+19, cost = 3.881369199171854e+43 
Iteration 5: a= -7.590586253095058e+24, b = -1.4392196523846473e+27, c = 8.424937075201089e+23, cost = 1.1853745914189544e+52 
Iteration 6: a= 1.326510368511469e+29, b = 2.5151414235959125e+31, c = -1.472319266480111e+28, cost = 3.620147555871397e+60 
Iteration 7: a= -2.3181737208386835e+33, b = -4.3953932745475034e+35, c = 2.5729854159139745e+32, cost = 1.105597202871857e+69 
Iteration 8: a= 4.051177832870898e+37, b = 7.681270666011396e+39, c = -4.496479874458965e+36, cost = 3.37650649906685e+77 
Iteration 9: a= -7.079729049644685e+41, b = -1.3423581317783506e+44, c = 7.857926879944079e+40, cost = 1.0311889455424087e+86 
Iteration 10: a= 1.2372343423113349e+46, b = 2.3458688442326932e+48, c = -1.3732300949746233e+45, cost = 3.1492628303921182e+94 
Iteration 11: a= -2.1621573467862958e+50, b = -4.099577083092681e+52, c = 2.3998198539580117e+49, cost = 9.617884692967256e+102 
Iteration 12: a= 3.7785278280657085e+54, b = 7.164310273158479e+56, c = -4.193860411686855e+53, cost = 2.937312982406619e+111 
Iteration 13: a= -6.603253259383672e+58, b = -1.2520155286691985e+61, c = 7.32907727374022e+57, cost = 8.970587433766233e+119 
Iteration 14: a= 1.1539667190934036e+63, b = 2.187988549158328e+65, c = -1.280809765026251e+62, cost = 2.739627659321216e+128 
Iteration 15: a= -2.0166410956339498e+67, b = -3.823669740212017e+69, c = 2.238308579532037e+66, cost = 8.366854196711946e+136 
Iteration 16: a= 3.524227554668779e+71, b = 6.682142046784112e+73, c = -3.9116076672823015e+70, cost = 2.5552468384109146e+145 
Iteration 17: a= -6.158844964518726e+75, b = -1.1677531106785476e+78, c = 6.835819994909099e+74, cost = 7.80375306142527e+153 
Iteration 18: a= 1.0763031248287995e+80, b = 2.0407338215081817e+82, c = -1.194609454154816e+79, cost = 2.3832751078395456e+162 
Iteration 19: a= -1.8809182942418207e+84, b = -3.5663313522046286e+86, c = 2.0876672425822773e+83, cost = 7.278549429920333e+170 
Iteration 20: a= 3.287042049772272e+88, b = 6.232424424816986e+90, c = -3.648350932258958e+87, cost = 2.2228773182554595e+179 
Iteration 21: a= -5.744345977200645e+92, b = -1.0891616727381027e+95, c = 6.375759629418162e+91, cost = 6.788692746528022e+187 
Iteration 22: a= 1.0038664004334024e+97, b = 1.9033895455483145e+99, c = -1.1142105462686083e+96, cost = 2.0732745270409844e+196 
Iteration 23: a= -1.7543298295730705e+101, b = -3.326312202113057e+103, c = 1.9471642809242535e+100, cost = 6.331804111587467e+204 
Iteration 24: a= 3.065819465220816e+105, b = 5.812973435628952e+107, c = -3.402811748286256e+104, cost = 1.9337402155196325e+213 
Iteration 25: a= -5.357743358678581e+109, b = -1.0158595498601174e+112, c = 5.946661977991267e+108, cost = 5.905664728753603e+221 
Iteration 26: a= 9.363047701635277e+113, b = 1.7752887338463183e+116, c = -1.0392225987316703e+113, cost = 1.8035967607506306e+230 
Iteration 27: a= -1.6362609478315793e+118, b = -3.102446680700735e+120, c = 1.816117367544431e+117, cost = 5.508205129817299e+238 
Iteration 28: a= 2.8594854738709632e+122, b = 5.421752091975047e+124, c = -3.1737976990896245e+121, cost = 1.6822121447766637e+247 
Iteration 29: a= -4.997159643830032e+126, b = -9.474907636509772e+128, c = 5.546443206127292e+125, cost = 5.13749512471037e+255 
Iteration 30: a= 8.732901332811723e+130, b = 1.655809288168471e+133, c = -9.692814462503292e+129, cost = 1.5689968853439082e+264 
Iteration 31: a= -1.5261382690222234e+135, b = -2.8936476258832726e+137, c = 1.6938900970034892e+134, cost = 4.791734427889445e+272 
Iteration 32: a= 2.667038052317318e+139, b = 5.056860498736353e+141, c = -2.960196619698286e+138, cost = 1.46340117318896e+281 
Iteration 33: a= -4.660843723593812e+143, b = -8.837232935670386e+145, c = 5.173159724337836e+142, cost = 4.4692439155775235e+289 
Iteration 34: a= 8.145164706926056e+147, b = 1.5443709783730996e+150, c = -9.040474323708519e+146, cost = 1.364912201990395e+298 
Iteration 35: a= -1.4234270024354842e+152, b = -2.698901043124031e+154, c = 1.5798888948493553e+151, cost = 4.168457471405497e+306 
Iteration 36: a= 2.487542614748579e+156, b = 4.716526626425798e+158, c = -2.760971195418877e+155, cost = inf 
Iteration 37: a= -4.347162341028204e+160, b = -8.24247464517401e+162, c = 4.824998749459281e+159, cost = inf 
Iteration 38: a= 7.596983588224419e+164, b = 1.4404326246286964e+167, c = -8.432037599998082e+163, cost = inf 
Iteration 39: a= -1.3276283495338805e+169, b = -2.517261181154549e+171, c = 1.473560135031107e+168, cost = inf 
Iteration 40: a= 2.32012747430196e+173, b = 4.399097705650062e+175, c = -2.5751539243057795e+172, cost = inf 

Solution

  • The issue here is that you initialized the weights to 0, as seen in w1=w2=w0=0.

    If all the weights are initialized to 0, the derivative of the loss function is the same for every w in W[l], so all the weights keep the same value in subsequent iterations.
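
    As a small illustration of that symmetry (a toy sketch, not taken from your code: a one-hidden-layer network with a sigmoid activation and a mean-squared-error loss), with zero-initialized weights every hidden unit produces the same output and receives the same gradient, so the units can never become different from one another:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy data: 4 samples, 3 features, scalar target.
    X = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.], [1., 0., 1.]])
    y = np.array([[1.], [0.], [1.], [0.]])

    # One hidden layer with 4 units, every weight initialized to zero.
    W1, b1 = np.zeros((3, 4)), np.zeros(4)
    W2, b2 = np.zeros((4, 1)), np.zeros(1)

    # Forward pass: every hidden unit outputs exactly 0.5.
    H = sigmoid(X @ W1 + b1)
    y_hat = H @ W2 + b2

    # Backward pass for the mean-squared-error loss.
    d_out = (y_hat - y) / len(X)
    dW2 = H.T @ d_out                 # identical gradient for every hidden unit
    dH = d_out @ W2.T                 # all zeros, because W2 is zero
    dW1 = X.T @ (dH * H * (1 - H))    # all zeros: hidden weights never move

    print(dW2.ravel())
    print(dW1)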

    Because of that, you will have to initialize the weights to random values.

    Weight initialization with a large random value:

    When the weights are initialized with very large values, the term np.dot(W,X)+b becomes very large as well, and if an activation function like sigmoid() is applied, it maps that value close to 1, where the slope of the gradient changes slowly and learning takes a long time.
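
    As a quick illustration (again just a sketch, not part of your code), the gradient of the sigmoid, sigmoid(z) * (1 - sigmoid(z)), is practically zero once z gets large, which is why oversized weights make learning crawl:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for z in [0.0, 2.0, 10.0, 50.0]:
        s = sigmoid(z)
        # At z = 50 the gradient is on the order of 1e-22: effectively no learning signal.
        print(f"z = {z:5.1f}  sigmoid = {s:.6f}  gradient = {s * (1 - s):.2e}")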

    There are many ways in which you could initialize the weights. For example, in Keras, the Dense, LSTM and CNN layers are all initialized with glorot_uniform by default, otherwise known as Xavier initialization.

    For your purposes, you can randomly initialize the weights using NumPy's random.randn, where l denotes a particular layer and n[l] the number of units in it. Note that randn draws the weights from a standard normal distribution (mean 0, standard deviation 1):

    # Specify the random seed value for reproducibility.
    np.random.seed(3)
    # n[l] is the number of units in layer l, so W[l] has shape (n[l], n[l-1]).
    W[l] = np.random.randn(n[l], n[l-1])
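
    For reference, glorot_uniform itself can be sketched in plain NumPy as below (fan_in and fan_out are the sizes of the previous and current layer; the numbers used here are just placeholders):

    import numpy as np

    # Glorot/Xavier uniform: sample from U(-limit, limit)
    # with limit = sqrt(6 / (fan_in + fan_out)).
    fan_in, fan_out = 10, 5     # placeholder layer sizes
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    W = np.random.uniform(-limit, limit, size=(fan_out, fan_in))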
    

    Another thing you should do is feature normalization as a preprocessing step: return a normalized version of the data in which each feature has mean 0 and standard deviation 1. This is often a good preprocessing step when working with learning algorithms, and it matters here because your x2 and y values span several orders of magnitude, so with unscaled features a learning rate of 0.02 makes the updates overshoot and the cost diverges.

    def featureNormalize(X):
        """
        Return a normalized copy of the dataset X (shape m x n) in which
        every feature (column) has mean 0 and standard deviation 1.
        """
        mu = np.mean(X, axis=0)       # per-feature mean
        sigma = np.std(X, axis=0)     # per-feature standard deviation
        X_norm = (X - mu) / sigma
        return X_norm
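
    Putting the two pieces together for the data in your question (a rough sketch, assuming the arrays and multivar_gradient_descent from your post are still defined; the target is standardized here as well so its scale matches the features):

    X = np.column_stack([x1, x2])             # shape (10, 2)
    X_norm = featureNormalize(X)              # each column: mean 0, std 1
    y_norm = (y - np.mean(y)) / np.std(y)     # rescale the target too

    # With the rescaled inputs the updates stay small, and the printed cost
    # now shrinks steadily instead of overflowing to inf.
    w1, w2, w0 = multivar_gradient_descent(X_norm[:, 0], X_norm[:, 1], y_norm)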