I understand the zig-zag nature of the cost function when applying gradient descent, but what bothers me is that the cost starts out at a low 300 and ends up at 1600. The cost oscillates between 300 and 4000 before settling at 1600. I thought I should end up with a number around 300 or lower. I have tried changing the learning rate, but it still takes me to 1600. I expect a cost of around 300, not one that grows.
Data:
square_feet = [1661.0, 871.0, 1108.0, 1453.0, 1506.0, 1100.0, 1266.0, 1514.0, 948.0, 1878.0, 1522.0, 931.0, 1475.0, 1177.0, 1844.0, 1469.0, 2155.0, 967.0, 1092.0]
prices = [1350.0, 489.0, 589.0, 539.0, 775.0, 575.0, 749.0, 795.0, 644.9, 590.0, 575.0, 699.0, 999.0, 775.0, 599.0, 599.0, 895.0, 550.0, 849.0]
Both of those lists are Pandas Series in the original code, but I have converted them to lists here for clarity.
Main:
# Set starting weight and bias
w_init = 5e-1  # Increase in price for every square foot
b_init = 200   # Starting price for the cheapest houses

# Iterations and learning rate for the gradient descent algorithm
iterations = 10000
alpha = 1.0e-6  # Delicate; causes divergence if set too large

w_final, b_final, J_hist, p_hist = gradient_descent(
    square_feet, prices, w_init, b_init, alpha, iterations)

print(f'w: {w_final}, b: {b_final}, Costs: {J_hist}, Weight and Bias: {p_hist}')
Functions:
# Cost function to determine the cumulative error between real and predicted values
def cost_function(x, y, w, b):
    # 1) Number of training examples
    m = x.size
    cost = 0
    # 2) Loop over the training examples and accumulate the cost per example
    for i in range(m):
        y_hat = w * x[i] + b
        cost += (y_hat - y[i]) ** 2
        cost /= 2 * m
    # 3) Return total cost
    return cost
# Compute the gradient, i.e., the partial derivatives used to update the weight and bias
def gradient_function(x, y, w, b):
    m = x.size
    # Partial derivatives of the cost function with respect to weight and bias
    dj_dw = 0
    dj_db = 0
    for i in range(m):
        y_hat = w * x[i] + b
        dj_dw_i = (y_hat - y[i]) * x[i]
        dj_db_i = (y_hat - y[i])
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw /= m
    dj_db /= m
    return dj_dw, dj_db
def gradient_descent(x, y, w_init, b_init, learning_rate, num_iters):
    # Used for graphing
    J_history = []
    p_history = []
    b = b_init
    w = w_init
    for i in range(num_iters):
        dj_dw, dj_db = gradient_function(x, y, w, b)  # Gradient
        # Update weight and bias
        w -= learning_rate * dj_dw
        b -= learning_rate * dj_db
        # Prevents resource exhaustion; no need to store costs past 100000 iterations
        if i < 100000:
            J_history.append(cost_function(x, y, w, b))
            p_history.append([w, b])
    return w, b, J_history, p_history
I'm stumped over this issue.
In your cost_function, the averaging step (cost /= 2 * m) sits inside the loop, which produces a cost value that is roughly m times smaller than it should be, so the cost values you are tracking never converge the way you expect. You should sum over all the examples first and divide only once, after the loop.
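For reference, what cost_function should compute is the ordinary mean squared error cost over all m examples,

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(w x_i + b - y_i\right)^2,$$

so the division by 2m has to happen exactly once, after the sum: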
def cost_function(x, y, w, b):
    # 1) Number of training examples
    m = x.size
    cost = 0
    # 2) Loop over the training examples and accumulate the cost per example
    for i in range(m):
        y_hat = w * x[i] + b
        cost += (y_hat - y[i]) ** 2
    # Average only once, after the sum over all examples
    cost /= 2 * m
    # 3) Return total cost
    return cost
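As a quick sanity check (just a sketch, assuming NumPy is available and your two lists are converted to NumPy arrays; cost_vectorized is a helper name I'm introducing here, not something from your code), you can compare the corrected loop against a vectorized version of the same formula — both should print the same number:

import numpy as np

# Posted data as NumPy arrays (so x.size and x[i] both work)
square_feet = np.array([1661.0, 871.0, 1108.0, 1453.0, 1506.0, 1100.0, 1266.0,
                        1514.0, 948.0, 1878.0, 1522.0, 931.0, 1475.0, 1177.0,
                        1844.0, 1469.0, 2155.0, 967.0, 1092.0])
prices = np.array([1350.0, 489.0, 589.0, 539.0, 775.0, 575.0, 749.0, 795.0,
                   644.9, 590.0, 575.0, 699.0, 999.0, 775.0, 599.0, 599.0,
                   895.0, 550.0, 849.0])

def cost_vectorized(x, y, w, b):
    # Same mean squared error cost, written without an explicit loop
    return np.sum((w * x + b - y) ** 2) / (2 * x.size)

print(cost_function(square_feet, prices, 5e-1, 200))    # corrected loop version
print(cost_vectorized(square_feet, prices, 5e-1, 200))  # should match

If you run the same check against your original version (with the division inside the loop), the number it reports is far smaller, which is exactly the roughly-m-times-smaller effect described above.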