python machine-learning linear-regression gradient-descent supervised-learning

Prediction line underfits the training data


I have a question about how to update w and b in linear regression.

Even after training for more iterations, the resulting w and b do not seem to fit the training set. I'm not sure what I did wrong in the code.

Here is my code, broken into parts:

1. Read and normalize the data

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('ML/cleaned_data_utf8.csv')
df.head()
df.dtypes
rmv_list = ['$',',']

# Strip currency symbols and thousands separators so the columns can be parsed as numbers
for i in rmv_list:
    df['budget']=df['budget'].str.replace(i,'',regex=False)
    df['movie_gross_domestic']=df['movie_gross_domestic'].str.replace(i,'',regex=False)
    df['movie_gross_worldwide']=df['movie_gross_worldwide'].str.replace(i,'',regex=False)

# Convert to float and rescale to billions of dollars
df['budget']=df['budget'].astype(float)/(10**9)
df['movie_gross_domestic']=df['movie_gross_domestic'].astype(float)/(10**9)
df['movie_gross_worldwide']=df['movie_gross_worldwide'].astype(float)/(10**9)
    
df.head()
df.dtypes

2. Compute the total gross and show a sample of the dataframe

sum_gross = df['movie_gross_domestic'] + df['movie_gross_worldwide']
df['Total_gross'] = sum_gross
df.head()

3. Define the gradient descent step and the loop that updates w and b

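def gradient_descent(w,b,list_x,list_y,alpha,current_index):
    # Perform one batch gradient descent step and return the updated (w, b)
    diff_w = 0
    diff_b = 0
    training_size = len(list_x)

    # Accumulate the gradient terms over the whole training set
    for i in range(training_size):
        f_of_wb = w * list_x[i]+ b 
        diff_w_i = (f_of_wb - list_y[i]) * list_x[i]
        diff_b_i = (f_of_wb - list_y[i])
        diff_w += diff_w_i
        diff_b += diff_b_i

    # Update both parameters using the averaged gradients
    w = w-(alpha*diff_w)*(1/training_size)
    b = b-(alpha*diff_b)*(1/training_size)

    # Mean squared error with the updated parameters
    sigma = 0
    for i in range(training_size):
        sigma += (list_y[i]-(w*list_x[i]+b))**2

    loss = sigma/training_size
    # Report the loss every 1000 iterations
    if current_index%1000 == 0:
        print(loss)

    return (w,b)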

def update_w_b(num_loop,w,b,alpha,list_x,list_y):
    current_index = 0
    for i in range(num_loop):
        (w,b) = gradient_descent(w,b,list_x,list_y,alpha,current_index)
        current_index += 1
        
    return (w,b)
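
For reference, this is the per-iteration update that gradient_descent applies, written out as equations. Here m stands for training_size and alpha for the learning rate; the symbols are my own notation, not something taken from a library. Both updates use the values of w and b from before the step, and the printed loss is evaluated with the updated values.

```latex
\begin{aligned}
f_{w,b}(x_i) &= w x_i + b \\
w &\leftarrow w - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \bigl(f_{w,b}(x_i) - y_i\bigr)\, x_i \\
b &\leftarrow b - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \bigl(f_{w,b}(x_i) - y_i\bigr) \\
\text{loss} &= \frac{1}{m} \sum_{i=1}^{m} \bigl(y_i - (w x_i + b)\bigr)^2
\end{aligned}
```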

4. Train the prediction line from initial values

w,b = update_w_b(num_loop = 10000 ,w = 2 ,b = 0 ,alpha=1.5,list_x = list(df['budget']),list_y = list(df['Total_gross']))

5. Plot the prediction line against the training data (the line underfits)

x_axis = df['Total_gross']
y_axis = df['budget']

plt.xlabel("Total_gross($)")
plt.ylabel("budget($)")
plt.title("Relation between movie budget vs movie gross")

# line plot
y_predict = [round(w,2)*i/10 + round(b,2) for i in range(5)]
x_predict = [i/10 for i in range(5)]

# plot
plt.scatter(x_axis,y_axis,s=4)
plt.plot(x_predict,y_predict)
print(list(map(lambda x : round(x,2),y_predict)))
print(x_predict)
# show plot
plt.show()

print(round(w,2),round(b,2))

Solution

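  • You trained the weights to predict Total_gross from budget (list_x is budget, list_y is Total_gross), but the plot assigns Total_gross to x_axis and budget to y_axis. The scatter points and the prediction line are therefore drawn with swapped axes, which makes the fit look wrong. Put budget on the x-axis and Total_gross on the y-axis, and swap the axis labels to match, so the data and the line share the same coordinates.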
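
A minimal sketch of a plot with consistent axes, assuming w and b come from the training step in the question. The helper names prediction_line and plot_fit are hypothetical, and the small lists at the bottom are made-up stand-in values, not the movie dataset. The key point is that the feature used as list_x during training goes on the x-axis for both the scatter and the line.

```python
def prediction_line(xs, w, b):
    """Return the two end points of the fitted line y = w*x + b over the range of xs."""
    x_line = [min(xs), max(xs)]
    y_line = [w * x + b for x in x_line]
    return x_line, y_line


def plot_fit(xs, ys, w, b):
    """Scatter the training data and draw the fitted line on the same axes."""
    # Imported here so prediction_line can be used without matplotlib installed
    import matplotlib.pyplot as plt

    x_line, y_line = prediction_line(xs, w, b)
    plt.scatter(xs, ys, s=4)   # x = training input, y = training target
    plt.plot(x_line, y_line)   # line drawn in the same coordinates
    plt.xlabel("budget (billions of $)")
    plt.ylabel("Total_gross (billions of $)")
    plt.title("Movie budget vs. total gross")
    plt.show()


# Stand-in values for illustration only (not taken from the dataset)
example_budget = [0.0, 0.125, 0.25]
example_w, example_b = 2.0, 0.5
x_line, y_line = prediction_line(example_budget, example_w, example_b)
print(x_line, y_line)
```

With the question's data, the call would look like plot_fit(list(df['budget']), list(df['Total_gross']), w, b). Drawing the line from the minimum to the maximum budget also avoids hard-coding the range(5) grid, which may not cover the data.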