deep-learning, neural-network, pytorch, linear-regression

Neural network learning to sum two numbers


I am learning PyTorch, and I am trying to implement a really simple network that takes an input of length 2, i.e. a point in the plane, and learns to output the sum of its components.

In principle the network should just learn a linear layer with weight matrix W = [1., 1.] and zero bias, so I expect very low training error. However, that is not what happens, and I don't see why.
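
For reference, a quick sanity check (separate from my training script below) confirms that a Linear(2, 1) layer with exactly those parameters reproduces the sum:

import torch
from torch import nn

layer = nn.Linear(2, 1)
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[1., 1.]]))  # W = [1., 1.]
    layer.bias.zero_()                            # b = 0
    print(layer(torch.tensor([[0.3, 0.7]])))      # tensor([[1.]]) = 0.3 + 0.7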

This is the code I have written:

import torch
from torch import nn, optim
import numpy as np

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

N = 1000  # number of samples
D = 2  # input dimension
C = 1  # output dimension

def model(z):
  q = z[:,0]
  p = z[:,1]
  return q+p

X = torch.rand(N, D, requires_grad=True).to(device)
y = model(X)

lr = 1e-2 #Learning rate

Rete = nn.Sequential(nn.Linear(D, C))
Rete.to(device) #Convert to CUDA

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(Rete.parameters(), lr=lr)

for t in range(5000):

    y_pred = Rete(X)
    loss = criterion(y_pred, y)
    print("[EPOCH]: %i, [LOSS]: %.6f" % (t, loss.item()))    
    optimizer.zero_grad()
    optimizer.step()

Solution

  • There are 2 problems.

    The first problem is that you forgot to backpropagate the loss:

    optimizer.zero_grad()
    loss.backward() # you forgot this step
    optimizer.step()
    

    It is important that optimizer.zero_grad() is NOT placed between loss.backward() and optimizer.step(); otherwise you would be resetting the gradients before the optimizer gets to use them. In general, it is advisable to put optimizer.zero_grad() either at the very beginning of your training loop or right after you call optimizer.step():

    loss.backward()
    optimizer.step()
    optimizer.zero_grad() # this should go right after .step(), or at the very beginning of your training loop
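
    The reason the order matters is that gradients in PyTorch accumulate: each call to loss.backward() adds to the parameters' .grad buffers, and optimizer.zero_grad() is what clears them. Here is a minimal sketch on a toy tensor (not part of your model) showing the accumulation:

    import torch

    w = torch.ones(2, requires_grad=True)
    loss = (w.sum() - 1.0) ** 2

    loss.backward(retain_graph=True)
    print(w.grad)    # tensor([2., 2.])
    loss.backward()  # second backward without zeroing: gradients add up
    print(w.grad)    # tensor([4., 4.])
    w.grad.zero_()   # this is what optimizer.zero_grad() does for every parameter
    print(w.grad)    # tensor([0., 0.])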
    

    Even after this change, you'll notice that your model still doesn't converge:

    [EPOCH]: 0, [LOSS]: 0.232405
    [EPOCH]: 1, [LOSS]: 0.225010
    [EPOCH]: 2, [LOSS]: 0.218473
    ...
    [EPOCH]: 4997, [LOSS]: 0.178762
    [EPOCH]: 4998, [LOSS]: 0.178762
    [EPOCH]: 4999, [LOSS]: 0.178762
    

    This leads us to the second problem: the shapes of the output (y_pred) and the labels (y) do not match. y_pred has shape (N, C) but y has shape (N,). To fix this, just reshape y to match y_pred:

    y = y.reshape(-1, C)
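
    The reason the mismatch hurts is broadcasting: given a (N, 1) prediction and a (N,) target, the loss broadcasts them to (N, N), so every prediction ends up being compared against every label. A small sketch with toy sizes (not your actual data) makes the shapes visible:

    import torch

    y_pred = torch.zeros(4, 1)  # (N, C) with N=4, C=1
    y = torch.zeros(4)          # (N,)

    print((y_pred - y).shape)                 # torch.Size([4, 4]) -- broadcast!
    print((y_pred - y.reshape(-1, 1)).shape)  # torch.Size([4, 1]) -- what we want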
    

    Then our model will converge:

    [EPOCH]: 0, [LOSS]: 1.732189
    [EPOCH]: 1, [LOSS]: 1.680017
    [EPOCH]: 2, [LOSS]: 1.628712
    ...
    [EPOCH]: 4997, [LOSS]: 0.000000
    [EPOCH]: 4998, [LOSS]: 0.000000
    [EPOCH]: 4999, [LOSS]: 0.000000
    

    Both of these bugs fail silently, which makes them hard to debug. Unfortunately, bugs like these are very easy to run into when doing machine learning. I highly recommend reading this blog post on best practices when training neural networks to minimize the risk of silent bugs.
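
    One cheap safeguard along those lines (my own suggestion, not from the post) is to assert the shapes you expect right before computing the loss, so a mismatch fails loudly instead of silently broadcasting. A small sketch with a hypothetical checked_loss helper:

    import torch

    criterion = torch.nn.MSELoss()

    def checked_loss(y_pred, y):
        assert y_pred.shape == y.shape, f"shape mismatch: {y_pred.shape} vs {y.shape}"
        return criterion(y_pred, y)

    # checked_loss(torch.zeros(8, 1), torch.zeros(8))  # would raise AssertionError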


    Full code:

    import torch
    import numpy as np
    
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    
    N = 1000  # number of samples
    D = 2  # input dimension
    C = 1  # output dimension
    
    X = torch.rand(N, D).to(device)  # (N, D)
    y = torch.sum(X, axis=-1).reshape(-1, C)  # (N, C)
    
    lr = 1e-2  # Learning rate
    
    model = torch.nn.Sequential(torch.nn.Linear(D, C))  # model
    model.to(device)
    
    criterion = torch.nn.MSELoss()  # loss function
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer
    
    for epoch in range(1000):
        y_pred = model(X)  # forward step
        loss = criterion(y_pred, y)  # compute loss
        loss.backward()  # backprop (compute gradients)
        optimizer.step()  # update weights (gradient descent step)
        optimizer.zero_grad()  # reset gradients
        if epoch % 50 == 0:
            print(f"[EPOCH]: {epoch}, [LOSS]: {loss.item():.6f}")
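
    As a final sanity check, you can print the learned parameters at the end of the script above; they should end up close to the analytical solution W = [1., 1.] and b = 0 mentioned in the question:

    print(model[0].weight)  # roughly tensor([[1., 1.]])
    print(model[0].bias)    # roughly tensor([0.])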