I am learning PyTorch, and I am trying to implement a really simple network that takes an input of length 2, i.e. a point in the plane, and learns the sum of its components.
In principle the network should just learn a linear layer with weight matrix W = [1., 1.] and zero bias, so I expect very low training error. However, that is not what I get, and I don't see why.
The code I am writing is this:
import torch
from torch import nn, optim
import numpy as np

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

N = 1000  # number of samples
D = 2     # input dimension
C = 1     # output dimension

def model(z):
    q = z[:, 0]
    p = z[:, 1]
    return q + p

X = torch.rand(N, D, requires_grad=True).to(device)
y = model(X)

lr = 1e-2  # learning rate

Rete = nn.Sequential(nn.Linear(D, C))
Rete.to(device)  # convert to CUDA

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(Rete.parameters(), lr=lr)

for t in range(5000):
    y_pred = Rete(X)
    loss = criterion(y_pred, y)
    print("[EPOCH]: %i, [LOSS]: %.6f" % (t, loss.item()))
    optimizer.zero_grad()
    optimizer.step()
There are two problems.
The first problem is that you forgot to backpropagate the loss:
optimizer.zero_grad()
loss.backward() # you forgot this step
optimizer.step()
It is important that optimizer.zero_grad() is NOT placed between loss.backward() and optimizer.step(), otherwise you'll be resetting the gradients before performing the gradient descent step. In general, it is advisable to put optimizer.zero_grad() either at the very beginning of your training loop or right after you call optimizer.step():
loss.backward()
optimizer.step()
optimizer.zero_grad() # this should go right after .step(), or at the very beginning of your training loop
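For completeness, the other recommended placement, with optimizer.zero_grad() at the very top of the loop, would look like this (just a sketch of the loop ordering, using the same names as in the question):

for t in range(5000):
    optimizer.zero_grad()        # clear gradients left over from the previous iteration
    y_pred = Rete(X)             # forward pass
    loss = criterion(y_pred, y)  # compute loss
    loss.backward()              # backprop (compute gradients)
    optimizer.step()             # gradient descent step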
Even after this change, you'll notice that your model still doesn't converge:
[EPOCH]: 0, [LOSS]: 0.232405
[EPOCH]: 1, [LOSS]: 0.225010
[EPOCH]: 2, [LOSS]: 0.218473
...
[EPOCH]: 4997, [LOSS]: 0.178762
[EPOCH]: 4998, [LOSS]: 0.178762
[EPOCH]: 4999, [LOSS]: 0.178762
This leads us to the second problem: the shapes of the output (y_pred) and the labels (y) do not match. y_pred has shape (N, C), but y has shape (N,). To fix this, just reshape y to match y_pred:
y = y.reshape(-1, C)
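More generally, a quick shape check right before computing the loss (just an illustrative sanity check, not required) catches this kind of mismatch early, since it otherwise fails silently through broadcasting:

assert y_pred.shape == y.shape, f"shape mismatch: {y_pred.shape} vs {y.shape}"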
Then our model will converge:
[EPOCH]: 0, [LOSS]: 1.732189
[EPOCH]: 1, [LOSS]: 1.680017
[EPOCH]: 2, [LOSS]: 1.628712
...
[EPOCH]: 4997, [LOSS]: 0.000000
[EPOCH]: 4998, [LOSS]: 0.000000
[EPOCH]: 4999, [LOSS]: 0.000000
Both of these bugs fail silently, which makes debugging them difficult. Unfortunately, these kinds of bugs are very easy to come across when doing machine learning. I highly recommend reading this blog post on best practices when training neural networks to minimize the risk of silent bugs.
Full code:
import torch
import numpy as np
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
N = 1000 # number of samples
D = 2 # input dimension
C = 1 # output dimension
X = torch.rand(N, D).to(device) # (N, D)
y = torch.sum(X, dim=-1).reshape(-1, C)  # (N, C)
lr = 1e-2 # Learning rate
model = torch.nn.Sequential(torch.nn.Linear(D, C)) # model
model.to(device)
criterion = torch.nn.MSELoss() # loss function
optimizer = torch.optim.Adam(model.parameters(), lr=lr) # optimizer
for epoch in range(1000):
    y_pred = model(X)            # forward step
    loss = criterion(y_pred, y)  # compute loss
    loss.backward()              # backprop (compute gradients)
    optimizer.step()             # update weights (gradient descent step)
    optimizer.zero_grad()        # reset gradients

    if epoch % 50 == 0:
        print(f"[EPOCH]: {epoch}, [LOSS]: {loss.item():.6f}")