The following code is just a template; you see this pattern a lot in AI code. I have a specific question about loss.backward(). In the following code we have a model, and we pass model.parameters() to the optimizer, so optimizer and model are somehow connected. But there is no connection between loss_fn and model, or between loss_fn and optimizer. So how exactly does loss.backward() work?
I mean, suppose I add a new instance of MSELoss, like loss_fn_2 = torch.nn.MSELoss(reduction='sum'), to the code and do exactly the same thing: loss_2 = loss_fn_2(y_pred, y) and loss_2.backward(). How does PyTorch recognize that loss_2 is not related to model and only loss is related?
Also consider this scenario: I would like to have (model_a, loss_fn_a and optimizer_a) and (model_b, loss_fn_b and optimizer_b), and I would like to keep *_a and *_b isolated from each other.
import torch
import math
# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)
# Prepare the input tensor (x, x^2, x^3).
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)
# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)
loss_fn = torch.nn.MSELoss(reduction='sum')
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use RMSprop; the optim package contains many other
# optimization algorithms. The first argument to the RMSprop constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-3
optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)
for t in range(2000):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(xx)
    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to model
    # parameters.
    loss.backward()
    # Calling the step function on an Optimizer makes an update to its
    # parameters.
    optimizer.step()
linear_layer = model[0]
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')
Although it may seem like there is no connection, the loss is in fact connected to the model. Think of a model as one large mathematical, non-linear function (for a simple explanation, imagine a linear regression model, which is a function along the lines of y = mx + b).
Let's imagine a model something like:
import torch

# Four leaf tensors that require gradients (the "parameters" of this tiny function).
x1 = torch.tensor(2, requires_grad=True, dtype=torch.float16)
x2 = torch.tensor(3, requires_grad=True, dtype=torch.float16)
x3 = torch.tensor(1, requires_grad=True, dtype=torch.float16)
x4 = torch.tensor(4, requires_grad=True, dtype=torch.float16)
# Build the function (the computation graph): f = x1*x2 + x3*x4
z1 = x1 * x2
z2 = x3 * x4
f = z1 + z2
# Backpropagate from f to every tensor that was involved in computing it.
f.backward()
print(f'gradient of x1 = {x1.grad}')
# output: gradient of x1 = 3.0
# df/dx1 = x2 = 3
If you look at the code above, it is analogous to the huge non-linear function we construct with a neural network. Essentially, loss is the output of that function, just like f is the output of x1*x2 + x3*x4. So when you call loss.backward(), autograd walks back through that graph and computes the gradient of every tensor that was involved in producing the loss value (torch handles this internally for you).
So only the tensors that were involved in computing a given loss value get gradients from its backward() call. That also answers the loss_2 question: since loss_2 = loss_fn_2(y_pred, y) is computed from y_pred, which itself came from model, calling loss_2.backward() would in fact accumulate gradients into model's parameters too; the loss function object has no connection to anything, only the tensors it consumes matter. And as long as *_a and *_b never share any tensors, a backward() call on loss_a can never touch model_b, so the two setups are isolated from each other.
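Here is a minimal sketch of that isolated setup (the data, shapes and learning rate are made up purely for illustration): each loss only ever sees tensors produced by its own model, so each backward() call only fills in gradients for that model's parameters.

import torch

# Dummy data; what matters is which model produced the predictions, not the data itself.
x = torch.randn(8, 3)
y = torch.randn(8)

# Two completely separate model / loss / optimizer triplets.
model_a = torch.nn.Linear(3, 1)
model_b = torch.nn.Linear(3, 1)
loss_fn_a = torch.nn.MSELoss(reduction='sum')
loss_fn_b = torch.nn.MSELoss(reduction='sum')
optimizer_a = torch.optim.RMSprop(model_a.parameters(), lr=1e-3)
optimizer_b = torch.optim.RMSprop(model_b.parameters(), lr=1e-3)

loss_a = loss_fn_a(model_a(x).squeeze(-1), y)  # graph contains only model_a's tensors
loss_a.backward()                              # fills .grad only for model_a's parameters

print(model_a.weight.grad)  # a tensor of gradients
print(model_b.weight.grad)  # None: model_b was never part of loss_a's graph

optimizer_a.step()  # updates only the parameters it was given, i.e. model_a's

Computing loss_b = loss_fn_b(model_b(x).squeeze(-1), y) and calling loss_b.backward() would, in the same way, only populate gradients for model_b, and optimizer_b.step() would only ever touch model_b's parameters.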