Tags: python, pytorch, loss-function, gradient-descent, sgd

How is loss_fn connected to the model and optimizer in PyTorch?


The following code is just a template; this pattern shows up in a lot of machine-learning code. I have a specific question about loss.backward(). In the code below we have a model, and because we pass model.parameters() to the optimizer, the optimizer and the model are clearly connected. But there is no visible connection between loss_fn and the model, or between loss_fn and the optimizer. So how exactly does loss.backward() work?

For example, suppose I add a second instance of MSELoss, loss_fn_2 = torch.nn.MSELoss(reduction='sum'), and do exactly the same thing: loss_2 = loss_fn_2(y_pred, y) followed by loss_2.backward().

How does PyTorch recognize that loss_2 is not related to the model and only loss is?

Consider a scenario where I have (model_a, loss_fn_a and optimizer_a) and (model_b, loss_fn_b and optimizer_b), and I want to keep the *_a objects isolated from the *_b objects.

import torch
import math


# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# Prepare the input tensor (x, x^2, x^3).
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use RMSprop; the optim package contains many other
# optimization algorithms. The first argument to the RMSprop constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-3
optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)
for t in range(2000):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(xx)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()


linear_layer = model[0]
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')

Solution

  • It may not look connected, but the loss is actually connected to the model. Think of a model as one large, usually non-linear, mathematical function. (For a simple picture, imagine a linear regression model, which is a function along the lines of y = mx + b.)

    Suppose the model looks something like this:

    x1 = torch.tensor(2.0, requires_grad=True, dtype=torch.float32)
    x2 = torch.tensor(3.0, requires_grad=True, dtype=torch.float32)
    x3 = torch.tensor(1.0, requires_grad=True, dtype=torch.float32)
    x4 = torch.tensor(4.0, requires_grad=True, dtype=torch.float32)
    
    z1 = x1 * x2
    z2 = x3 * x4
    
    f = z1 + z2
    
    f.backward()
    print(f'gradient of x1 = {x1.grad}')
    # output: gradient of x1 = 3.0
    # df/dx1 = x2 = 3, since f = x1 * x2 + x3 * x4
    

    The code above is a small version of the huge non-linear function that a neural network builds. The loss is simply the output of that function, just as f is the output of x1 * x2 + x3 * x4. While the forward pass runs, autograd records every operation that contributes to the loss, so calling loss.backward() walks back through that recorded graph and computes the gradient of the loss with respect to every tensor with requires_grad=True that was involved in computing it (PyTorch handles this bookkeeping internally).
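To make that connection visible, here is a minimal sketch. The tiny Linear model and the random data are stand-ins chosen for illustration, not the code from the question. The loss function object holds no reference to the model; the link lives in y_pred, which remembers the operations that produced it. That also means a second loss built from the same y_pred, like loss_2 in the question, is connected to the same model, and its gradients are added to the same .grad buffers.

```python
import torch

torch.manual_seed(0)

# Stand-in model and data (assumptions for this sketch).
model = torch.nn.Linear(3, 1)
loss_fn = torch.nn.MSELoss(reduction='sum')
loss_fn_2 = torch.nn.MSELoss(reduction='sum')  # a second, separate loss object

xx = torch.randn(8, 3)
y = torch.randn(8, 1)

# y_pred is produced from the model's parameters, so autograd records that link.
y_pred = model(xx)
loss = loss_fn(y_pred, y)
loss_2 = loss_fn_2(y_pred, y)

# Each loss carries a grad_fn node, the entry point into the recorded graph.
print(loss.grad_fn is not None, loss_2.grad_fn is not None)

# Both losses share the part of the graph that goes through the model, so the
# first backward call must keep that graph alive for the second one.
loss.backward(retain_graph=True)
grad_after_first = model.weight.grad.clone()

loss_2.backward()
grad_after_second = model.weight.grad.clone()

# Gradients accumulate: the two losses are identical here, so the second call
# should double what the first call stored.
print(torch.allclose(grad_after_second, 2 * grad_after_first))
```

Calling optimizer.zero_grad() between the two backward calls would clear the buffers instead of letting them add up.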

    So backward only affects the tensors that were involved in computing that particular loss value. As long as nothing from *_b takes part in computing the loss of *_a, the two sets are isolated from each other.
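As a rough check of that isolation, the sketch below builds two independent sets of objects and runs one training step on the *_a set only. The Linear models, the SGD optimizers, and the random data are assumptions made for this example; the names model_a, model_b and so on follow the question.

```python
import torch

torch.manual_seed(0)

# Shared input data is fine: plain tensors without requires_grad carry no graph.
xx = torch.randn(16, 3)
y = torch.randn(16, 1)

model_a = torch.nn.Linear(3, 1)
model_b = torch.nn.Linear(3, 1)
loss_fn_a = torch.nn.MSELoss(reduction='sum')
loss_fn_b = torch.nn.MSELoss(reduction='sum')
optimizer_a = torch.optim.SGD(model_a.parameters(), lr=1e-3)
optimizer_b = torch.optim.SGD(model_b.parameters(), lr=1e-3)

weight_b_before = model_b.weight.detach().clone()

# One training step that involves only the *_a objects.
optimizer_a.zero_grad()
loss_a = loss_fn_a(model_a(xx), y)
loss_a.backward()
optimizer_a.step()

# model_a received gradients; model_b was never part of loss_a's graph,
# so it has no gradients and its weights are unchanged.
a_has_grad = model_a.weight.grad is not None
b_has_grad = model_b.weight.grad is not None
b_unchanged = torch.equal(model_b.weight, weight_b_before)
print(a_has_grad, b_has_grad, b_unchanged)
```

The isolation breaks only if the two graphs are mixed, for example by feeding the output of model_b into model_a without detaching it, or by handing parameters of model_b to optimizer_a.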