machine-learning, pytorch, gradient-descent

Why is my sigmoid layer blocking gradients?


import torch
import torch.optim as optim
import torch.nn as nn

input = torch.tensor([1.,2.], requires_grad=True)
sigmoid = nn.Sigmoid()

interm = sigmoid(input)

optimizer = optim.SGD([input], lr=1, momentum=0.9)

for epoch in range(5):
    optimizer.zero_grad()
    loss = torch.linalg.vector_norm(interm - torch.tensor([2.,2.]))
    print(epoch, loss, input, interm)

    loss.backward(retain_graph=True)
    optimizer.step()
    print(interm.grad)

I created this simplified example, in which an input goes through a sigmoid as an intermediate activation function.

I am trying to find the input that results in interm = [2., 2.].

But the gradients are not passing through. Does anyone know why?


Solution

  • Autograd only populates .grad for leaf tensors. In your example, input is a leaf tensor, while interm is not.

    When you try to access interm.grad, you should get the following warning:

    UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:486.)

    This is because gradients flow back through interm to the leaf tensor input, but only input keeps its gradient in .grad. You can call interm.retain_grad() if you also want the gradient stored for interm.
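
    To see the leaf/non-leaf distinction in isolation, here is a minimal sketch. The names x and y are placeholders rather than anything from your code, and y.sum() is used only as a convenient scalar to call backward() on.

```python
import torch

# x is created directly by the user, so it is a leaf tensor.
x = torch.tensor([1., 2.], requires_grad=True)
# y is the result of an operation on x, so it is a non-leaf tensor.
y = torch.sigmoid(x)
print(x.is_leaf, y.is_leaf)

# Ask autograd to keep the gradient of the non-leaf tensor as well.
y.retain_grad()

# Reduce to a scalar so backward() needs no explicit gradient argument.
y.sum().backward()

print(x.grad)  # d(sum)/dx = sigmoid(x) * (1 - sigmoid(x))
print(y.grad)  # d(sum)/dy = 1 for each element
```

    Without the retain_grad() call, x.grad would still be filled in, but y.grad would stay None and accessing it would trigger the warning quoted above.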

    However, even with that change, nothing in your example causes the value of interm to change. Each optimizer step updates input, but interm was computed once before the loop and is never recomputed. If you want interm to follow input, recompute it in every iteration from the new input value, for example:

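    for epoch in range(5):
        optimizer.zero_grad()
        # Recompute the forward pass so interm reflects the current input.
        interm = sigmoid(input)
        # Keep the gradient of this non-leaf tensor so it can be printed below.
        interm.retain_grad()
        loss = torch.linalg.vector_norm(interm - torch.tensor([2.,2.]))
        print(epoch, loss, input, interm)

        # The graph is rebuilt every iteration, so retain_graph=True is not needed.
        loss.backward()
        optimizer.step()
        print(interm.grad)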
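
    The staleness itself can be checked with a small sketch like the one below. It assumes plain SGD without momentum, and the names x, y, y_before and y_new are made up for illustration.

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
optimizer = torch.optim.SGD([x], lr=1.0)

# Computed once, outside any loop, as in the question.
y = torch.sigmoid(x)
y_before = y.detach().clone()

loss = torch.linalg.vector_norm(y - torch.tensor([2., 2.]))
loss.backward()
optimizer.step()  # updates x in place

# x has moved, but y still holds the values computed from the old x.
print(x)
print(torch.equal(y.detach(), y_before))

# Only a fresh forward pass reflects the updated x.
y_new = torch.sigmoid(x)
print(y_new)
```

    A tensor produced by an operation is a snapshot of that computation, so it has to be recomputed whenever its inputs change.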
    

    There's also a more fundamental problem with what you are trying to do. You say you want the input that results in interm = [2., 2.], but you compute interm = sigmoid(input). The sigmoid function only takes values in the open interval (0, 1), so no input can produce interm = [2., 2.]: 2 is outside its range. If you ran your optimization loop indefinitely, input would keep growing toward [inf, inf] and interm would approach [1., 1.], with ever smaller updates as the sigmoid saturates and its gradient shrinks toward zero.
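
    The range limit is easy to confirm numerically. The sketch below also shows that, for a target inside (0, 1), the matching input can be obtained directly with torch.logit, the inverse of the sigmoid. The target values [0.2, 0.9] are made up for illustration and are not from your question.

```python
import torch

# Sigmoid outputs stay within [0, 1] in floating point, however large the input.
x = torch.tensor([-100., 0., 10., 100.])
print(torch.sigmoid(x))

# A reachable target (hypothetical values chosen for illustration).
target = torch.tensor([0.2, 0.9])
# logit is the inverse of sigmoid on (0, 1): logit(p) = log(p / (1 - p)).
x_star = torch.logit(target)
print(x_star, torch.sigmoid(x_star))
```

    If the real goal is to reach values outside (0, 1), the sigmoid would have to be replaced or rescaled. With a target inside that interval, the same gradient descent loop should be able to converge.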