Tags: python, pytorch, derivative, automatic-differentiation

Derivative of Scalar Expansion in PyTorch


I'm working on implementing a very simple autodiff library in Rust to deepen my understanding of how it's done. I have almost everything working, but when implementing negative log-likelihood, I realized that I have some confusion about how to handle the derivative for the following scenario (I've written it in PyTorch below).

x = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)
y = x - torch.sum(x)

I've looked around, experimented, and am still a little confused about what is actually happening here. I know that the derivative with respect to x of the equation above is [-2, -2, -2], but there are a number of ways to get there, and when I expand the equation to the following:

x = torch.tensor([1, 2, 3], dtype=torch.float32, requires_grad=True)
y = torch.exp(x - torch.sum(x))

I am completely lost and have no idea how it derived the gradients for x.

I'm assuming the above equations are being rewritten to something like this:

y = (x - [torch.sum(x), torch.sum(x), torch.sum(x)])

but I'm not sure, and I'm really struggling to find information on the topic of scalars being expanded to vectors (or whatever is actually happening). If someone could point me in the right direction, that would be awesome!

If helpful, I can include the gradients PyTorch computes for the above equations.
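
For reference, this is how I checked the [-2, -2, -2] result for the first equation (`y.backward` needs an explicit upstream gradient here since `y` is not a scalar; I used all ones):

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = x - torch.sum(x)
# y is a vector, so backward() needs upstream gradients; use all ones.
y.backward(torch.ones_like(y))
print(x.grad)  # tensor([-2., -2., -2.])
```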


Solution

  • Your code won't work in PyTorch without modification because it doesn't specify what the gradients w.r.t. y are. You need them to call y.backward(), which computes the gradients w.r.t. x. From your all-(-2) result, I figured the upstream gradients must be all ones.

    The "scalar expansion" is called broadcasting. As you already know, broadcasting is performed whenever two tensor operands have mismatched shapes. My guess is that it is implemented in the same way as any other operation in PyTorch: as an operation that knows how to compute the gradients w.r.t. its inputs given the gradients w.r.t. its outputs. A simple example is given below which (a) works with your given test case and (b) still lets us use PyTorch's autograd to compute the gradients automatically (see also PyTorch's docs on extending autograd):

    from typing import Tuple

    import torch


    class Broadcast(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x: torch.Tensor, length: int) -> torch.Tensor:
            # Expand a scalar tensor into a vector of the given length.
            assert x.ndim == 0, "input must be a scalar tensor"
            assert length > 0, "length must be greater than zero"
            return x.new_full((length,), x.item())

        @staticmethod
        def backward(ctx, grad: torch.Tensor) -> Tuple[torch.Tensor, None]:
            # The gradient w.r.t. the broadcast scalar is the sum of the
            # incoming gradients; the length argument gets no gradient.
            return grad.sum(), None

    Now, by setting broadcast = Broadcast.apply, we can perform the broadcasting ourselves instead of letting PyTorch do it automatically.

    x = torch.tensor([1., 2., 3.], requires_grad=True)
    y = x - broadcast(torch.sum(x), x.size(0))
    y.backward(torch.ones_like(y))
    assert torch.allclose(torch.tensor(-2.), x.grad)
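
    Your second equation works the same way: broadcasting's backward sums the incoming gradients, and the chain rule handles the exp. As a sanity check (a sketch using only built-in broadcasting), since y_i = exp(x_i - sum(x)), we have dy_i/dx_j = y_i * (delta_ij - 1), so with an all-ones upstream gradient the gradient at x_j is y_j - sum(y):

    ```python
    import torch

    x = torch.tensor([1., 2., 3.], requires_grad=True)
    y = torch.exp(x - torch.sum(x))
    y.backward(torch.ones_like(y))

    # Analytic check: dy_i/dx_j = y_i * (delta_ij - 1), so with an
    # all-ones upstream gradient, grad_j = y_j - sum_i(y_i).
    with torch.no_grad():
        expected = y - y.sum()
    assert torch.allclose(x.grad, expected)
    ```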
    

    Note that I don't know how PyTorch actually implements it. The implementation above is just to illustrate how the broadcasting operation might be written for automatic differentiation to work, which hopefully answers your question.