machine-learning · pytorch · hessian

How to compute the Hessian matrix for all parameters in a network in PyTorch?


Suppose the vector \theta contains all the parameters of a neural network. I wonder how to compute the Hessian matrix for \theta in PyTorch, i.e. the matrix with entries H_{ij} = \frac{\partial^2 \mathrm{Loss}}{\partial \theta_i \, \partial \theta_j}.

Suppose the network is as follows:

import torch
from torch.nn import Module


class Net(Module):
    def __init__(self, h, w):
        super(Net, self).__init__()
        self.c1 = torch.nn.Conv2d(1, 32, 3, 1, 1)
        self.f2 = torch.nn.Linear(32 * h * w, 5)

    def forward(self, x):
        x = self.c1(x)
        x = x.view(x.size(0), -1)
        x = self.f2(x)
        return x

I know the second derivative can be calculated by calling torch.autograd.grad() twice, but the parameters in PyTorch are organized by net.parameters(), and I don't know how to compute the Hessian for all parameters at once.
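For reference, a minimal sketch of that double-grad approach (full_hessian is an illustrative helper name of mine; it builds the Hessian row by row and scales poorly with the number of parameters):

import torch

def full_hessian(loss, params):
    params = list(params)
    # first derivatives, with create_graph=True so they can be differentiated again
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    n = flat_grad.numel()
    hess = torch.zeros(n, n)
    for i in range(n):
        # second derivatives of the i-th gradient entry w.r.t. all parameters
        row = torch.autograd.grad(flat_grad[i], params,
                                  retain_graph=True, allow_unused=True)
        hess[i] = torch.cat([
            torch.zeros_like(p).reshape(-1) if r is None else r.reshape(-1)
            for r, p in zip(row, params)
        ])
    return hess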

I have tried torch.autograd.functional.hessian() in PyTorch 1.5 as follows:

import torch
import numpy as np
from torch.nn import Module
import torch.nn.functional as F


class Net(Module):
    def __init__(self, h, w):
        super(Net, self).__init__()
        self.c1 = torch.nn.Conv2d(1, 32, 3, 1, 1)
        self.f2 = torch.nn.Linear(32 * h * w, 5)

    def forward(self, x):
        x = self.c1(x)
        x = x.view(x.size(0), -1)
        x = self.f2(x)
        return x


def func_(a, b, c, d):
    # a, b: conv weight and bias; c, d: linear weight and bias
    p = [a, b, c, d]
    # random input batch and labels are sampled inside the function
    x = torch.randn(size=[8, 1, 12, 12], dtype=torch.float32)
    y = torch.randint(0, 5, [8])
    x = F.conv2d(x, p[0], p[1], 1, 1)
    x = x.view(x.size(0), -1)
    x = F.linear(x, p[2], p[3])
    loss = F.cross_entropy(x, y)
    return loss


if __name__ == '__main__':
    net = Net(12, 12)

    h = torch.autograd.functional.hessian(func_, tuple(net.parameters()))
    print(type(h), len(h))

h is a tuple, and the results have strange shapes. For example, the shape of \frac{\partial^2 \mathrm{Loss}}{\partial\, \mathrm{c1.weight}^2} is [32, 1, 3, 3, 32, 1, 3, 3]. It seems like I could combine them into a complete H, but I don't know which part of the whole Hessian matrix each tensor corresponds to, or in what order.
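For reference, the docs describe the return value as a tuple of tuples, where h[i][j] has the shape of inputs[i] followed by the shape of inputs[j]; a quick loop over the result from the snippet above makes that layout visible:

for i, row in enumerate(h):
    for j, block in enumerate(row):
        # block holds the second derivatives w.r.t. parameters i and j
        print(i, j, tuple(block.shape))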


Solution

  • Here is one solution. I think it's a little complex, but it could be instructive.

    Consider these points:

    1. First, the first argument of torch.autograd.functional.hessian() must be a function, and the second argument should be a tuple or list of tensors. That means we cannot directly pass a scalar loss to it; a toy example follows this list. (I don't know why, since there seems to be little difference between a scalar loss and a function that returns a scalar.)
    2. Second, I want to obtain the complete Hessian matrix, that is, the second derivatives with respect to all parameters, arranged in a consistent order.
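    A toy illustration of point 1 (my own example, not from the original post): hessian() takes a callable plus example inputs, not a precomputed loss tensor. The Hessian of sum(x**2) is 2I:

    import torch

    def f(x):
        # a simple scalar-valued function of a vector input
        return (x ** 2).sum()

    x = torch.tensor([1.0, 2.0])
    print(torch.autograd.functional.hessian(f, x))
    # tensor([[2., 0.],
    #         [0., 2.]])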

    So here is the solution:

    import torch
    import numpy as np
    from torch.nn import Module
    import torch.nn.functional as F
    
    class Net(Module):
        def __init__(self, h, w):
            super(Net, self).__init__()
            self.c1 = torch.nn.Conv2d(1, 32, 3, 1, 1)
            self.f2 = torch.nn.Linear(32 * h * w, 5)
    
        def forward(self, x):
            x = self.c1(x)
            x = x.view(x.size(0), -1)
            x = self.f2(x)
            return x
    
    def haha(a, b, c, d):
        # reshape the flattened vectors back to the original parameter shapes
        p = [a.view(32, 1, 3, 3), b, c.view(5, 32 * 12 * 12), d]
        x = torch.randn(size=[8, 1, 12, 12], dtype=torch.float32)
        y = torch.randint(0, 5, [8])
        x = F.conv2d(x, p[0], p[1], 1, 1)
        x = x.view(x.size(0), -1)
        x = F.linear(x, p[2], p[3])
        loss = F.cross_entropy(x, y)
        return loss
    
    
    if __name__ == '__main__':
        net = Net(12, 12)
    
        h = torch.autograd.functional.hessian(haha, tuple(p.view(-1) for p in net.parameters()))
        
        # Then we just need to stitch the tensors in h into one big matrix (see the sketch below)
    

    I built a new function haha that works the same way as the neural network Net. Notice that the arguments a, b, c, d are all flattened into one-dimensional vectors, so the tensors in h are all two-dimensional, in a consistent order, and easy to combine into one large Hessian matrix.

    In my example, the shapes of the tensors in h are:

    # blocks pairing c1.weight with c1.weight, c1.bias, f2.weight, f2.bias
    [288, 288]
    [288, 32]
    [288, 23040]
    [288, 5]
    
    # blocks pairing c1.bias with c1.weight, c1.bias, f2.weight, f2.bias
    [32, 288]
    [32, 32]
    [32, 23040]
    [32, 5]
    ...
    
    

    So it is easy to see what each tensor means and which block of the full Hessian matrix it is. All we need to do is allocate a (288+32+23040+5) * (288+32+23040+5) matrix and copy the tensors in h into the corresponding locations.
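    Continuing the script above, here is a minimal sketch of that stitching step (sizes, offsets, and H are illustrative names of mine):

    sizes = [p.numel() for p in net.parameters()]  # [288, 32, 23040, 5]
    offsets = [0]
    for s in sizes:
        offsets.append(offsets[-1] + s)  # running block boundaries
    H = torch.zeros(offsets[-1], offsets[-1])  # 23365 x 23365
    for i in range(len(sizes)):
        for j in range(len(sizes)):
            # each h[i][j] is already 2-D because the inputs were flattened
            H[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]] = h[i][j]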

    I think the solution could still be improved: ideally we would not need to build a function that mirrors the neural network, or reshape the parameters twice. One possible direction is sketched below; if there is a better solution, please let me know.
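    One possible simplification along those lines (a sketch of mine, not from the original answer): pass the parameters in their natural shapes, as in the question's func_, and flatten each block of h afterwards, so the parameters are reshaped only once:

    params = tuple(net.parameters())
    h = torch.autograd.functional.hessian(func_, params)
    # flatten each block to 2-D, then concatenate the blocks into the full matrix
    H = torch.cat([
        torch.cat([h[i][j].reshape(params[i].numel(), params[j].numel())
                   for j in range(len(params))], dim=1)
        for i in range(len(params))
    ], dim=0)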