Tags: optimization, pytorch, probability, simplex

Projected gradient descent on the probability simplex in PyTorch


I have a matrix A of dimension 1000x70000. My loss function includes A, and I want to find the optimal value of A using gradient descent, subject to the constraint that the rows of A stay in the probability simplex (i.e. every row sums to 1). I have initialized A as given below:

import numpy as np
import torch

A = np.random.dirichlet(np.ones(70000), 1000)  # each row is drawn from a Dirichlet, so it sums to 1
A = torch.tensor(A, requires_grad=True)

and my training loop looks like this:

for epoch in range(500):
    y_pred = forward(X)
    y = model(torch.mm(A.float(), X))
    l = loss(y, y_pred)
    l.backward()
    A.grad.data = -A.grad.data  # flip the gradient sign before the update
    optimizer.step()
    optimizer.zero_grad()
    if epoch % 2 == 0:
        print("Loss", l, "\n")

Solution

  • An easy way to accomplish this is not to use A directly in the computation, but a row-normalized version of A.

    # you can keep 'A' unconstrained
    A = torch.rand(1000, 70000, requires_grad=True)
    

    then divide each row by its sum (so every row always sums to 1):

    for epoch in range(500):
        y_pred = forward(X)
        B = A / A.sum(-1, keepdim=True) # normalize rows manually
        y = model(torch.mm(B, X))
        l = loss(y, y_pred)
        ...
    

    So now, at each step, B is the constrained matrix, i.e. the quantity of interest. However, the optimization is still performed on the unconstrained A. A complete sketch of this setup is given below.
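
    For completeness, here is a minimal runnable sketch of this approach. The shapes are scaled down and the model, loss, data, and learning rate are placeholders (the question does not specify them); only the handling of A and B matters here. This sketch does not yet enforce non-negativity, which is addressed in the edit below.

    import torch

    n_rows, n_cols, n_feat = 10, 700, 16          # scaled-down stand-ins for 1000, 70000 and a feature size
    X = torch.randn(n_cols, n_feat)               # stand-in for the real X
    model = torch.nn.Linear(n_feat, 1)            # stand-in for the real model
    loss = torch.nn.MSELoss()
    y_pred = torch.randn(n_rows, 1)               # stand-in for forward(X)

    A = torch.rand(n_rows, n_cols, requires_grad=True)   # unconstrained parameter
    optimizer = torch.optim.SGD([A], lr=1e-2)

    for epoch in range(500):
        B = A / A.sum(-1, keepdim=True)           # row-normalized view of A (rows sum to 1)
        y = model(torch.mm(B, X))
        l = loss(y, y_pred)
        optimizer.zero_grad()
        l.backward()                              # gradients flow through B back into A
        optimizer.step()                          # the update is applied to the unconstrained A
        if epoch % 2 == 0:
            print("Loss", l.item())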


    Edit: @Umang Gupta reminded me in the comment section that the OP wanted the rows to lie on the "probability simplex", which adds another constraint, namely A >= 0.

    To accomplish that, you can simply apply an appropriate positive activation function (e.g. torch.exp, torch.sigmoid) to A in each iteration:

    A_ = torch.exp(A)
    B = A_ / A_.sum(-1, keepdim=True) # normalize rows
    

    The exact choice of function depends on the training dynamics, which you will need to experiment with.
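
    As a side note, exponentiating A and normalizing each row is exactly a row-wise softmax, so the torch.exp variant can be written in a single call. A small equivalence check (with a made-up shape):

    import torch

    A = torch.randn(4, 6, requires_grad=True)    # small made-up shape for the check

    A_ = torch.exp(A)
    B_manual = A_ / A_.sum(-1, keepdim=True)     # exp followed by row normalization
    B_softmax = torch.softmax(A, dim=-1)         # equivalent single call

    print(torch.allclose(B_manual, B_softmax))   # True (up to floating point error)
    print(B_softmax.sum(-1))                     # every row sums to 1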