I have a matrix A of dimension 1000x70000. My loss function includes A, and I want to find the optimal value of A using gradient descent, where the constraint is that the rows of A remain in the probability simplex (i.e. every row sums to 1). I have initialised A as given below:
import numpy as np
import torch
A = np.random.dirichlet(np.ones(70000), 1000)  # each row drawn from a Dirichlet, so it sums to 1
A = torch.tensor(A, requires_grad=True)
and my training loop looks like this:
for epoch in range(500):
    y_pred = forward(X)
    y = model(torch.mm(A.float(), X))
    l = loss(y, y_pred)
    l.backward()
    A.grad.data = -A.grad.data
    optimizer.step()
    optimizer.zero_grad()
    if epoch % 2 == 0:
        print("Loss", l, "\n")
An easy way to accomplish that is not to use A directly in the computation, but to use a row-normalized version of A.
# you can keep 'A' unconstrained
A = torch.rand(1000, 70000, requires_grad=True)
then divide each row by its sum, so that every row always sums to 1:
for epoch in range(500):
    y_pred = forward(X)
    B = A / A.sum(-1, keepdim=True)  # normalize rows manually
    y = model(torch.mm(B, X))
    l = loss(y, y_pred)
    ...
So now, at each step, B is the constrained matrix, i.e. the quantity you are actually interested in. However, the optimization is still performed on the (unconstrained) A.
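As a quick sanity check (a hypothetical snippet, not part of the original answer), you can confirm after any update that every row of B still sums to 1:

row_sums = B.sum(-1)
assert torch.allclose(row_sums, torch.ones_like(row_sums))  # rows of B sum to 1 no matter how A changes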
Edit: @Umang Gupta reminded me in the comment section that OP wanted a "probability simplex", which means there is another constraint, i.e. A >= 0.
To accomplish that, you may simply apply an appropriate activation function (e.g. torch.exp, torch.sigmoid) to A in each iteration:
A_ = torch.exp(A)
B = A_ / A_.sum(-1, keepdim=True) # normalize rows
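Note that torch.exp followed by row normalization is exactly a row-wise softmax, so the two lines above can also be written as a single call (just a shorthand, the behaviour is identical):

B = torch.softmax(A, dim=-1)  # rows are non-negative and sum to 1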
The exact choice of function depends on the training dynamics, which you will need to experiment with.
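Putting the pieces together, here is a minimal end-to-end sketch of the idea. The X, model, loss and target below are placeholders standing in for whatever you already have; only the handling of A / B is the point:

import torch

# placeholders for the actual data, model and loss
X = torch.randn(70000, 10)
model = torch.nn.Linear(10, 1)
loss = torch.nn.MSELoss()
y_target = torch.randn(1000, 1)

A = torch.rand(1000, 70000, requires_grad=True)  # unconstrained parameter
optimizer = torch.optim.SGD([A], lr=1e-2)

for epoch in range(500):
    B = torch.softmax(A, dim=-1)      # constrained matrix: rows >= 0 and sum to 1
    y = model(torch.mm(B, X))
    l = loss(y, y_target)
    optimizer.zero_grad()
    l.backward()
    optimizer.step()
    if epoch % 2 == 0:
        print("Loss", l.item())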