I have a matrix A of dimension 1000x70000. My loss function includes A, and I want to find the optimal value of A using gradient descent, where the constraint is that the rows of A remain in the probability simplex (i.e. every row sums to 1). I have initialised A as given below:
import numpy as np
import torch
A = np.random.dirichlet(np.ones(70000), 1000)  # each row drawn from a Dirichlet, so it sums to 1
A = torch.tensor(A, requires_grad=True)
and my training loop looks like this:
for epoch in range(500):
    y_pred = forward(X)
    y = model(torch.mm(A.float(), X))
    l = loss(y, y_pred)
    l.backward()
    A.grad.data = -A.grad.data
    optimizer.step()
    optimizer.zero_grad()
    if epoch % 2 == 0:
        print("Loss", l, "\n")
An easy way to accomplish that is not to use A directly in the computation, but to use a row-normalized version of A.
# you can keep 'A' unconstrained
A = torch.rand(1000, 70000, requires_grad=True)
then divide each row by its sum, so that every row always sums to 1:
for epoch in range(500):
    y_pred = forward(X)
    B = A / A.sum(-1, keepdim=True)  # normalize rows manually
    y = model(torch.mm(B, X))
    l = loss(y, y_pred)
    ...
So now, at each step, B is the constrained matrix, i.e. the quantity you are actually interested in. However, the optimization is still performed on the (unconstrained) A.
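As a quick sanity check (a hypothetical snippet, not part of the original answer), you can confirm after any update that every row of B still sums to 1:

row_sums = B.sum(-1)
assert torch.allclose(row_sums, torch.ones_like(row_sums))  # rows of B sum to 1 no matter how A changes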
Edit: @Umang Gupta reminded me in the comment section that OP wanted a "probability simplex", which means there is another constraint, i.e. A >= 0.
To accomplish that, you may simply apply an appropriate activation function (e.g. torch.exp, torch.sigmoid) to A in each iteration:
A_ = torch.exp(A)
B = A_ / A_.sum(-1, keepdim=True) # normalize rows
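Note that torch.exp followed by row normalization is exactly a row-wise softmax, so the two lines above can also be written as a single call (just a shorthand, the behaviour is identical):

B = torch.softmax(A, dim=-1)  # rows are non-negative and sum to 1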
The exact choice of function depends on the training dynamics, which you will need to experiment with.
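Putting the pieces together, here is a minimal end-to-end sketch of the idea. The X, model, loss and target below are placeholders standing in for whatever you already have; only the handling of A / B is the point:

import torch

# placeholders for the actual data, model and loss
X = torch.randn(70000, 10)
model = torch.nn.Linear(10, 1)
loss = torch.nn.MSELoss()
y_target = torch.randn(1000, 1)

A = torch.rand(1000, 70000, requires_grad=True)  # unconstrained parameter
optimizer = torch.optim.SGD([A], lr=1e-2)

for epoch in range(500):
    B = torch.softmax(A, dim=-1)      # constrained matrix: rows >= 0 and sum to 1
    y = model(torch.mm(B, X))
    l = loss(y, y_target)
    optimizer.zero_grad()
    l.backward()
    optimizer.step()
    if epoch % 2 == 0:
        print("Loss", l.item())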