deep-learning, nlp, softmax, attention-model

Why does softmax get a small gradient when the values are large, in the paper 'Attention Is All You Need'?


This is a screenshot of the relevant passage from the original paper (it reads: "We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients"). I understand the paper to mean that when the dot-product values are large, the gradient of the softmax becomes very small.
However, when I tried to calculate the gradient of softmax with the cross-entropy loss, I found that the gradient of the softmax is not directly related to the magnitude of the values passed to it.
Even if a single value is large, it can still receive a large gradient as long as the other values are also large. (Sorry, I don't know how to typeset the calculation here.)
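To see the distinction numerically (this sketch is mine, not part of the original question; it assumes a plain numpy softmax):

```python
import numpy as np

# Softmax is invariant to adding a constant to every logit, but not to
# scaling the logits: scaling widens the gaps between the scores, which
# is what large dot products do, and pushes softmax into saturation.
def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

x = np.array([2.0, 1.0, 0.0])
print(softmax(x))          # [0.665 0.245 0.090] -- a soft distribution
print(softmax(x + 100.0))  # identical: making all values large changes nothing
print(softmax(x * 100.0))  # [1. 0. 0.]: saturated, gradients near zero
```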


Solution

  • Actually, the gradient of cross entropy with softmax on a one-hot encoded target is just d/dx_i [-log(softmax(x)_i)] = softmax(x)_i - 1 at the index i of the correct class (https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/). If the value passed to the softmax is large relative to the other values, the softmax output at that index approaches 1, so the gradient approaches 0. See the sketch below.
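For concreteness, here is a small numeric check (a sketch in PyTorch, not from the original answer; the helper grad_at_target is my own shorthand):

```python
import torch
import torch.nn.functional as F

# Verify that d/dx_i [-log(softmax(x)_i)] = softmax(x)_i - 1 at the target
# index i, and that it vanishes only when one logit dominates the others,
# not merely when all logits are large.
def grad_at_target(logits, target=0):
    x = torch.tensor(logits, requires_grad=True)
    F.cross_entropy(x.unsqueeze(0), torch.tensor([target])).backward()
    return torch.softmax(x, dim=0)[target].item(), x.grad[target].item()

# All values large but close together: softmax stays soft, gradient stays large.
print(grad_at_target([100.0, 99.5, 99.0]))  # (~0.51, ~-0.49)
# One value dominates: softmax saturates at 1, gradient goes to 0.
print(grad_at_target([100.0, 50.0, 25.0]))  # (~1.0, ~0.0)
```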