python machine-learning deep-learning pytorch activation-function

What is the best choice of activation function for small neural networks?


I am using PyTorch and autograd to build my neural network architecture. It is a small 3-layer network with a single input and a single output. I have to predict an output function from some initial conditions, and I am using a custom loss function.
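For reference, my setup looks roughly like the sketch below (the hidden width, optimizer, training inputs, and the loss are placeholders; my real loss is a custom one):

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    """Small 3-layer network: single input, single output, tanh activations."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

model = SmallNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.linspace(0.0, 1.0, 100).unsqueeze(1)   # sample inputs

for step in range(2000):
    optimizer.zero_grad()
    loss = (model(x) ** 2).mean()   # stand-in for the custom loss
    loss.backward()
    optimizer.step()
```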

The problem I am facing is:

  1. The loss converges initially, but the gradients eventually vanish.

  2. I have tried the sigmoid and tanh activations; tanh gives slightly better results in terms of loss convergence.

  3. I tried ReLU, but since my network has so few weights, the ReLU neurons die and the results are poor.

Is there any activation function other than sigmoid and tanh that handles the vanishing gradient problem well for small neural networks? Any suggestions on what else I can try?


Solution

  • In the deep learning world, ReLU is usually preferred over other activation functions because it mitigates the vanishing gradient problem, allowing models to learn faster and perform better. But it has a well-known downside.

    Dying ReLU problem

    The dying ReLU problem refers to the scenario in which a large number of ReLU neurons only ever output 0. Once a neuron's output is stuck at zero, no gradient flows through it during backpropagation, so its weights never get updated. Ultimately a large part of the network becomes inactive and is unable to learn further.
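    A quick way to see whether this is happening in your own network is to check what fraction of each ReLU layer's outputs are exactly zero over a batch. A minimal sketch (the architecture and inputs below are made up for illustration):

    ```python
    import torch
    import torch.nn as nn

    # Hypothetical small ReLU network and inputs, just to illustrate the check.
    model = nn.Sequential(
        nn.Linear(1, 16), nn.ReLU(),
        nn.Linear(16, 16), nn.ReLU(),
        nn.Linear(16, 1),
    )
    x = torch.linspace(-1.0, 1.0, 256).unsqueeze(1)

    # Capture each ReLU's output with a forward hook.
    activations = {}
    def make_hook(name):
        def hook(module, inputs, output):
            activations[name] = output.detach()
        return hook

    for idx, layer in enumerate(model):
        if isinstance(layer, nn.ReLU):
            layer.register_forward_hook(make_hook(f"relu_{idx}"))

    model(x)
    for name, act in activations.items():
        dead = (act == 0).all(dim=0)  # units that output 0 for every input in the batch
        print(f"{name}: {int(dead.sum())}/{act.shape[1]} units dead on this batch")
    ```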

    What causes the Dying ReLU problem?

    The usual culprits are a learning rate that is too high or a large negative bias: one big gradient step can push a unit's weights to where its pre-activation is negative for every input, and from then on both its output and its gradient are zero, so the unit never recovers.

    How to solve the Dying ReLU problem?

    Common fixes are lowering the learning rate, using a suitable weight initialization (e.g. He initialization), or replacing ReLU with a variant that keeps a small non-zero gradient for negative inputs, such as Leaky ReLU, ELU, GELU or SiLU; see the sketch below.
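    For the 3-layer, single-input/single-output case in the question, swapping the activation is a one-line change in PyTorch (a sketch, with an assumed hidden width of 32):

    ```python
    import torch.nn as nn

    def make_net(activation):
        # Same small architecture; only the activation changes.
        return nn.Sequential(
            nn.Linear(1, 32), activation(),
            nn.Linear(32, 32), activation(),
            nn.Linear(32, 1),
        )

    leaky_net = make_net(nn.LeakyReLU)  # x if x > 0 else 0.01 * x (default negative_slope=0.01)
    elu_net = make_net(nn.ELU)          # smooth, saturates to -1 for large negative inputs
    silu_net = make_net(nn.SiLU)        # x * sigmoid(x), smooth and non-monotonic
    ```

    Combined with a smaller learning rate, any of these keeps gradients flowing for negative pre-activations, so units can recover instead of dying.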

    Sources:

    Practical guide for ReLU

    ReLU variants

    Dying ReLU problem