I have recently been reading the WaveNet and PixelCNN papers, and both of them mention that using a gated activation function works better than a ReLU. But neither offers an explanation as to why that is.
I have asked on other platforms (like r/machinelearning) but have not gotten any replies so far. Could it be that they simply tried this replacement by chance and it happened to yield favorable results?
Function for reference: $y = \tanh(W_{k,f} * x) \odot \sigma(W_{k,g} * x)$
That is, an element-wise multiplication between the tanh of one convolution of the input and the sigmoid of a second, separate convolution.
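To make the formula concrete, here is a minimal PyTorch sketch of this gated unit as I understand it from the equation above. The class name, channel counts, kernel size, and padding choices are my own illustration and are not taken from either paper.

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """y = tanh(W_f * x) ⊙ σ(W_g * x): two parallel convolutions of the
    same input, one squashed by tanh (the "filter"), one by a sigmoid
    (the "gate")."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        # filter branch: W_{k,f} * x
        self.conv_f = nn.Conv1d(channels, channels, kernel_size,
                                padding=pad, dilation=dilation)
        # gate branch: W_{k,g} * x
        self.conv_g = nn.Conv1d(channels, channels, kernel_size,
                                padding=pad, dilation=dilation)

    def forward(self, x):
        return torch.tanh(self.conv_f(x)) * torch.sigmoid(self.conv_g(x))

# toy usage: batch of 8 sequences, 16 channels, 100 timesteps
x = torch.randn(8, 16, 100)
y = GatedActivation(16)(x)
print(y.shape)  # torch.Size([8, 16, 100])
```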
I did some digging and talked some more with a friend, who pointed me towards a paper by Dauphin et al., "Language Modeling with Gated Convolutional Networks". It offers a good explanation of this topic in Section 3:
LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep.
In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers.
In other words, they adopted the concept of gates from LSTMs and applied it to stacked convolutional layers, to control what kind of information is let through, and apparently this works better than using a ReLU.
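For comparison, here is a rough sketch of that kind of output gate applied to a stack of 1-D convolutions, in the spirit of the gated linear unit (GLU) from the Dauphin et al. paper: a linear path carrying the signal, multiplied by a sigmoid gate. The class name, shapes, and the non-causal padding are my own and not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GLUConvLayer(nn.Module):
    """h = (x * W) ⊙ σ(x * V): no tanh here; the linear path carries the
    signal and the sigmoid acts as an output gate deciding what gets
    propagated up the hierarchy of layers."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # one convolution producing both the linear and the gate features
        self.conv = nn.Conv1d(in_channels, 2 * out_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)   # split channel-wise
        return a * torch.sigmoid(b)           # same idea as F.glu(..., dim=1)

# stacking a few layers: each layer has its own output gate
net = nn.Sequential(GLUConvLayer(16, 32), GLUConvLayer(32, 32))
y = net(torch.randn(8, 16, 100))
print(y.shape)  # torch.Size([8, 32, 100])
```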
edit: But WHY it works better, I still don't know. If anyone could give me even a remotely intuitive answer, I would be grateful. I have looked around a bit more, and apparently we are still basing our judgment on trial and error.