Why does the gated activation function (used in WaveNet) work better than a ReLU?

I have recently been reading the WaveNet and PixelCNN papers, and both of them mention that gated activation functions work better than a ReLU, but neither paper offers an explanation as to why that is.

I have asked on other platforms (such as r/MachineLearning) but have not gotten any replies so far. Could it be that they simply tried this replacement by chance and it turned out to yield favorable results?

Function for reference: y = tanh(W_{k,f} ∗ x) ⊙ σ(W_{k,g} ∗ x)

That is, an element-wise multiplication between the tanh of one convolution (the filter) and the sigmoid of another (the gate).
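For concreteness, here is how I would write that activation in NumPy (my own toy sketch, not code from either paper; W_f and W_g stand in for the filter and gate convolution weights of layer k):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gated_activation(x, W_f, W_g):
        # y = tanh(W_f x) ⊙ σ(W_g x): the tanh branch produces candidate
        # features, the sigmoid branch gates them element-wise.
        return np.tanh(W_f @ x) * sigmoid(W_g @ x)

    # toy example: 4 input features, 3 output channels
    rng = np.random.default_rng(0)
    x = rng.normal(size=4)
    W_f = rng.normal(size=(3, 4))  # "filter" weights
    W_g = rng.normal(size=(3, 4))  # "gate" weights
    print(gated_activation(x, W_f, W_g))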

Devout answered 9/5, 2019 at 14:18 Comment(0)

I did some digging and talked some more with a friend, who pointed me towards the paper by Dauphin et al., "Language Modeling with Gated Convolutional Networks". Section 3 of that paper gives a good explanation of this topic:

LSTMs enable long-term memory via a separate cell controlled by input and forget gates. This allows information to flow unimpeded through potentially many timesteps. Without these gates, information could easily vanish through the transformations of each timestep.

In contrast, convolutional networks do not suffer from the same kind of vanishing gradient and we find experimentally that they do not require forget gates. Therefore, we consider models possessing solely output gates, which allow the network to control what information should be propagated through the hierarchy of layers.

In other words, they adopted the concept of gates from LSTMs and applied it to stacked convolutional layers to control what kind of information is let through, and apparently this works better than using a ReLU.
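To make that concrete, here is a rough PyTorch sketch of such an output-gated convolution (my own toy code, not the authors' implementation; WaveNet additionally uses causal dilated convolutions, residual/skip connections, and conditioning, which I leave out):

    import torch
    import torch.nn as nn

    class GatedConv1d(nn.Module):
        """Output-gated convolution: tanh(conv_f(x)) * sigmoid(conv_g(x))."""
        def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
            super().__init__()
            padding = (kernel_size - 1) * dilation // 2  # keep sequence length unchanged
            self.conv_f = nn.Conv1d(in_channels, out_channels, kernel_size,
                                    padding=padding, dilation=dilation)
            self.conv_g = nn.Conv1d(in_channels, out_channels, kernel_size,
                                    padding=padding, dilation=dilation)

        def forward(self, x):
            # x: (batch, channels, time); the sigmoid branch decides, per element,
            # how much of the tanh branch's output gets propagated up the stack.
            return torch.tanh(self.conv_f(x)) * torch.sigmoid(self.conv_g(x))

    # usage: a batch of 8 sequences with 16 channels and 100 timesteps
    layer = GatedConv1d(16, 32, kernel_size=3, dilation=2)
    out = layer(torch.randn(8, 16, 100))
    print(out.shape)  # torch.Size([8, 32, 100])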

edit: But WHY it works better, I still don't know. If anyone could give me an even remotely intuitive answer I would be grateful. I looked around a bit more, and apparently we are still basing our judgement on trial and error.

Devout answered 9/5, 2019 at 14:59 Comment(1)
did you ever find out? – Maneating

I believe it's because it's highly non-linear near zero, unlike ReLU. With their gated activation function [tanh(W1 * x) * sigmoid(W2 * x)] you get a function that has some interesting bends in the [-1, 1] range.

Don't forget that this isn't operating on the raw feature space but on a matrix multiplication of it, so it's not just "bigger feature values do this, smaller feature values do that"; rather, it operates on the outputs of a linear transform of the feature space.

Basically it chooses regions to highlight, regions to ignore, and does so flexibly (and non-linearly) thanks to the activation.

https://www.desmos.com/calculator/owmzbnonlh , see "c" function.

This allows the model to separate the data in the gated attention space.
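If you don't want to open Desmos, a quick way to see those bends (my own toy snippet, with arbitrary scalar weights w1 and w2) is to evaluate tanh(w1*x) * sigmoid(w2*x) next to ReLU on a grid:

    import numpy as np

    w1, w2 = 2.0, -1.5  # arbitrary weights, chosen only to show the shape
    xs = np.linspace(-3, 3, 7)
    gated = np.tanh(w1 * xs) / (1.0 + np.exp(-w2 * xs))  # tanh(w1*x) * sigmoid(w2*x)
    relu = np.maximum(0.0, xs)
    for x, g, r in zip(xs, gated, relu):
        print(f"x={x:+.1f}  gated={g:+.3f}  relu={r:.3f}")

With these particular weights the gated curve stays bounded and even rises and then falls back toward zero on one side, the kind of bend a ReLU can never produce.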

That's my understanding of it but it is still pretty alchemical to me as well.

Tetanus answered 13/6, 2022 at 9:49 Comment(0)

Part of it may be trial and error, like many other architectural tweaks in deep learning. In this case, I believe the PixelRNN paper showed that PixelRNN outperformed PixelCNN, so the authors looked for reasons why and concluded that PixelRNN's multiplicative activation units (the "gates" in the LSTM) allow it to model more complex interactions. One explanation I've come across of why that may be the case is the following:

The sigmoid's output is between 0 and 1, so it acts as a gating function that decides whether the information passes through (zeroing it out if needed), while the tanh, with its wider range (-1, 1), carries the actual information. Their combination therefore ends up being more expressive than a single activation function (e.g. ReLU).
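A tiny numeric illustration of that reading (my own example with hand-picked pre-activations, not from the course): when the gate's pre-activation is very negative the sigmoid is near 0 and the unit is shut off; when it is very positive the sigmoid is near 1 and the tanh value passes through almost unchanged:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    tanh_pre = np.array([1.2, -0.7, 2.0])   # "content" pre-activations
    gate_pre = np.array([-6.0, 0.0, 6.0])   # "gate" pre-activations
    out = np.tanh(tanh_pre) * sigmoid(gate_pre)
    print(sigmoid(gate_pre))  # ~[0.002, 0.5, 0.998]: closed, half-open, open
    print(out)                # ~[0.002, -0.302, 0.962]: blocked, attenuated, passed through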

Source: Pieter Abbeel's Deep Unsupervised Learning course - YT Video

Itchy answered 7/3, 2023 at 4:10 Comment(0)
