Where do I call the BatchNormalization function in Keras?

If I want to use the BatchNormalization function in Keras, then do I need to call it once only at the beginning?

I read this documentation for it: http://keras.io/layers/normalization/

I don't see where I'm supposed to call it. Below is my code attempting to use it:

model = Sequential()
keras.layers.normalization.BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None)
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(2, init='uniform'))
model.add(Activation('softmax'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, validation_split=0.2, verbose = 2)

I ask because when I run the code with the second line (the batch normalization call) and when I run it without that line, I get similar outputs. So either I'm not calling the function in the right place, or I guess it doesn't make that much of a difference.

Balloon answered 11/1, 2016 at 7:47 Comment(0)

Just to answer this question in a little more detail, and as Pavel said, Batch Normalization is just another layer, so you can use it as such to create your desired network architecture.

The general use case is to use BN between the linear and non-linear layers in your network, because it normalizes the inputs to your activation function so that they are centered in the linear region of the activation function (such as a sigmoid). There's a small discussion of it here.

In your case above, this might look like:


# imports (in newer Keras versions, BatchNormalization is imported from keras.layers instead)
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD

# instantiate model
model = Sequential()

# we can think of this chunk as the input layer
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))

# we can think of this chunk as the hidden layer    
model.add(Dense(64, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))

# we can think of this chunk as the output layer
model.add(Dense(2, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('softmax'))

# setting up the optimization of our weights 
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)

# running the fitting
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, validation_split=0.2, verbose = 2)

Hope this clarifies things a bit more.

Stark answered 22/6, 2016 at 22:40 Comment(13)
FYI, apparently batch normalization works better in practice after the activation function. – Bucktooth
Hi @Claudiu, would you mind expanding on this FYI? It appears to directly contradict the answer above. – Calibre
@benogorek: sure thing, basically I based it entirely on the results here, where placing the batch norm after the ReLU performed better. FWIW I haven't had success applying it one way or the other on the one net I've tried. – Bucktooth
Interesting. Just to follow up, if you continue to read on in that summary, it says that their best model [GoogLeNet128_BN_lim0606] actually has the BN layer BEFORE the ReLU. So while BN after Activation might improve accuracy in an isolated case, when the whole model is constructed, before performed best. Placing BN after Activation could still improve accuracy, but that is likely problem dependent. – Stark
Wow, that benchmark is really interesting. Does anyone have any intuition regarding what the hell is going on there? Why would it be better to offset and scale activations after a nonlinearity? Is it that the betas and gammas have to deal with less variety or something, and thus the model generalizes better when training data is not abundant? – Cookshop
@CarlThomé kind of. See this reddit comment by ReginaldIII, for instance. They state: "BN is normalizing the distribution of features coming out of a convolution; some of these features might be negative and truncated by a non-linearity like ReLU. If you normalize before activation you are including these negative values in the normalization immediately before culling them from the feature space. BN after activation will normalize the positive features without statistically biasing them with features that do not make it through to the next convolutional layer." – Crippling
However, ReginaldIII's argument might not be applicable to sigmoids and other non-linearities with large saturated regions, as indirectly suggested in this tutorial. – Crippling
According to this paper (arxiv.org/abs/1801.05134), one should not use batch normalization after dropout. Currently I have no opinion on this issue, but what is your experience? Also, use_bias=False should be used in the dense layers. – Bibliophile
I think batch normalization after the activation would also make sense if you're applying regularization in the subsequent layer; this way the regularization will not penalize neurons differently based on their variance. – Upi
Does it make any sense to put batch normalization before softmax? There is no next conv or dense layer for which it would be beneficial. – Discompose
@Bucktooth oh wow, what you said is very true in my case: I got 58% applying batch normalization before the activation function and 65% when applying it after. I wonder why? – Somali
Importing BatchNormalization should be from keras.layers import BatchNormalization, I guess? – Gabbert
Yes, "from keras.layers import BatchNormalization". – Spastic

This thread is misleading. I tried commenting on Lucas Ramadan's answer, but I don't have the right privileges yet, so I'll just put this here.

Batch normalization works best after the activation function, and here or here is why: it was developed to prevent internal covariate shift. Internal covariate shift occurs when the distribution of the activations of a layer shifts significantly throughout training. Batch normalization is used so that the distribution of the inputs (and these inputs are literally the result of an activation function) to a specific layer doesn't change over time due to parameter updates from each batch (or at least, allows it to change in an advantageous way). It uses batch statistics to do the normalizing, and then uses the batch normalization parameters (gamma and beta in the original paper) "to make sure that the transformation inserted in the network can represent the identity transform" (quote from original paper). But the point is that we're trying to normalize the inputs to a layer, so it should always go immediately before the next layer in the network. Whether or not that's after an activation function is dependent on the architecture in question.
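For comparison with the accepted answer's code, a minimal sketch of this "after the activation" placement (layer sizes borrowed from the question, dropout omitted for brevity) could look like:

from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization

model = Sequential()
model.add(Dense(64, input_dim=14))
model.add(Activation('tanh'))
model.add(BatchNormalization())  # normalizes the activations, i.e. the inputs to the next Dense layer
model.add(Dense(2))
model.add(Activation('softmax'))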

Bluebell answered 10/8, 2017 at 22:17 Comment(4)
I just saw in the deeplearning.ai class that Andrew Ng says there is a debate on this in the deep learning community. He prefers applying batch normalization before the non-linearity. – Mcdaniel
@kRazzyR I meant that Prof. Andrew Ng talked about this topic in his deep learning classes on deeplearning.ai. He said that the community is divided on the right way of doing things and that he preferred applying batch normalization before the non-linearity. – Mcdaniel
@jmancuso, BN is applied before the activation. From the paper itself, the equation is g(BN(Wx + b)), where g is the activation function. – Dread
Before or after is something to test empirically; nobody knows in advance which is better in practice. But theoretically, yes, before the non-linearity makes more sense. – Albertinaalbertine

This thread has some considerable debate about whether BN should be applied before the non-linearity of the current layer, or to the activations of the previous layer.

Although there is no single correct answer, the authors of Batch Normalization say that it should be applied immediately before the non-linearity of the current layer. The reason, quoted from the original paper:

"We add the BN transform immediately before the nonlinearity, by normalizing x = Wu+b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyv¨arinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution."

Versify answered 20/8, 2017 at 12:54 Comment(1)
In my own personal experience, it doesn't make a huge difference, but all else being equal, I've always seen BN perform slightly better when it is applied before the non-linearity (the activation function). – Somaliland

Keras now supports the use_bias=False option, so we can save some computation (the bias becomes redundant with batch norm's β offset) by writing, for example:

model.add(Dense(64, use_bias=False))
model.add(BatchNormalization(axis=bn_axis))
model.add(Activation('tanh'))

or

model.add(Convolution2D(64, 3, 3, use_bias=False))
model.add(BatchNormalization(axis=bn_axis))
model.add(Activation('relu'))
Guttle answered 29/12, 2016 at 7:42 Comment(4)
How is model.add(BatchNormalization()) different from model.add(BatchNormalization(axis=bn_axis))? – Shaum
@kRazzR it doesn't differ if you are using TensorFlow as the backend. It's written here because he copied this from the keras.applications module, where bn_axis needs to be specified in order to support both channels_first and channels_last formats. – Woodchuck
Can someone please elaborate on how this relates to the OP's question? (I am rather a beginner to NNs, so maybe I'm missing something.) – Chrominance
This answer is irrelevant to the OP's question. – Gabbert

It's almost become a trend now to have a Conv2D followed by a ReLU followed by a BatchNormalization layer, so I made up a small function to call all of them at once. It makes the model definition look a whole lot cleaner and easier to read.

def Conv2DReluBatchNorm(n_filter, w_filter, h_filter, inputs):
    conv = Convolution2D(n_filter, w_filter, h_filter, border_mode='same')(inputs)
    relu = Activation('relu')(conv)
    return BatchNormalization()(relu)
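For example, a usage sketch with the same older functional API that the helper assumes (the input shape and filter counts here are just placeholders):

from keras.layers import Input, Activation, Convolution2D
from keras.layers.normalization import BatchNormalization
from keras.models import Model

inputs = Input(shape=(64, 64, 3))           # assumed image shape
x = Conv2DReluBatchNorm(32, 3, 3, inputs)   # 32 filters, 3x3 kernels
x = Conv2DReluBatchNorm(64, 3, 3, x)
model = Model(input=inputs, output=x)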
Cacilia answered 14/12, 2016 at 16:2 Comment(1)
maybe push this to Keras? – Ferrick

Batch Normalization is used to normalize the input layer as well as the hidden layers by adjusting the mean and scale of the activations. Because of this normalizing effect of the additional layers, a deep neural network can use a higher learning rate without vanishing or exploding gradients. Furthermore, batch normalization regularizes the network so that it is easier to generalize, which reduces the need for dropout to mitigate overfitting.

Right after calculating the linear function using, say, Dense() or Conv2D() in Keras, we add BatchNormalization(), which normalizes that linear output, and then we add the non-linearity to the layer using Activation().

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers.normalization import BatchNormalization
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(2, init='uniform'))
model.add(BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None))
model.add(Activation('softmax'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, 
validation_split=0.2, verbose = 2)

How is Batch Normalization applied?

Suppose we have input a[l-1] to a layer l. We also have weights W[l] and a bias unit b[l] for layer l. Let a[l] be the activation vector calculated for layer l (i.e. after applying the non-linearity), and let z[l] be the vector before applying the non-linearity.

  1. Using a[l-1] and W[l] we can calculate z[l] for layer l.
  2. Usually in feed-forward propagation we would add the bias unit to z[l] at this stage, as z[l] + b[l], but in Batch Normalization this addition of b[l] is not required and no b[l] parameter is used.
  3. Calculate the mean of z[l] and subtract it from each element.
  4. Divide (z[l] - mean) by the standard deviation, and call the result Z_temp[l].
  5. Now define new parameters γ and β that rescale and shift the hidden layer as follows:

    z_norm[l] = γ * Z_temp[l] + β

In this code excerpt, Dense() takes a[l-1], uses W[l] and calculates z[l]. Then the immediately following BatchNormalization() performs the steps above to give z_norm[l], and the Activation() after it calculates tanh(z_norm[l]) to give a[l], i.e.

a[l] = tanh(z_norm[l])
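As a rough NumPy sketch of steps 1-5 above (illustrative only; it ignores the moving averages that Keras additionally tracks for use at inference time):

import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-6):
    # z has shape (batch_size, units), i.e. z[l] = a[l-1] @ W[l] with no bias term
    mean = z.mean(axis=0)                  # step 3: per-unit batch mean
    z_centred = z - mean                   # subtract the mean from each element
    std = np.sqrt(z.var(axis=0) + eps)     # step 4: (stabilized) standard deviation
    z_temp = z_centred / std               # Z_temp[l]
    return gamma * z_temp + beta           # step 5: z_norm[l] = gamma * Z_temp[l] + beta

# a[l] = np.tanh(batch_norm_forward(z, gamma, beta))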
Disaccharide answered 9/4, 2019 at 8:8 Comment(0)

It is another type of layer, so you should add it as a layer in an appropriate place in your model:

model.add(keras.layers.normalization.BatchNormalization())

See an example here: https://github.com/fchollet/keras/blob/master/examples/kaggle_otto_nn.py

Richards answered 12/1, 2016 at 18:31 Comment(3)
After I added BatchNormalization, the val_acc stopped increasing; it stayed stagnant at the same number after every epoch. I thought Batch Normalization was supposed to increase the val_acc. How do I know if it is working properly? Do you know what may have caused this? – Balloon
Unfortunately the link is no longer valid :( – Trexler
There are copies of that example in forks of Keras (e.g. github.com/WenchenLi/kaggle/blob/master/otto/keras/…), but I don't know why it was removed from the original Keras repo, or whether the code is compatible with the latest Keras versions. – Richards

Adding another entry for the debate about whether batch normalization should be called before or after the non-linear activation:

In addition to the original paper using batch normalization before the activation, the Deep Learning book by Goodfellow, Bengio and Courville (section 8.7.1) gives some reasoning for why applying batch normalization after the activation (or directly before the input to the next layer) may cause some issues:

It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW+b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW+b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparameterization. The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.

In other words, if we use a relu activation, all negative values are mapped to zero. This will likely result in a mean value that is already very close to zero, but the distribution of the remaining data will be heavily skewed to the right. Trying to normalize that data to a nice bell-shaped curve probably won't give the best results. For activations outside of the relu family this may not be as big of an issue.

Keep in mind that there are reports of models getting better results when using batch normalization after the activation, while others get best results when the batch normalization is placed before the activation. It is probably best to test your model using both configurations, and if batch normalization after activation gives a significant decrease in validation loss, use that configuration instead.

Toulon answered 15/1, 2021 at 18:8 Comment(0)
