from keras.layers import Convolution2D, BatchNormalization
from keras import backend as K

def conv2d_bn(x, nb_filter, nb_row, nb_col,
              border_mode='same', subsample=(1, 1),
              name=None):
    '''Utility function to apply conv + BN.'''
    conv_name = name + '_conv' if name else None
    bn_name = name + '_bn' if name else None
    # channel axis: 1 for Theano dim ordering, 3 for TensorFlow
    bn_axis = 1 if K.image_dim_ordering() == 'th' else 3
    x = Convolution2D(nb_filter, nb_row, nb_col,
                      subsample=subsample,
                      activation='relu',
                      border_mode=border_mode,
                      name=conv_name)(x)
    x = BatchNormalization(axis=bn_axis, name=bn_name)(x)
    return x
When I use the official inception_v3 model in Keras, I find that they apply BatchNormalization after the 'relu' nonlinearity, as in the code above.
But in the Batch Normalization paper, the authors say:
"we add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b."
Then I looked at the TensorFlow implementation of Inception, which adds BN immediately before the nonlinearity, exactly as the paper describes. For more details, see inception ops.py.
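If I read it correctly, the Slim-style layers used by the TensorFlow model compose it roughly like this (a minimal sketch assuming TF 1.x with tf.contrib.slim available; conv_bn_relu is just my own illustrative name):

import tensorflow as tf
slim = tf.contrib.slim

def conv_bn_relu(inputs, num_outputs, kernel_size):
    # slim.conv2d applies normalizer_fn between the convolution and
    # activation_fn, i.e. conv -> batch_norm -> relu
    return slim.conv2d(inputs, num_outputs, kernel_size,
                       normalizer_fn=slim.batch_norm,
                       activation_fn=tf.nn.relu)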
I'm confused. Why does the Keras code use the style above rather than the following?
from keras.layers import Activation

def conv2d_bn(x, nb_filter, nb_row, nb_col,
              border_mode='same', subsample=(1, 1),
              name=None):
    '''Utility function to apply conv + BN, with BN before the nonlinearity.'''
    conv_name = name + '_conv' if name else None
    bn_name = name + '_bn' if name else None
    bn_axis = 1 if K.image_dim_ordering() == 'th' else 3
    x = Convolution2D(nb_filter, nb_row, nb_col,
                      subsample=subsample,
                      border_mode=border_mode,
                      name=conv_name)(x)
    x = BatchNormalization(axis=bn_axis, name=bn_name)(x)
    x = Activation('relu')(x)
    return x
In the Dense case:
x = Dense(1024, name='fc')(x)
x = BatchNormalization(name='fc_bn')(x)  # default axis=-1 is the feature axis of the Dense output
x = Activation('relu')(x)
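For what it's worth, this is how I would wire that BN-before-activation variant into a small model (a minimal sketch with made-up layer sizes and names, just to show that the only change is dropping the layer-level activation and adding Activation('relu') after BatchNormalization):

from keras.layers import Input, Dense, BatchNormalization, Activation
from keras.models import Model

inputs = Input(shape=(2048,))          # e.g. pooled features before the classifier
x = Dense(1024, name='fc')(inputs)     # no activation here
x = BatchNormalization(name='fc_bn')(x)
x = Activation('relu', name='fc_relu')(x)
predictions = Dense(10, activation='softmax', name='predictions')(x)
model = Model(input=inputs, output=predictions)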