What is the definition of a non-trainable parameter?

What is the definition of non-trainable parameter in a model?

For example, while you are building your own model, its value is 0 by default, but when you use an Inception model, it becomes something other than 0. What would be the reason behind that?

Traumatize answered 15/11, 2017 at 16:11 Comment(0)

Non-trainable parameters are quite a broad subject. A straightforward example is to consider any specific NN model and its architecture.

Say you have already set up your network definition in Keras, and your architecture is something like 256->500->500->1. Based on this definition, you have a regression model (one output) with two hidden layers (500 nodes each) and an input of size 256.

One non-trainable parameter of your model is, for example, the number of hidden layers itself (2). Others could be the number of nodes per hidden layer (500 in this case), or even the number of nodes on each individual layer, giving you one parameter per layer plus the number of layers itself.

These parameters are "non-trainable" because you can't optimize their values with your training data. Training algorithms (like back-propagation) will optimize and update the weights of your network, which are the actual trainable parameters here (usually several thousand, depending on your connections). Your training data as it is can't help you determine those non-trainable parameters.

However, this does not mean that numberHiddenLayers is not trainable at all; it only means that in this model and its implementation we are unable to do so. We could make numberHiddenLayers trainable: the easiest way would be to define another ML algorithm that takes this model as input and trains it with several values of numberHiddenLayers. The best value is obtained with the model that outperforms the others, thus optimizing the numberHiddenLayers variable.
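
As a rough illustration only (not part of the original answer), a minimal sketch of such an outer search in Keras could look like this; the data names X_train, y_train, X_val and y_val are hypothetical placeholders:

from keras.models import Sequential
from keras.layers import Dense

def build_model(number_hidden_layers):
    # 256 inputs -> number_hidden_layers hidden layers of 500 nodes -> 1 output
    model = Sequential()
    model.add(Dense(500, activation='relu', input_shape=(256,)))
    for _ in range(number_hidden_layers - 1):
        model.add(Dense(500, activation='relu'))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mse')
    return model

best_n, best_loss = None, None
for n in (1, 2, 3, 4):                                  # candidate values of numberHiddenLayers
    model = build_model(n)
    model.fit(X_train, y_train, epochs=10, verbose=0)   # hypothetical training data
    loss = model.evaluate(X_val, y_val, verbose=0)      # hypothetical validation data
    if best_loss is None or loss < best_loss:
        best_n, best_loss = n, loss
print("best numberHiddenLayers:", best_n)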

In other words, the non-trainable parameters of a model are those that you will not update and optimize during training, and which have to be defined a priori, or passed as inputs.

Equipoise answered 15/11, 2017 at 16:42 Comment(3)
I believe there is a confusion here... Network topology and the like (learning rate, dropout rate, etc.) are not 'non-trainable parameters'; they are rather called 'hyperparameters'. Parameters are optimized automatically (using gradient descent) during training, using the training set. Hyperparameters are optimized manually (using the engineer's brain), and evaluated using the dev set.Affirm
Regarding the original question, I believe 'non-trainable parameters' would be, for instance, the mean 'mu' and standard deviation 'sigma' computed in a BatchNorm layer, whereas the parameters 'gamma' and 'beta' are trainable parameters. To sum up: 'trainable parameters' are those whose value is modified according to their gradient (the derivative of the error/loss/cost relative to the parameter), whereas 'non-trainable parameters' are those whose value is not optimized according to their gradient.Affirm
Thanks for the feedback @JulienREINAULD, there is plenty of space for more answers I believe if you feel you want to add something :) By your definition, hyperparameters are also non-trainable (unless you design your algorithm to train over them).Equipoise

In Keras, the non-trainable parameters count (as shown in model.summary()) is the number of weights that are not updated during training with backpropagation.

There are mainly two types of non-trainable weights:

  • The ones that you have chosen to keep constant when training. This means that Keras won't update these weights during training at all.
  • The ones that work like statistics in BatchNormalization layers. They're updated with mean and variance, but they're not "trained with backpropagation".

Weights are the values inside the network that perform the operations and can be adjusted so that the network produces the results we want. The backpropagation algorithm changes the weights towards a lower error.

By default, all weights in a Keras model are trainable.

When you create layers, each layer internally creates its own weights, and they are trainable. (The backpropagation algorithm will update these weights.)

When you make them untrainable, the algorithm will not update these weights anymore. This is useful, for instance, when you want a convolutional layer to apply a specific filter, such as a Sobel filter. You don't want the training to change this operation, so these weights/filters should be kept constant.
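
For illustration, a minimal sketch (not part of the original answer) of a frozen convolutional layer whose kernel is set to a Sobel filter; the input shape is arbitrary:

import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D

# 3x3 Sobel kernel for horizontal gradients, shaped (height, width, in_channels, filters)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype='float32').reshape(3, 3, 1, 1)

model = Sequential()
model.add(Conv2D(1, (3, 3), use_bias=False, trainable=False,   # frozen: never updated by training
                 input_shape=(28, 28, 1)))
model.layers[0].set_weights([sobel_x])                          # install the fixed filter

model.summary()   # the 9 kernel weights are reported as non-trainable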

There are many other reasons why you might want to make weights untrainable.


Changing parameters:

To decide whether weights are trainable or not, you take a layer from the model and set its trainable attribute:

model.get_layer(layerName).trainable = False  # or True

This must be done before compilation.
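
A short usage sketch (not part of the original answer; the layer name frozen_dense is just an illustrative choice) showing the freeze happening before compile:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(100,), name='frozen_dense'))
model.add(Dense(1, name='output'))

model.get_layer('frozen_dense').trainable = False   # freeze before compiling
model.compile(optimizer='adam', loss='mse')

model.summary()   # the 6,464 parameters of 'frozen_dense' show up as non-trainable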

Mutter answered 15/11, 2017 at 16:51 Comment(7)
''There is a lot of other reasons why you might want to make weights untrainable.'' What are these, if you care to explain, please?Efficacious
You may have a "pretrained model" for instance, which you know is working well and you don't want to change. You may be training a GAN and working only one side at a time. There is really a lot of creative reasons, depending on what you want.Minify
Thank you first of all, but what if performance increases when using pretrained models?Efficacious
Usually during transfer learning you set the initial layers' weights to non-trainable so that you can take advantage of pretrained models.Slavish
Do non-trainable variables participate in the backpropagation of other trainable variables? Say I have a 2-layer model, with the 1st layer trainable and the 2nd non-trainable. Then the 2nd layer's output is compared to the target value to calculate the loss. When calculating the gradient for the 1st layer, will the 2nd layer's operation be taken into account? @HSRathoreSorrells
@DavidH.J., yes, naturally. It's impossible to reach the result without passing "through" all layers.Minify
@DavidH.J... non-trainable means you can't update their weights; that means if there's error coming back from the last layers with backprop, all of that error has to be absorbed by the unfrozen last layers and not by the frozen (non-trainable) layers. The initial layers are usually frozen for their ability to learn general features, which they've learned from some other big dataset, and the last layers are unfrozen to be modified as per the new needs. So the 2nd layer being frozen while the 1st layer is trainable is not a typical scenario to imagine.Slavish

There are some details that other answers do not cover.

In Keras, non-trainable parameters are the ones that are not trained using gradient descent. This is controlled by the trainable flag on each layer, for example:

from keras.layers import Dense
from keras.models import Sequential

model = Sequential()
model.add(Dense(10, trainable=False, input_shape=(100,)))  # 100*10 weights + 10 biases, all frozen
model.summary()

This prints zero trainable parameters and 1,010 non-trainable parameters (100 × 10 weights + 10 biases).

_________________________________________________________________    
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 10)                1010      
=================================================================
Total params: 1,010
Trainable params: 0
Non-trainable params: 1,010
_________________________________________________________________

Now if you set the layer as trainable with model.layers[0].trainable = True then it prints:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 10)                1010      
=================================================================
Total params: 1,010
Trainable params: 1,010
Non-trainable params: 0
_________________________________________________________________

Now all parameters are trainable and there are zero non-trainable parameters. But there are also layers that have both trainable and non-trainable parameters; one example is the BatchNormalization layer, where the running mean and standard deviation of the activations are stored for use at test time. One example:

model.add(BatchNormalization())
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 10)                1010      
_________________________________________________________________
batch_normalization_1 (Batch (None, 10)                40        
=================================================================
Total params: 1,050
Trainable params: 1,030
Non-trainable params: 20
_________________________________________________________________

This specific BatchNormalization layer has 40 parameters in total (four vectors with one value per feature, 10 each): 20 trainable and 20 non-trainable. The 20 non-trainable parameters correspond to the computed mean and standard deviation of the activations that are used at test time; these parameters will never be trainable using gradient descent, and they are not affected by the trainable flag.
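
To see where those 20 come from, a small inspection sketch (not part of the original answer, continuing the model above; exact weight names can differ between Keras versions):

bn = model.layers[1]              # the BatchNormalization layer added above
for w in bn.weights:
    print(w.name, w.shape)

# Roughly expected output, one value per feature (10 features here):
# batch_normalization_1/gamma:0            (10,)   trainable
# batch_normalization_1/beta:0             (10,)   trainable
# batch_normalization_1/moving_mean:0      (10,)   non-trainable
# batch_normalization_1/moving_variance:0  (10,)   non-trainable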

Granuloma answered 19/10, 2018 at 14:30 Comment(3)
That's actually the right answer to this question, since the author asks why some parameters in the Inception model are always "non-trainable" even though you set all layers to trainable. The answer is the mean/variance params of the batchnorm layers.Gaberlunzie
Where does the number "20" come from? I ask because I have been using preprocessing.Normalization in tensorflow.keras.layers.experimental. Here, when I use one input ([None,1]), I get 3 non-trainable parameters in the summary. However, when I use nine inputs ([None,9]), I get 19 non-trainable parameters. See the complete example here: link Where do these 3 for the first case and 19 for the second come from? It seems like inputs*2+1, but I would like to know in more detail what they mean and how they are calculated. Thanks a lot.Concubine
@Concubine From my answer "The 20 non-trainable parameters correspond to the computed mean and standard deviation of the activations that is used during test time" This corresponds to the non-trainable parameters of the BatchNormalization layer, note that other layers compute those parameters differently.Granuloma

It is clear that if you freeze any layer of the network, all params on that frozen layer become non-trainable. On the other hand, if you design your network from scratch, it might have some non-trainable parameters too. For instance, the BatchNormalization layer has 4 groups of parameters, which are:

[gamma weights, beta weights, moving_mean, moving_variance]

The first two of them are trainable, but the last two are not. So the BatchNormalization layer is very probably the reason why your custom network has non-trainable parameters.
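
Tying this back to the original question about the Inception model, a hedged sketch (not part of the original answer; it assumes the pretrained weights shipped with keras.applications) of why such a model reports non-trainable parameters out of the box, and how freezing it for transfer learning adds more:

from keras.applications import InceptionV3

base = InceptionV3(weights='imagenet', include_top=False)
base.summary()        # already lists non-trainable params: the BatchNorm moving statistics

# Typical transfer-learning step: freeze the whole pretrained base.
base.trainable = False
# In a model built on top of `base` and then compiled, every weight of the base
# counts as non-trainable; only the weights of the new head are trained.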

Raby answered 7/8, 2018 at 14:46 Comment(0)
