Why do we have to normalize the input for an artificial neural network? [closed]

10

191

Why do we have to normalize the input for a neural network?

I understand that sometimes, when for example the input values are non-numerical, a certain transformation must be performed, but what about when we have numerical input? Why must the numbers be in a certain interval?

What will happen if the data is not normalized?

Constantino answered 12/1, 2011 at 22:16 Comment(1)
I'm voting to close this question because machine learning (ML) theory questions are off-topic on Stack Overflow; it's a gift-wrap candidate for Cross Validated. – Halsey
127

It's explained well here.

If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.
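
To make the "rescaling can be effectively undone" point concrete, here is a minimal numpy sketch (my own illustration, not taken from the FAQ): standardizing the input of a single linear unit and rescaling its weights and bias accordingly reproduces the original output exactly.

    import numpy as np

    # A single linear unit: y = w . x + b
    rng = np.random.default_rng(0)
    x = rng.normal(loc=50.0, scale=10.0, size=5)   # raw, unscaled input
    w = rng.normal(size=5)
    b = 0.7
    y_raw = w @ x + b

    # Standardize the input, then rescale the parameters to compensate.
    mu, sigma = x.mean(), x.std()
    x_std = (x - mu) / sigma
    w_std = w * sigma               # undoes the division by sigma
    b_std = b + mu * w.sum()        # undoes the subtraction of mu
    y_std = w_std @ x_std + b_std

    print(np.allclose(y_raw, y_std))   # True: same output, new parameters

This is the "in theory" half of the quote; the practical half (faster training, fewer bad local optima) is what the answers below demonstrate.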

Forked answered 12/1, 2011 at 22:34 Comment(3)
Hi, in MLPs, can not standardizing the features while using a constant learning rate cause over/undercompensation in backpropagation corrections for different dimensions? I'm wondering from the following post whether this is exclusive to CNNs, or whether MLPs might share this problem: stats.stackexchange.com/questions/185853/… – Dotdotage
Problem: the gradient descent optimization process may take a lot longer. Why? When features are of different scales (x1 = 0..1 and x2 = 0..1000), the error-function surface may become elongated, meaning different scales for different dimensions (w1, w2). But the learning rate is the SAME for all dimensions, so steps along the elongated dimension (w2) are very small until it reaches the local minimum. Problem: you cannot increase the LR, since it will skip the local minimum in the other dimension (w1). See the demo at youtube.com/watch?reload=9&v=UIp2CMI0748 – Bystreet
Here is a linear example, where things are very bad without scaling: https://mcmap.net/q/136843/-why-does-single-layer-perceptron-converge-so-slow-without-normalization-even-when-the-margin-is-large. Any idea why? – Chaworth
73

In neural networks, it is a good idea not just to normalize the data but also to scale it. This helps gradient descent approach the global minimum of the error surface faster. See the following pictures:

[Image: error surface before and after normalization]

[Image: error surface before and after scaling]

The pictures are taken from Geoffrey Hinton's Coursera course on neural networks.
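
Since the images do not survive here, the following is a minimal matplotlib sketch (my own reconstruction of the idea, not Hinton's actual figure): the squared-error contours over two weights are elongated when one feature has a much larger scale than the other, and become nearly circular once that feature is rescaled.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=200)                 # small-scale feature
    x2 = rng.normal(scale=100.0, size=200)    # large-scale feature
    y = 3.0 * x1 + 0.05 * x2 + rng.normal(scale=0.1, size=200)

    def mse_grid(f1, f2):
        # Mean squared error of y ~ w1*f1 + w2*f2 over a grid of weights.
        w1, w2 = np.meshgrid(np.linspace(-6, 6, 100), np.linspace(-6, 6, 100))
        preds = w1[..., None] * f1 + w2[..., None] * f2
        return w1, w2, ((preds - y) ** 2).mean(axis=-1)

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, f2, title in [(axes[0], x2, "before scaling: elongated valley"),
                          (axes[1], x2 / x2.std(), "after scaling: round bowl")]:
        w1, w2, loss = mse_grid(x1, f2)
        ax.contour(w1, w2, loss, levels=30)
        ax.set(title=title, xlabel="w1", ylabel="w2")
    plt.tight_layout()
    plt.show()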

Idaliaidalina answered 28/1, 2015 at 11:43 Comment(2)
It would have been nice of you to credit the author of the graphic you posted. The graphic was clearly taken from Geoffrey Hinton's Coursera course. – Jud
I found this video to be really helpful in explaining the diagram above, which on its own was not obvious to me. – Cockfight
23

Some inputs to a NN might not have a 'naturally defined' range of values. For example, the average value might be slowly but continuously increasing over time (for example, the number of records in a database).

In such a case, feeding this raw value into your network will not work very well. You will have trained your network on values from the lower part of the range, while the actual inputs will come from the higher part of this range (and quite possibly above the range that the network has learned to work with).

You should normalize this value. You could, for example, tell the network by how much the value has changed since the previous input. This increment usually stays, with high probability, within a specific range, which makes it a good input for the network.
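
A minimal sketch of the increment idea (my own illustration; the record counts are made up):

    import numpy as np

    # A raw input that keeps growing over time, e.g. the number of
    # records in a database, sampled once per day.
    record_counts = np.array([10_000, 10_150, 10_290, 10_430, 10_600, 10_720])

    # The raw values eventually drift above anything seen in training,
    # but the day-to-day increment stays in a stable range.
    increments = np.diff(record_counts)
    print(increments)            # [150 140 140 170 120]

    # Optionally rescale the increments using a bound they plausibly
    # stay under (an assumption; 200 here).
    print(increments / 200.0)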

Kinlaw answered 12/1, 2011 at 22:45 Comment(1)
Nice hint about normalizing upon the previous set of inputs. This relieves the user of defining an arbitrary normalization factor. However, I suspect the net will train more accurately if the normalization factor is a global constant applied to each input vector. – Grindstone
20

There are two reasons why we have to normalize input features before feeding them to a neural network:

Reason 1: If one feature in the dataset is big in scale compared to the others, this big-scaled feature becomes dominating, and as a result the predictions of the neural network will not be accurate.

Example: In the case of employee data, if we consider age and salary, age will be a two-digit number while salary can be 7 or 8 digits (1 million, etc.). In that case, salary will dominate the prediction of the neural network. But if we normalize those features, the values of both features will lie in the range from 0 to 1.

Reason 2: Forward propagation of a neural network involves the dot product of the weights with the input features. So, if the values are very high, the activations and gradients become very large as well, and the same is true during backpropagation. Consequently, the model converges slowly if the inputs are not normalized.

Example: If we perform image classification, the input values will be large, as the value of each pixel ranges from 0 to 255. Normalization is very important in this case.

Mentioned below are instances where normalization is very important (a short scaling sketch follows the list):

  1. K-Means
  2. K-Nearest-Neighbours
  3. Principal Component Analysis (PCA)
  4. Gradient Descent
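
A minimal sketch of the age/salary example with min-max scaling (plain numpy; the numbers are made up):

    import numpy as np

    # Employee data: age (two digits) next to salary (seven digits).
    X = np.array([[25, 1_200_000.0],
                  [38, 4_500_000.0],
                  [52, 9_800_000.0],
                  [29, 2_100_000.0]])

    # Min-max normalization: rescale each column independently to [0, 1].
    X_min, X_max = X.min(axis=0), X.max(axis=0)
    X_scaled = (X - X_min) / (X_max - X_min)
    print(X_scaled)   # both columns now lie in [0, 1]

    # For images, the analogous step is simply pixels / 255.0.
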
Selena answered 19/2, 2020 at 4:34 Comment(1)
How would you suggest normalizing something unbounded like salary? Salaries can be arbitrarily high. So if you normalize them simply using a mean and standard deviation, then the model you learn will over time get worse as the distribution shifts. In the extreme, what if you have an RL problem that involves money? How should a model for a company's decision handle (normalize) having no revenue at first, then a little revenue, then eventually orders of magnitude more revenue? – Polyhydric
12

When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect to some of the parameters. That leads to large oscillations in the search space, as you bounce between steep slopes. To compensate, you have to stabilize optimization with small learning rates.

Consider features x1 and x2, which range from 0 to 1 and from 0 to 1 million, respectively. It turns out that the ratio of the corresponding parameters (say, w1 and w2) will also be large.

Normalizing tends to make the loss function more symmetrical/spherical. Such surfaces are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.
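
A minimal sketch of that effect (my own illustration, under the assumptions above: x1 in [0, 1], x2 in [0, 1 million]): on the raw features, the largest stable learning rate is dictated by the steep w2 direction, so w1 barely moves; after standardization, one moderate learning rate fits both weights.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 1_000
    x1 = rng.uniform(0.0, 1.0, n)
    x2 = rng.uniform(0.0, 1_000_000.0, n)
    y = 2.0 * x1 + 3e-6 * x2 + rng.normal(scale=0.01, size=n)

    def gradient_descent(X, y, lr, steps=10_000):
        # Plain batch gradient descent on mean squared error.
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            w -= lr * (2.0 / len(y)) * X.T @ (X @ w - y)
        return w

    X_raw = np.column_stack([x1, x2])
    X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

    # Raw features: lr must be tiny to stay stable along the steep w2
    # direction, so w1 (true value 2.0) barely moves in 10,000 steps.
    print(gradient_descent(X_raw, y, lr=1e-13))
    # Standardized features: one moderate lr converges for both weights.
    print(gradient_descent(X_std, y, lr=0.1))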

Chamonix answered 23/3, 2020 at 19:32 Comment(1)
By Andrew Ng, from this video: youtube.com/watch?v=UIp2CMI0748 – Kaolack
9

Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure they are in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.

The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.

Amar answered 12/1, 2011 at 22:38 Comment(2)
If you take a usual activation function (ReLU or sigmoid), the domain is always the whole space R^n. So this cannot be the reason to normalise the data. – Longwinded
This also does not explain why images are normalized, since they already have a domain of 0-255. – Sulfide
2

I believe the answer is dependent on the scenario.

Consider a NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear, so that F(A * input) = A * output, you might choose either to leave the input/output unnormalised in their raw forms, or to normalise both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or in nearly any task that outputs a probability, where F(A * input) = 1 * output.

In practice, normalisation allows non-fittable networks to be fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical prior for the input and output.

What's more, a NN is often applied to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation and causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.

In a statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent the NN from learning this variation as a predictor (the NN does not see this variation, hence cannot use it).

Cheke answered 11/12, 2017 at 12:18 Comment(0)
1

The reason normalization is needed is that if you look at how an adaptive step proceeds in one place in the domain of the function, and then simply transport the problem to the equivalent of the same step translated by some large value in some direction in the domain, you get different results. It boils down to the question of adapting a linear piece to a data point: how much should the piece move without turning, and how much should it turn in response to that one training point? It makes no sense to have a different adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result.

I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected, and I can send you a paper if you write to wwarmstrong AT shaw.ca

Discouragement answered 28/10, 2014 at 2:29 Comment(0)
0

On a high level, if you observe where normalization/standardization is mostly used, you will notice that, any time there is a use of magnitude differences in the model-building process, it becomes necessary to standardize the inputs so as to ensure that important inputs with small magnitudes don't lose their significance midway through the model-building process.

example:

√((3-1)^2 + (1000-900)^2) ≈ √((1000-900)^2) = 100

Here, (3-1) contributes hardly anything to the result, and hence the input corresponding to these values is treated as futile by the model.

Consider the following:

  1. Clustering uses Euclidean or other distance measures.
  2. NNs use an optimization algorithm to minimise a cost function (e.g. MSE).

Both the distance measure (clustering) and the cost function (NNs) use magnitude differences in some way, and hence standardization ensures that magnitude differences don't dominate important input parameters and the algorithm works as expected.
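
A minimal numeric sketch of that point (my own, reusing the numbers above; the mean and standard deviation are hypothetical dataset statistics):

    import numpy as np

    # Two points: feature 1 is small-scale, feature 2 is large-scale.
    a = np.array([3.0, 1000.0])
    b = np.array([1.0, 900.0])
    print(np.linalg.norm(a - b))          # ~100.02: feature 1 barely matters

    # Standardize both features with (hypothetical) dataset statistics.
    mean = np.array([2.0, 950.0])
    std = np.array([1.0, 50.0])
    a_std, b_std = (a - mean) / std, (b - mean) / std
    print(np.linalg.norm(a_std - b_std))  # ~2.83: both features contribute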

Aquiver answered 21/1, 2021 at 13:39 Comment(0)
-14

Hidden layers are used in accordance with the complexity of our data. If we have input data which is linearly separable, then we need not use a hidden layer, e.g. the OR gate, but if we have non-linearly separable data, then we need to use a hidden layer, for example the XOR logical gate. The number of nodes taken at any layer depends upon the degree of cross-validation of our output.

Caporal answered 16/9, 2014 at 11:11 Comment(1)
The question was why inputs have to be normalized, not why neural nets have hidden layers. – Gelation
