Neural network immediately overfitting


I have a feed-forward neural network (FFNN) with 2 hidden layers for a regression task that overfits almost immediately (at epoch 2-5, depending on the number of hidden units). Setup: ReLU, Adam, MSE, the same number of hidden units per layer, tf.keras.

32 neurons: [figure not preserved]

128 neurons: [figure not preserved]

I will be tuning the number of hidden units, but to limit the search space I would like to know what the upper and lower bounds should be.

As far as I know, it is better to have a network that is too large and regularize it via L2 regularization or dropout than to lower the network's capacity -- because a larger network will have more local minima, but the actual loss value will be better.

Is there any point in trying to regularize (via e.g. dropout) a network that overfits from the get-go?

If so I suppose I could increase both bounds. If not I would lower them.

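from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

n_neurons = 32  # hidden units per layer (32 and 128 were tried)
model = Sequential()
model.add(Dense(n_neurons, activation='relu'))  # hidden layer 1
model.add(Dense(n_neurons, activation='relu'))  # hidden layer 2
model.add(Dense(1, activation='linear'))        # single regression output
model.compile(optimizer='adam', loss='mse')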

Faggoting answered 1/7, 2018 at 10:15 Comment(2)

How much training data do you have? Are your training and test (and validation) data from the same distribution? If they aren't, your network will learn something completely different. Add your model code to the question. – Harter 1/7, 2018 at 10:31

Done, thanks for your comment. I have 120k samples. All sets from the same distribution. Data augmentation is an option I am looking into. – Faggoting 1/7, 2018 at 10:38


Let me try to substantiate some of the ideas here, with references to the Deep Learning book by Ian Goodfellow et al., which is available for free online:

Chapter 7: Regularization. The most important point is data: one can and should avoid regularization if one has large amounts of data that closely approximate the true distribution. In your case, it looks like there might be a significant discrepancy between training and test data. You need to ensure the data is consistent.
Section 7.4: Dataset augmentation. With regard to data, Goodfellow discusses data augmentation and inducing regularization by injecting noise (most likely Gaussian), which mathematically has the same effect. This noise works well for regression tasks because it keeps the model from latching onto a single feature and overfitting.
Section 7.8: Early stopping. This is useful if you just want the model with the best test error. But again, this only works if your training data lets the model generalize to the test data. If the test error increases immediately, training would stop immediately.
Section 7.12: Dropout. Just applying dropout to a regression model doesn't necessarily help. In fact, "when extremely few labeled training examples are available, dropout is less effective". For classification, dropout forces the model not to rely on single features, but in regression all inputs might be required to compute a value rather than to classify.
Chapter 11: Practical methodology. It emphasises the use of baseline models to ensure that the training task is not trivial. If a simple linear regression can achieve similar behaviour, then you don't even have a training problem to begin with.
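
The noise-injection idea from Section 7.4 can be sketched in plain Python. This is only an illustration under assumptions: the function name add_gaussian_noise, the noise scale sigma=0.1 and the toy rows are made up here and do not come from the question. The point is that the noise is redrawn every epoch, so the network never sees exactly the same inputs twice, while the targets stay untouched.

```python
import random

def add_gaussian_noise(rows, sigma=0.1, seed=None):
    """Return a copy of rows with zero-mean Gaussian noise added to every feature.

    sigma is a hypothetical noise scale; it should be small relative to the
    spread of each (ideally standardized) feature.
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in rows]

# Toy inputs, made up for illustration: 3 samples with 2 features each.
X_train = [[0.5, -1.2], [1.0, 0.3], [-0.7, 2.1]]

# Draw a fresh noisy copy for every epoch; the targets are left as they are.
for epoch in range(3):
    X_noisy = add_gaussian_noise(X_train, sigma=0.1, seed=epoch)
    print(epoch, X_noisy[0])
```

In tf.keras a similar effect should be available through the GaussianNoise layer, which is only active during training.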

Bottom line is you can't just play with the model and hope for the best. Check the data, understand what is required and then apply the corresponding techniques. For more details read the book, it's very good. Your starting point should be a simple regression model, 1 layer, very few neurons and see what happens. Then incrementally experiment.
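
As a rough sketch of what "start simple" could look like, here are two baselines in plain Python: always predicting the training mean, and a one-feature least-squares line. The data is made up (y = 2x + 1) and the helper names mse and fit_line are assumptions for illustration only. If the network's test MSE is not clearly below these numbers on your real data, the extra capacity is not buying anything yet.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_line(x, y):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy data, made up for illustration: y = 2x + 1 exactly.
x_train, y_train = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
x_test, y_test = [4.0, 5.0], [9.0, 11.0]

# Baseline 1: always predict the mean of the training targets.
mean_pred = sum(y_train) / len(y_train)
mse_mean = mse(y_test, [mean_pred] * len(y_test))

# Baseline 2: a straight line fitted on the training set.
slope, intercept = fit_line(x_train, y_train)
mse_line = mse(y_test, [slope * xi + intercept for xi in x_test])

print("mean baseline MSE:  ", mse_mean)
print("linear baseline MSE:", mse_line)
```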

Eugene answered 1/7, 2018 at 11:7 Comment(1)

Excellent comment, thank you! Both the training and test set are drawn from the same distribution. I agree that I am severely lacking data though; I will give that noise-injection technique a go. I have a simple linear model that is performing as well as (or even better than) my NN, which seems odd. I would expect a NN to do better on a regression problem -- perhaps not a lot depending on the exact use case, but better nonetheless. – Faggoting 1/7, 2018 at 18:19


Hyperparameter tuning is generally the hardest step in ML. In general, we try different values at random, evaluate the model, and choose the set of values that gives the best performance.
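
A minimal sketch of that random search, restricted to the number of hidden units. Everything here is an assumption for illustration: the bounds 8 and 512, the log-uniform sampling, and especially fake_validation_mse, which only stands in for "train the network with this width and return the validation MSE".

```python
import math
import random

def sample_hidden_units(lower, upper, rng):
    """Sample an integer width log-uniformly from [lower, upper]."""
    log_value = rng.uniform(math.log(lower), math.log(upper))
    return min(upper, max(lower, round(math.exp(log_value))))

def random_search(evaluate, lower=8, upper=512, n_trials=10, seed=0):
    """Try n_trials random widths; return (best_width, best_score), lower is better."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        n_neurons = sample_hidden_units(lower, upper, rng)
        score = evaluate(n_neurons)
        if best is None or score < best[1]:
            best = (n_neurons, score)
    return best

def fake_validation_mse(n_neurons):
    """Hypothetical stand-in for training a model; pretends 64 units is ideal."""
    return (math.log2(n_neurons) - 6) ** 2

best_size, best_score = random_search(fake_validation_mse)
print(best_size, best_score)
```

Sampling on a log scale spends as many trials between 8 and 64 as between 64 and 512, which is usually what you want when the bounds span orders of magnitude.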

Getting back to your question: you have a high-variance problem (good in training, bad in testing).

There are eight things you can do, in order:

1. Make sure your test and training distributions are the same.
2. Make sure you shuffle and then split the data into two sets (test and train).
3. A good train:test split would be 105K:15K.
4. Use a deeper network with dropout/L2 regularization.
5. Increase your training set size.
6. Try early stopping.
7. Change your loss function.
8. Change the network architecture (switch to ConvNets, LSTMs, etc.).
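
For the early-stopping suggestion above, here is a sketch of the usual patience rule in plain Python. The validation curve is made up, and the function name early_stopping_epoch and the patience of 3 are assumptions. With a patience above zero, a single early uptick in validation loss does not end training on the spot; you stop only after several epochs without improvement and keep the weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through all recorded epochs.
    return len(val_losses) - 1, best_epoch

# Made-up validation curve: improves for three epochs, then rises.
val_losses = [1.00, 0.80, 0.75, 0.78, 0.82, 0.90, 0.95, 1.10]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # stops at epoch 5, best weights from epoch 2
```

In tf.keras the corresponding tool is the EarlyStopping callback (monitor, patience and restore_best_weights arguments), passed to model.fit through callbacks.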

Depending on your computation power and time, you can set a bound on the number of hidden units and hidden layers.
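
One cheap way to get a feel for such bounds is to count trainable parameters per width and compare that with the 120k samples mentioned in the comments. This is a sketch, not a rule: the input width n_inputs = 20 is an assumption (the question does not state it), and dense_param_count is a hypothetical helper.

```python
def dense_param_count(n_inputs, n_neurons, n_hidden_layers=2, n_outputs=1):
    """Weights plus biases of a fully connected net with equal-width hidden layers."""
    sizes = [n_inputs] + [n_neurons] * n_hidden_layers + [n_outputs]
    # Each layer has fan_in * fan_out weights and fan_out biases.
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))

n_samples = 120_000  # from the question's comments
n_inputs = 20        # assumption: the real input width is not given

for n_neurons in (32, 128, 512):
    params = dense_param_count(n_inputs, n_neurons)
    print(n_neurons, params, round(n_samples / params, 1))
```

The second hidden layer grows quadratically with the width, so doubling n_neurons roughly quadruples both the parameter count and the training cost.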

"because a larger network will have more local minima."

Nope, this is not quite true. In practice, as the number of dimensions increases, the chance of getting stuck in a local minimum decreases, so we usually ignore the problem of local minima. It is very rare: at a local minimum the derivative must be zero along every dimension and the loss must also curve upward along every one of them, which is highly unlikely in a typical model. Most points with zero gradient are saddle points instead.

One more thing: I noticed you are using a linear unit for the last layer. I suggest you go for ReLU instead, assuming your targets are never negative. In many regression problems we do not need negative values, and in that case it can reduce test/train error.

Take this:

In MSE: 1/2 * (y_true - y_prediction)^2

y_prediction can be a negative value, so the whole MSE term may blow up to large values as y_prediction gets highly negative or highly positive.

Using a ReLU for the last layer makes sure that y_prediction is non-negative, hence a lower error is expected.
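
To see what this argument amounts to numerically, here is a small check using the 1/2 * (y_true - y_prediction)^2 expression from above. The targets and raw last-layer outputs are made up for illustration. For a non-negative target, clamping a negative output to zero never increases the error; for a negative target, the error can never drop below 1/2 * y_true^2, so the suggestion only makes sense when the targets are non-negative.

```python
def relu(z):
    """Rectified linear unit: clamps negative values to zero."""
    return max(0.0, z)

def squared_error(y_true, y_pred):
    """Per-sample loss 1/2 * (y_true - y_pred)^2."""
    return 0.5 * (y_true - y_pred) ** 2

# Hypothetical raw outputs of the last layer, before any activation.
raw_outputs = [-3.0, -0.5, 2.0]

for y_true in (1.0, -1.0):
    for z in raw_outputs:
        linear_err = squared_error(y_true, z)       # linear output unit
        relu_err = squared_error(y_true, relu(z))   # ReLU output unit
        print(f"target={y_true:+.1f} raw={z:+.1f} "
              f"linear={linear_err:.3f} relu={relu_err:.3f}")
```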

Lifeblood answered 1/7, 2018 at 10:39 Comment(3)

Thanks for your comment. 1, 2, 3 are done. 4: it's already overfitting, so wouldn't that make things worse? Do you mean more layers, but fewer neurons per layer? – Faggoting 1/7, 2018 at 18:6

Yup, you got it right. There are some functions that deep networks learn easily but that shallow nets cannot, even with many neurons. Build a deep net with 3-6 layers. Apply one of the regularizers. If even this doesn't help, you may need to change the loss function or the whole network architecture. – Lifeblood 1/7, 2018 at 18:20

Also, remove the linear unit from the last layer and use ReLU. Read the new answer. – Lifeblood 1/7, 2018 at 18:41

