Why not optimize hyperparameters on train dataset?

Asked 5/7, 2016 at 21:44 Answered 30/9, 2020 at 12:39

Solved machine-learning neural-network training-data

When developing a neural net one typically partitions training data into Train, Test, and Holdout datasets (many people call these Train, Validation, and Test respectively. Same things, different names). Many people advise selecting hyperparameters based on performance in the Test dataset. My question is: why? Why not maximize performance of hyperparameters in the Train dataset, and stop training the hyperparameters when we detect overfitting via a drop in performance in the Test dataset? Since Train is typically larger than Test, would this not produce better results compared to training hyperparameters on the Test dataset?

UPDATE July 6 2016

Terminology change, to match comment below. Datasets are now termed Train, Validation, and Test in this post. I do not use the Test dataset for training. I am using a GA to optimize hyperparameters. At each iteration of the outer GA training process, the GA chooses a new hyperparameter set, trains on the Train dataset, and evaluates on the Validation and Test datasets. The GA adjusts the hyperparameters to maximize accuracy in the Train dataset. Network training within an iteration stops when network overfitting is detected (in the Validation dataset), and the outer GA training process stops when overfitting of the hyperparameters is detected (again in Validation). The result is hyperparameters pseudo-optimized for the Train dataset. The question is: why do many sources (e.g. https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf, Section B.1) recommend optimizing the hyperparameters on the Validation set, rather than the Train set? Quoting from Srivastava, Hinton, et al. (link above): "Hyperparameters were tuned on the validation set such that the best validation error was produced..."

Standish answered 5/7, 2016 at 21:44 Comment(0)

The reason is that developing a model always involves tuning its configuration: for example, choosing the number of layers or the size of the layers (called the hyper-parameters of the model, to distinguish them from the parameters, which are the network’s weights). You do this tuning by using as a feedback signal the performance of the model on the validation data. In essence, this tuning is a form of learning: a search for a good configuration in some parameter space. As a result, tuning the configuration of the model based on its performance on the validation set can quickly result in overfitting to the validation set, even though your model is never directly trained on it.

Central to this phenomenon is the notion of information leaks. Every time you tune a hyperparameter of your model based on the model’s performance on the validation set, some information about the validation data leaks into the model. If you do this only once, for one parameter, then very few bits of information will leak, and your validation set will remain reliable to evaluate the model. But if you repeat this many times—running one experiment, evaluating on the validation set, and modifying your model as a result—then you’ll leak an increasingly significant amount of information about the validation set into the model.

At the end of the day, you’ll end up with a model that performs artificially well on the validation data, because that’s what you optimized it for. You care about performance on completely new data, not the validation data, so you need to use a completely different, never-before-seen dataset to evaluate the model: the test dataset. Your model shouldn’t have had access to any information about the test set, even indirectly. If anything about the model has been tuned based on test set performance, then your measure of generalization will be flawed.
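The leak described above can be seen in a small simulation. This is only a sketch under stated assumptions: the "models" are hypothetical candidates that guess labels at random (so each has a true accuracy of 50%), the labels are random, and the function name `run_selection` and all sizes are made up for illustration. Picking the best of many candidates by validation accuracy still yields a validation score that looks good, while the same candidate's score on an untouched test set stays near chance.

```python
import random

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def run_selection(n_candidates=200, n_val=100, n_test=1000, seed=0):
    """Pick the best of many random-guess 'models' by validation accuracy.

    Returns (validation accuracy, test accuracy) of the selected candidate.
    """
    rng = random.Random(seed)
    # Random binary labels: there is no real signal to learn.
    val_labels = [rng.randint(0, 1) for _ in range(n_val)]
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]

    best_val_acc, best_test_acc = -1.0, None
    for _ in range(n_candidates):
        # Each candidate stands in for one hyperparameter configuration.
        # It guesses at random, so its true accuracy is 50%.
        val_pred = [rng.randint(0, 1) for _ in range(n_val)]
        test_pred = [rng.randint(0, 1) for _ in range(n_test)]
        val_acc = accuracy(val_pred, val_labels)
        if val_acc > best_val_acc:
            # Selection uses only the validation score.
            best_val_acc = val_acc
            best_test_acc = accuracy(test_pred, test_labels)
    return best_val_acc, best_test_acc

best_val, best_test = run_selection()
print(f"validation accuracy of selected candidate: {best_val:.2f}")
print(f"test accuracy of selected candidate:       {best_test:.2f}")
```

The gap between the two printed numbers comes purely from selecting on the validation set; it is the reason a separate test set is needed for the final estimate.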

Committee answered 21/1, 2018 at 6:52 Comment(2)

Thank you Prakhar. I agree about info leaks and not overfitting to Test or Validation. But I still am left with my original question: why not a) tune hyperparameters based on performance in the Train dataset, b) use Validation to detect hyperparameter overfitting, and c) use Test to estimate performance on new unseen data? – Standish 21/1, 2018 at 14:56

What you are saying is just re-framing my answer. Think of it this way: a.) What do you mean by tuning? When you start training the model on the training set, you fix the hyperparameters (number of layers, number of neurons in each layer, etc.) and basically learn the weights. b.) Then you evaluate this using the validation set. In short, "all" combinations of hyperparameters can learn a set of weights with low loss on the training set, but which hyperparameter values are best is decided using the validation set. – Committee 21/1, 2018 at 15:26

There are two things you are missing here. The first, minor, point is that the test set is never used for any training. That is what the validation set is for (the test set is only there to assess your final performance). The major misunderstanding is what it means "to use the validation set to fit hyperparameters". It means exactly what you describe: you train a model with given hyperparameters on the training set and use the validation set simply to check whether you are overfitting (that is, to estimate generalization). You do not really "train" on the validation set; you simply check your scores on this subset (which, as you noticed, is much smaller).
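A small illustration of why selecting on the training set does not work, as a sketch under stated assumptions: the data are synthetic 1-D points with noisy labels, the model is a plain k-nearest-neighbours classifier written here for illustration, and `k` plays the role of the hyperparameter. None of these choices come from the question. With `k = 1` every training point is its own nearest neighbour, so training accuracy is perfect regardless of how noisy the labels are, and a search that scores candidates on the training set will therefore pick `k = 1`.

```python
import random

def make_data(n, rng, noise=0.2):
    """Synthetic 1-D data: label is 1 if x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k is odd)."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(y for _, y in neighbours)
    return int(votes * 2 > k)

def knn_accuracy(train, data, k):
    """Accuracy on `data` of a k-NN classifier that uses `train` as its memory."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(200, rng)
val = make_data(100, rng)

candidate_ks = [1, 3, 5, 9, 15, 25]  # the hyperparameter values to try
train_acc = {k: knn_accuracy(train, train, k) for k in candidate_ks}
val_acc = {k: knn_accuracy(train, val, k) for k in candidate_ks}

# Scoring on the training set rewards memorisation; scoring on the
# validation set estimates how each k generalises to unseen points.
best_k_by_train = max(candidate_ks, key=lambda k: train_acc[k])
best_k_by_val = max(candidate_ks, key=lambda k: val_acc[k])

for k in candidate_ks:
    print(f"k={k:2d}  train={train_acc[k]:.2f}  val={val_acc[k]:.2f}")
print("selected by train accuracy:", best_k_by_train)
print("selected by validation accuracy:", best_k_by_val)
```

The same pattern applies to network size or regularization strength: the training score generally favours the most flexible configuration, so it cannot tell you which configuration generalizes best.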

You cannot "stop training hyperparameters", because this is not a continuous process. Hyperparameters are usually just "possible sets of values", and you simply have to test many of them. There is no valid way to define a direct training procedure linking the metric you actually care about (like accuracy) to the hyperparameters (like the size of the hidden layer in a NN, or even the C parameter in an SVM), because the functional link between the two is not differentiable, is highly non-convex, and is in general "ugly" to optimize. If you can define a nice optimization procedure in terms of a hyperparameter, then it is usually not called a hyperparameter but a parameter. The crucial distinction in this naming convention is what makes it hard to optimize directly: we call a parameter a hyperparameter when it cannot be directly optimized against, so you need a "meta method" (like simply testing on the validation set) to select it.

However, you can define a "nice" meta-optimization protocol for hyperparameters, but it will still use the validation set as an estimator. For example, Bayesian optimization of hyperparameters does exactly this: it tries to fit a function describing how well your model behaves across the space of hyperparameters, but in order to have any "training data" for this meta-method, you need a validation set to estimate that performance for any given set of hyperparameters (the input to your meta-method).

Vimineous answered 5/7, 2016 at 21:49 Comment(5)

Thanks! Please see my update above. We have a terminology difference: Me:[Train, Test, Holdout] = You:[Train, Validation, Test]. I used your terms in the update above. – Standish 6/7, 2016 at 19:23

Please do not significantly modify questions on SO. If you have another question, ask it separately. – Vimineous 6/7, 2016 at 19:39

Please see clarification above. – Standish 7/7, 2016 at 0:13

Again, this is a different question, asking about your very specific heuristic, which would require a long answer. – Vimineous 7/7, 2016 at 6:41

I am asking exactly the same question: Why does the literature recommend optimizing hyperparameters based on accuracy in the Validation dataset rather than the Training dataset? – Standish 7/7, 2016 at 17:55

Simple answer: we do.

In the case of a simple feedforward neural network, you have to choose, for example, the number of layers, the number of units per layer, and the regularization (as well as non-continuous choices such as the topology, if the network is not feedforward, and the loss function) at the start, and you would optimize over those.

So, in summary you optimize:

ordinary parameters only during training but not during validation
hyperparameters during training and during validation

It is very important not to touch the many ordinary parameters (weights and biases) during validation. That's because they have thousands of degrees of freedom, which means they can memorize the data you train them on. The model then does not generalize as well to new data (even when that new data comes from the same distribution). The hyperparameters usually have only a few degrees of freedom, and they typically control the rigidity of the model (regularization).

This holds true for other machine learning algorithms as well, such as decision trees and forests.

Godunov answered 30/9, 2020 at 12:39 Comment(0)
