What is validation data used for in a Keras Sequential model?
My question is simple, what is the validation data passed to model.fit in a Sequential model used for?

And, does it affect how the model is trained (normally a validation set is used, for example, to choose hyper-parameters in a model, but I think this does not happen here)?

I am talking about the validation set that can be passed like this:

from keras.models import Sequential

# Create model
model = Sequential()
# Add layers
model.add(...)

# Train model (use 10% of training set as validation set)
history = model.fit(X_train, Y_train, validation_split=0.1)

# Train model (use validation data as validation set)
history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test))

I investigated a bit, and I saw that keras.models.Sequential.fit calls keras.models.training.fit, which creates variables like val_acc and val_loss (which can be accessed from callbacks). keras.models.training.fit also calls keras.models.training._fit_loop, which adds the validation data to callbacks.validation_data, and which calls keras.models.training._test_loop to loop over the validation data in batches using the model's self.test_function. The result of this function is used to fill the values of the logs, which are the values accessible from the callbacks.
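
As a quick way to see where those values surface, here is a minimal sketch (toy data and a hypothetical two-layer model, not from the original question) that reads val_loss from the logs dict inside a callback and from the History object returned by fit:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import Callback

class ValLogger(Callback):
    def on_epoch_end(self, epoch, logs=None):
        # logs contains val_loss (and val_acc / val_accuracy, depending on the Keras version)
        print(f"epoch {epoch}: val_loss={logs['val_loss']:.4f}")

X_train = np.random.rand(100, 10)
Y_train = np.random.randint(0, 2, size=(100, 1))

model = Sequential([Dense(8, activation='relu', input_shape=(10,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, Y_train, epochs=3, validation_split=0.1,
                    callbacks=[ValLogger()], verbose=0)
print(history.history.keys())  # includes 'val_loss' and 'val_acc' / 'val_accuracy'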

After seeing all this, I feel that the validation set passed to model.fit is not used to validate anything during training, and its only use is to get feedback on how the trained model will perform in every epoch on a completely independent set. Therefore, it would be okay to use the same validation and test set, right?

Could anyone confirm if the validation set in model.fit has any other goal besides being read from the callbacks?

Kemeny answered 19/9, 2017 at 19:28 Comment(0)

If you want to build a solid model you have to follow that specific protocol of splitting your data into three sets: One for training, one for validation and one for final evaluation, which is the test set.

The idea is that you train on your training data and tune your model with the results of metrics (accuracy, loss etc) that you get from your validation set.

Your model doesn't "see" your validation set and isn't in any way trained on it, but you, as the architect and master of the hyperparameters, tune the model according to this data. Therefore it indirectly influences your model because it directly influences your design decisions: you nudge your model to work well on the validation data, and that can introduce a bias.

That is exactly why you only evaluate your model's final score on data that neither your model nor you yourself have used: the third chunk of data, your test set.

Only this procedure gives you an unbiased view of your model's quality and of its ability to generalize what it has learned to totally unseen data.
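
To make the three-set protocol concrete, here is a minimal sketch (hypothetical toy data and architecture, not anything from the answer above): the validation split guides development, and the test set is scored exactly once at the end:

import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(1000, 20)
Y = np.random.randint(0, 2, size=(1000, 1))

# 60% train, 20% validation, 20% test
X_train, X_tmp, Y_train, Y_tmp = train_test_split(X, Y, test_size=0.4, random_state=0)
X_val, X_test, Y_val, Y_test = train_test_split(X_tmp, Y_tmp, test_size=0.5, random_state=0)

model = Sequential([Dense(16, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Validation metrics guide your design decisions during development...
model.fit(X_train, Y_train, epochs=5, validation_data=(X_val, Y_val), verbose=0)

# ...while the test set is touched only once, for the final unbiased score.
test_loss, test_acc = model.evaluate(X_test, Y_test, verbose=0)
print(f"test accuracy: {test_acc:.3f}")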

Marbut answered 19/9, 2017 at 19:33 Comment(6)
Ok, I had already figured it out, but it's exactly like you say. Basically, because we can use the validation accuracy and loss to learn something about the model, we need a different test set to validate what we learned. For example, if I have 3 models, I train them in the same training data, I get a validation accuracy for each of them that I use to pick the "best model", and then I test my chosen model in a different test set so I can get the accuracy of the model. If I used the validation set for this, the result would be biased.Kemeny
what is the same validation workaround when we want to use train_on_batch() for a large dataset in keras?Lillian
when using "model.fit(X_train, Y_train, validation_data=(X_test, Y_test))" does one still have to use ".predict()" or ".evaluate()" (with X_test, Y_test or another set)?Eucalyptol
@Eucalyptol yes. The "another" set is called the test set. This is necessary for unbiased estimation. It is always good (or at least does no harm) if you can manage to do it. You may have a look at my answer for more details.Capsule
@Guido Mocha a validation set needs to reflect the real-world data, that is, data coming from the same practical domain where the model will be used. If you are confident the validation set covers that, you are good to go, no matter whether you use mini-batch, batch, or stochastic gradient descent. Also, the validation set need not be enormously large; if you can ensure it covers almost all the cases you are interested in, it can be smaller. Shuffling all the data before splitting into train and validation sets helps get a uniform distribution.Capsule
Thanks for the clear explanation. I had a senior data scientist tell me to my face today that failing to set aside a 3rd group of test data will result in over-fitting making my results invalid. Based on your explanation here, potentially biased is not invalid, there is a difference. I badly needed this sanity check, and I further conclude that if I commit to not further alter the hyperparameters if and when I do finally see fresh 'test' data, then I'm not even potentially biased?Ventriloquism

This YouTube video explains what a validation set is, why it's helpful, and how to implement a validation set in Keras: Create a validation set in Keras

With a validation set, you're essentially taking a fraction of your samples out of your training set, or creating an entirely new set altogether, and holding out the samples in this set from training.

During each epoch, the model will be trained on samples in the training set but will NOT be trained on samples in the validation set. Instead, the model will only be evaluated on the samples in the validation set.

The purpose of doing this is for you to be able to judge how well your model can generalize, meaning how well it is able to predict on data that it has not seen during training.

Having a validation set also provides great insight into whether your model is overfitting or not. This can be interpreted by comparing the acc and loss from your training samples to the val_acc and val_loss from your validation samples. For example, if your acc is high, but your val_acc is lagging way behind, this is a good indication that your model is overfitting.
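
As a rough sketch of that check (toy data and a hypothetical model, so the numbers themselves are meaningless), you can compare the training and validation curves stored in the History object:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X_train = np.random.rand(200, 10)
Y_train = np.random.randint(0, 2, size=(200, 1))

model = Sequential([Dense(64, activation='relu', input_shape=(10,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, Y_train, epochs=10, validation_split=0.1, verbose=0)

# Older Keras logs 'acc'/'val_acc'; newer versions use 'accuracy'/'val_accuracy'.
acc = history.history.get('accuracy', history.history.get('acc'))
val_acc = history.history.get('val_accuracy', history.history.get('val_acc'))

for epoch, (a, va) in enumerate(zip(acc, val_acc), start=1):
    flag = '  <-- training far ahead of validation, possible overfitting' if a - va > 0.1 else ''
    print(f"epoch {epoch:2d}: acc={a:.3f}  val_acc={va:.3f}{flag}")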

Partitive answered 20/9, 2017 at 0:12 Comment(1)
what is the same validation workaround when we want to use train_on_batch() for a large dataset in keras?Lillian

I think an overall discussion on train-set, validation-set and test-set will help:

  • Train-Set: The data-set on which the model is trained. This is the only data-set on which the weights are updated during back-propagation.
  • Validation-Set (Development Set): The data-set on which we want our model to perform well. During the training process we tune hyper-parameters so that the model performs well on the dev-set, but we never use the dev-set for training itself: it is only used to observe performance so we can decide how to change the hyper-parameters, after which we continue training on the train-set. The dev-set thus serves as a stand-in for unknown data (and the hyper-parameters act like tuning knobs that change the way training is done); no back-propagation occurs on the dev-set, hence no direct learning from it.
  • Test-Set: Used only for unbiased estimation. Like the dev-set, no training occurs on the test-set. The only difference from the validation-set (dev-set) is that we don't even tune the hyper-parameters here; we just see how well our model has learnt to generalize. Although the dev-set is not directly used for training, by repeatedly tuning hyper-parameters against it the model indirectly learns its patterns, so the dev-set is no longer truly unknown to the model. Hence we need another fresh set that is not used even for hyper-parameter tuning, and we call this fresh copy the test set, since by definition the test set should be "unknown" to the model. If we cannot manage a fresh, unseen test set like this, the dev-set is sometimes called the test set.

Summarizing:

  • Train-Set: Used for training.
  • Validation-Set / Dev-Set: Used for tuning hyper-parameters.
  • Test-Set: Used for unbiased estimation.

A few practical points:

  • For training you may collect data from anywhere. It is okay if not all of your collected data comes from the same domain where the model will be used. For example, if the real domain is photos taken with a smartphone camera, it is not necessary to build the data-set from smartphone photos only; you may include data from the internet, from high-end or low-end cameras, or from anywhere.
  • The dev-set and test-set, however, must reflect the real domain data where the model will be practically used, and should contain all possible cases for a better estimate.
  • The dev-set and test-set need not be that large. Just ensure they cover almost all the cases or situations that may occur in real data. After ensuring that, give as much of the remaining data as you can to the train-set.
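
Putting the dev-set workflow above into code, here is a minimal sketch (toy data and a made-up hyper-parameter grid): the hyper-parameter is chosen by dev-set accuracy, and the test set is evaluated only once at the very end:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X_train = np.random.rand(600, 20); Y_train = np.random.randint(0, 2, (600, 1))
X_dev   = np.random.rand(200, 20); Y_dev   = np.random.randint(0, 2, (200, 1))
X_test  = np.random.rand(200, 20); Y_test  = np.random.randint(0, 2, (200, 1))

def build_model(hidden_units):
    m = Sequential([Dense(hidden_units, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
    m.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return m

best_acc, best_units, best_model = 0.0, None, None
for units in [8, 32, 128]:                                   # the hyper-parameter being tuned
    model = build_model(units)
    model.fit(X_train, Y_train, epochs=5, verbose=0)
    _, dev_acc = model.evaluate(X_dev, Y_dev, verbose=0)     # dev-set guides the choice
    if dev_acc > best_acc:
        best_acc, best_units, best_model = dev_acc, units, model

_, test_acc = best_model.evaluate(X_test, Y_test, verbose=0)  # test set: unbiased, used once
print(f"chosen hidden_units={best_units}, test accuracy={test_acc:.3f}")
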
Capsule answered 16/3, 2020 at 3:38 Comment(6)
Best answer. I also used to think that hyperparameters are the same as parameters; your answer made me google the difference. For people like me: datascience.stackexchange.com/questions/14187/…Dongola
Is this hyperparameter tuning done automatically, or do I have to do it manually? Weights are updated automatically with backpropagation, and I'm wondering whether the hyperparameter tuning is done by another algorithm.Rockefeller
@VansFannel, hyper-parameters are the variables we use to control how the learning process is done. If it were done automatically we would have no control over the training process. If you don't want to tune them you can always choose the default values. In most cases that is okay, but sometimes, especially for new cases where you have no prior idea of how the data behaves, it is recommended to tune them manually.Capsule
@Capsule So, I have to do it manually: check the validation loss and accuracy, and try other parameters until I get a better result, isn't it?Rockefeller
@Rockefeller yes, it is recommended if you have no prior idea which values to choose or if you are not sure how the model will behave. You can initially choose the default values for these hyper-parameters; if that meets your need you are done. Otherwise gradually change them and observe the behavior. Don't change more than one hyper-parameter at a time, otherwise you won't be able to tell which one is responsible for a certain change, so change them one by one. You can use a learning-rate scheduler to gradually decrease the learning rate. You may also try a grid search over hyper-parameters.Capsule
@Rockefeller grid search can help in this case. It takes the possible values for the hyper-parameters from you and tries them all, and finally reports the most promising configuration for that particular model trained on that particular data. It is available in Scikit-Learn; see here for more details: scikit-learn.org/stable/modules/generated/…Capsule

So basically, on the validation set the model will try to predict, but it won't update its weights (which means that it won't learn from those samples), so you get a clear idea of how well your model can find patterns in the training data and apply them to new data.
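
A quick sanity-check sketch of that claim (toy model and data, purely illustrative): evaluating on validation data is a forward pass only, so the weights are unchanged afterwards:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X_val = np.random.rand(50, 10)
Y_val = np.random.randint(0, 2, (50, 1))

model = Sequential([Dense(4, activation='relu', input_shape=(10,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy')

before = [w.copy() for w in model.get_weights()]
model.evaluate(X_val, Y_val, verbose=0)           # forward pass only, no back-propagation
after = model.get_weights()

print(all(np.array_equal(b, a) for b, a in zip(before, after)))  # True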

Determinate answered 15/4, 2021 at 16:38 Comment(0)
