In supervised learning I have the typical train/test split to train and evaluate the model, e.g. for regression or classification. Regarding unsupervised learning, my question is: is a train/test split necessary and useful? If yes, why?
Well, this depends on the problem, the form of the dataset, and the class of unsupervised algorithm used to solve that particular problem.
Roughly: dimensionality reduction techniques are usually evaluated by their reconstruction error, so there we can use a k-fold cross-validation procedure.
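As a minimal sketch of that idea, assuming scikit-learn's PCA as the dimensionality reduction technique (the answer does not name one), the held-out reconstruction error can be estimated like this. The helper name `cv_reconstruction_error` and the synthetic data are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold


def cv_reconstruction_error(X, n_components, n_splits=5, seed=0):
    """Mean held-out reconstruction MSE of PCA over k folds (illustrative helper)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in kf.split(X):
        # Fit the projection on the training fold only
        pca = PCA(n_components=n_components).fit(X[train_idx])
        # Project the held-out fold down and back up, then measure the error
        X_test = X[test_idx]
        X_hat = pca.inverse_transform(pca.transform(X_test))
        errors.append(np.mean((X_test - X_hat) ** 2))
    return float(np.mean(errors))


# Synthetic data (an assumption): 3 latent factors embedded in 10 dimensions plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

for k in (1, 2, 3, 4):
    print(k, cv_reconstruction_error(X, k))
```

Note that this simple held-out error keeps shrinking as components are added, so it is better suited to checking that the fit generalizes to unseen rows than to choosing the number of components by itself.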
For clustering algorithms, though, I would suggest statistical testing to assess performance. There is also a slightly time-consuming trick: split the dataset, hand-label the test set with meaningful classes, and cross-validate.
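A rough sketch of the hand-labeling trick, assuming k-means as the clustering algorithm and the adjusted Rand index as the agreement score (neither is named in the answer). The labels returned by `make_blobs` stand in for the hand labels here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

# Synthetic, well-separated data; y plays the role of the hand-assigned classes
X, y = make_blobs(
    n_samples=600,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0,
    random_state=0,
)

# Only the test labels are used; the training labels are discarded
X_train, X_test, _, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)

# Fit the clustering on unlabeled training data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Assign held-out points to clusters and compare with the labels.
# The adjusted Rand index ignores how the cluster ids are numbered.
test_clusters = km.predict(X_test)
ari = adjusted_rand_score(y_test, test_clusters)
print(ari)
```

A score near 1 means the clusters found on the training data match the labeled classes on the held-out data. A score near 0 means the agreement is about what random assignment would give.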
In any case, if an unsupervised algorithm is used on labeled (supervised) data, it is always good to cross-validate.
Overall: it is not necessary to split the data into train and test sets, but if we can do it, it is always better.
Here is an article which explains how cross-validation is a good tool for unsupervised learning: http://udini.proquest.com/view/cross-validation-for-unsupervised-pqid:1904931481/ The full text is available here: http://arxiv.org/pdf/0909.3052.pdf
Definitely it is useful.
Here are a few points that I know about the "why".
When it comes to testing a model, it should always be evaluated on unseen data. So it is better to have split the data using train_test_split.
The second point is that the data should always be shuffled before splitting. Otherwise the split follows the stored order of the rows, and the model may be fit on an unrepresentative part of the data, which may not give good results.
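To illustrate both points, here is a small sketch using scikit-learn's `train_test_split`. The data is hypothetical: 100 rows stored in sorted order, with a made-up `groups` array marking which part of the data each row comes from.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data stored in sorted order: 80 rows of group 0, then 20 rows of group 1
groups = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# Without shuffling, the test set is simply the last 20 rows
_, _, g_train_ordered, g_test_ordered = train_test_split(
    X, groups, test_size=0.2, shuffle=False
)

# With shuffling, rows are drawn from the whole dataset
_, _, g_train_shuffled, g_test_shuffled = train_test_split(
    X, groups, test_size=0.2, shuffle=True, random_state=0
)

print("ordered train groups:", np.unique(g_train_ordered))
print("ordered test groups: ", np.unique(g_test_ordered))
print("shuffled test groups:", np.unique(g_test_shuffled))
```

In the unshuffled split the model is fit only on group 0 and tested only on group 1, so the test score says little about how well the model describes the data as a whole.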