Is train/test-Split in unsupervised learning necessary/useful?

Asked 28/7, 2015 at 10:14 Answered 18/12, 2020 at 1:8

In supervised learning I have the typical train/test split to train the algorithm, e.g. for regression or classification. Regarding unsupervised learning, my question is: is a train/test split necessary and useful? If yes, why?

Africanize answered 28/7, 2015 at 10:14 Comment(6)

Counter question: How do you test? – Bittner 28/7, 2015 at 12:21

@Bittner I am not sure what you mean by your question. The thing is: in supervised learning I have the real output and I can compare the prediction with it. But in unsupervised learning the algorithm works by finding e.g. similarities in the data. So how do I measure the performance? – Africanize 28/7, 2015 at 13:36

Yes, that's exactly my point. Testing is not straightforward since you don't know what's right and what's wrong. So the general principle of dividing into training and testing sets cannot be easily applied to unsupervised learning. – Bittner 28/7, 2015 at 13:39

Okay, thanks I think I got it – Africanize 28/7, 2015 at 17:31

@cel,@ChristophS Does the conclusion here imply "no need for testing data" in unsupervised learning? – Minny 17/5, 2016 at 7:35

@Gathide, well not in the traditional sense. Obviously you always have to show that your algorithm works (=does what you want it to do). But this is much harder since the standard metrics like accuracy, etc. do not work out of the box. – Bittner 17/5, 2016 at 7:40

This depends on the problem, the form of the dataset, and the class of unsupervised algorithm used to solve the particular problem.

Roughly: dimensionality reduction techniques are usually tested by calculating the reconstruction error, so there we can use a k-fold cross-validation procedure.
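As a rough sketch of the reconstruction-error idea: the snippet below fits PCA (chosen here only as one example of a dimensionality reduction technique) on the training folds and measures the mean squared reconstruction error on the held-out fold. The function names, the synthetic data and the use of plain NumPy are all assumptions made for illustration.

```python
import numpy as np

def pca_reconstruction_error(X_train, X_test, n_components):
    """Fit PCA on X_train via SVD; return the mean squared reconstruction error on X_test."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    W = Vt[:n_components]              # shape (n_components, n_features)
    Z = (X_test - mean) @ W.T          # project the held-out points
    X_hat = Z @ W + mean               # map back to the original space
    return float(np.mean((X_test - X_hat) ** 2))

def kfold_reconstruction_error(X, n_components, k=5, seed=0):
    """Average held-out reconstruction error over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(pca_reconstruction_error(X[train_idx], X[test_idx], n_components))
    return float(np.mean(errors))

# Synthetic data (assumption): 5 features driven by 2 latent factors plus small noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

cv_errors = {m: kfold_reconstruction_error(X, m) for m in (1, 2, 3)}
for m, err in cv_errors.items():
    print(f"components={m}: held-out MSE={err:.4f}")
```

Note that for PCA this held-out error can only stay the same or decrease as components are added, so it checks that the projection generalizes to unseen rows but does not by itself pick the number of components.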

For clustering algorithms, I would suggest statistical testing to assess performance. There is also a somewhat time-consuming trick: split the dataset, hand-label the test set with meaningful classes, and cross-validate.
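The hand-labeling trick could look roughly like this. It is a minimal sketch: the small k-means, the `purity` score and the synthetic two-blob data are assumptions for illustration, and the array `y` merely stands in for labels that would be assigned by hand.

```python
import numpy as np

def assign(X, centers):
    """Return the index of the nearest center for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns the fitted cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters where they are
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def purity(cluster_ids, true_labels):
    """Fraction of points that carry the majority label of their cluster."""
    total = 0
    for c in np.unique(cluster_ids):
        total += np.bincount(true_labels[cluster_ids == c]).max()
    return total / len(true_labels)

# Synthetic data (assumption): two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(10.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)          # stand-in for hand labels

idx = rng.permutation(len(X))
train_idx, test_idx = idx[:150], idx[150:]

centers = kmeans(X[train_idx], k=2)          # labels are never used for fitting
test_clusters = assign(X[test_idx], centers)
test_purity = purity(test_clusters, y[test_idx])
print(f"held-out purity: {test_purity:.2f}")
```

Only the held-out quarter needs labels here, which keeps the manual effort bounded; repeating the split with different seeds would give the cross-validated version.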

In any case, if an unsupervised algorithm is used on labeled (supervised) data, it is always good to cross-validate.

Overall: it is not necessary to split the data into train and test sets, but if we can do it, it is always better.

Here is an article that explains how cross-validation is a good tool for unsupervised learning: http://udini.proquest.com/view/cross-validation-for-unsupervised-pqid:1904931481/ The full text is available here: http://arxiv.org/pdf/0909.3052.pdf

https://www.researchgate.net/post/Which_are_the_methods_to_validate_an_unsupervised_machine_learning_algorithm

Fabliau answered 6/12, 2017 at 19:52 Comment(0)

Definitely it is useful.

A few points that I know about the "why":

When it comes to testing a model, it should always be evaluated on unseen data. So it is better to have split the data using train_test_split.

Second, the data should always be shuffled before it is split. Otherwise, if the rows are ordered (for example sorted by group), the model may be fitted on only n-1 of the n types of data, which may not give good results.
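To illustrate the shuffling point, here is a small standard-library sketch. The helper `split` is hypothetical and only mimics the shuffle-then-cut behaviour of a train/test split; the three groups "a", "b" and "c" are made-up data sorted by group.

```python
import random

def split(data, n_test, shuffle=True, seed=0):
    """Hold out the last n_test items as the test set, optionally shuffling first."""
    items = list(data)
    if shuffle:
        random.Random(seed).shuffle(items)
    return items[:-n_test], items[-n_test:]

# Made-up data (assumption): 120 rows sorted by group
data = ["a"] * 40 + ["b"] * 40 + ["c"] * 40

train_u, test_u = split(data, n_test=40, shuffle=False)
train_s, test_s = split(data, n_test=40, shuffle=True)

print("unshuffled train groups:", sorted(set(train_u)))
print("unshuffled test groups: ", sorted(set(test_u)))
print("shuffled train groups:  ", sorted(set(train_s)))
print("shuffled test groups:   ", sorted(set(test_s)))
```

Without shuffling, the training part never sees group "c" and the test part contains nothing else, so any model fitted on it would be judged only on a type of data it was never shown.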

Remunerate answered 18/12, 2020 at 1:8 Comment(0)
