How can I know whether my training data is enough for machine learning?

For example: if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a method to measure this?

Bobo answered 15/7, 2014 at 8:3 Comment(0)

It is not easy to know how many samples you need to collect. However, you can follow these steps:

For solving a typical ML problem:

  1. Build a dataset with a few samples. How many? That depends on the kind of problem you have; don't spend too much time on this now.
  2. Split your dataset into training, cross-validation, and test sets, and build your model.
  3. Now that you've built the model, evaluate how good it is: calculate your test error.
  4. If your test error is higher than you can accept, collect more data and repeat steps 1-3 until you reach a test error rate you are comfortable with (see the code sketch below).

This method will work only if your model is not suffering from "high bias" (underfitting); if it is, adding more data will not help.

This video from Coursera's Machine Learning course explains it.
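
For concreteness, here is a minimal sketch of that loop with scikit-learn. load_more_samples() is a hypothetical stand-in for your own data collection, and the acceptable error threshold is an assumption you would set yourself.

```python
# A minimal sketch of the collect / split / train / evaluate loop above.
# load_more_samples() is a hypothetical helper standing in for your own
# data collection; ACCEPTABLE_TEST_ERROR is an assumed target.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

ACCEPTABLE_TEST_ERROR = 0.05

X, y = load_more_samples()  # step 1: start with a small dataset
while True:
    # step 2: split into train / cross-validation / test (60/20/20)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    # step 3: fit the model (the CV split would be used for tuning) and compute test error
    model = SVC().fit(X_train, y_train)
    test_error = 1.0 - accuracy_score(y_test, model.predict(X_test))

    # step 4: stop when the error is acceptable, otherwise collect more data and repeat
    if test_error <= ACCEPTABLE_TEST_ERROR:
        break
    X_new, y_new = load_more_samples()
    X, y = np.concatenate([X, X_new]), np.concatenate([y, y_new])
```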

Kacey answered 15/7, 2014 at 8:24 Comment(0)

Unfortunately, there is no simple method for this.

The rule of thumb is: the bigger, the better. In practice, though, you have to gather a sufficient amount of data, where "sufficient" means covering as large a part of the modeled space as you consider acceptable.

Also, amount is not everything. The quality of the samples is very important too; for example, the training samples should not contain duplicates.
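
If the data lives in a pandas DataFrame, one minimal way to drop exact duplicates is sketched below (the file name is hypothetical; adapt it to however you store your data).

```python
# A minimal sketch of removing exact duplicate training samples with pandas.
# "training_data.csv" is a hypothetical file name.
import pandas as pd

df = pd.read_csv("training_data.csv")
before = len(df)
df = df.drop_duplicates()  # drops rows that are identical in every column
print(f"Removed {before - len(df)} duplicates; {len(df)} samples remain")
```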

Personally, when I don't have all possible training data at once, I gather some training data and then train a classifier. Then, if the classifier's quality is not acceptable, I gather more data, and so on.

Here is some research on estimating training set quality.

Coadjutrix answered 15/7, 2014 at 8:25 Comment(0)

This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10X the number of your model parameters. For instance, when training a logistic regression with N features, try to start with 10N training instances.

For an empirical derivation of the "rule of 10", see https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956
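
As a rough illustration of applying the rule of 10, here is a small sketch on synthetic data that counts the fitted parameters of a scikit-learn logistic regression and compares them with the sample count (the numbers are illustrative only).

```python
# A toy check of the "rule of 10": count the fitted parameters of a
# logistic regression and compare with the training set size.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

n_params = model.coef_.size + model.intercept_.size  # weights + bias term
print(f"{n_params} parameters -> rule of 10 suggests ~{10 * n_params} samples; "
      f"trained on {len(X)}")
```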

Glossotomy answered 5/11, 2016 at 19:39 Comment(1)
I'm using logistic regression to classify review comments. After I normalize and vectorize the data, I have an array where each column is a unique word. When you say "parameters", "features", and "training instances" above, how do they relate to the number of review comments versus the number of unique words when I apply the 10X rule? – Iaria

A practical way is to plot learning curves showing the relationship between training set size (on a logarithmic scale) and generalization (test) error. Extrapolating such a plot indicates roughly how much additional data would be needed.

[Figure 5.4 from "Deep Learning" by Ian Goodfellow et al.: error versus training set size.]
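
A minimal sketch of producing such a curve with scikit-learn's learning_curve helper, on a synthetic dataset purely for illustration:

```python
# A minimal sketch of plotting error against training set size on a log scale,
# using scikit-learn's learning_curve on synthetic data (illustrative only).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    SVC(), X, y, train_sizes=np.logspace(-2, 0, 8), cv=5)

plt.semilogx(sizes, 1 - train_scores.mean(axis=1), "o-", label="training error")
plt.semilogx(sizes, 1 - test_scores.mean(axis=1), "o-", label="generalization (test) error")
plt.xlabel("training set size (log scale)")
plt.ylabel("error")
plt.legend()
plt.show()
```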

Kinship answered 22/8, 2024 at 4:46 Comment(0)
