For example: if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a way to measure this?
It is not easy to know in advance how many samples you need to collect. However, you can follow these steps for a typical ML problem:
- Build a dataset with a few samples. How many depends on the kind of problem you have; don't spend a lot of time on this at first.
- Split your dataset into training, cross-validation, and test sets, and build your model.
- Now that you've built the ML model, evaluate how good it is by calculating your test error.
- If your test error is worse than you expected, collect new data and repeat the first three steps until you reach a test error rate you are comfortable with.
This method will work as long as your model is not suffering from "high bias" (underfitting); in that case, more data will not lower the error much.
This video from Coursera's Machine Learning course explains it.
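The collect, split, evaluate, and repeat loop described in this answer can be sketched roughly as follows. This is only an illustration under stated assumptions: `collect_samples` is a hypothetical stand-in for real data collection (it draws two noisy 1-D classes), the "model" is a toy threshold classifier rather than an SVM, and the 60/20/20 split, batch size, and target error are arbitrary choices.

```python
import random

def collect_samples(n, rng):
    """Hypothetical stand-in for real data collection: two noisy 1-D classes."""
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 0 is centred at 0.0, class 1 at 2.0 (assumed toy distribution).
        samples.append((rng.gauss(2.0 * label, 1.0), label))
    return samples

def train(train_set):
    """Toy model: a threshold halfway between the two class means."""
    means = []
    for c in (0, 1):
        xs = [x for x, y in train_set if y == c]
        means.append(sum(xs) / len(xs) if xs else 2.0 * c)
    return (means[0] + means[1]) / 2.0

def error_rate(threshold, data):
    """Fraction of samples where 'x > threshold' disagrees with the label."""
    wrong = sum((x > threshold) != (y == 1) for x, y in data)
    return wrong / len(data)

def collect_until_good_enough(target_error=0.2, batch=50, max_rounds=20, seed=0):
    rng = random.Random(seed)
    data, history = [], []
    for _ in range(max_rounds):
        data.extend(collect_samples(batch, rng))      # collect (more) data
        shuffled = data[:]
        rng.shuffle(shuffled)
        n = len(shuffled)
        n_train, n_val = n * 6 // 10, n * 2 // 10     # assumed 60/20/20 split
        train_set = shuffled[:n_train]
        val_set = shuffled[n_train:n_train + n_val]   # for tuning; unused by this toy model
        test_set = shuffled[n_train + n_val:]
        model = train(train_set)
        row = (n,
               error_rate(model, train_set),
               error_rate(model, val_set),
               error_rate(model, test_set))
        history.append(row)
        # If the training error is also high, the model is probably
        # underfitting (high bias) and more data is unlikely to help.
        if row[3] <= target_error:
            break
    return history

history = collect_until_good_enough()
for n, train_err, val_err, test_err in history:
    print(f"n={n:4d}  train={train_err:.2f}  val={val_err:.2f}  test={test_err:.2f}")
```

With a real problem, `collect_samples` and `train` would be replaced by your own data source and classifier; the loop structure is the only part taken from the steps above.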
Unfortunately, there is no simple method for this.
The rule of thumb is: the bigger, the better. In practice, though, you have to gather a sufficient amount of data. By sufficient I mean covering as large a part of the modeled space as you consider acceptable.
Also, quantity is not everything. The quality of the samples is very important too; for example, training samples should not contain duplicates.
Personally, when I don't have all possible training data at once, I gather some training data and train a classifier. If the classifier quality is not acceptable, I gather more data, and so on.
Here is some research on estimating training set quality.
This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10X the number of your model parameters. For instance, while training a logistic regression with N features, try to start with 10N training instances.
For an empirical derivation of the "rule of 10", see https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956
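As a quick illustration of the arithmetic, here is a minimal sketch. The function name `rule_of_10_estimate` and the example feature count are made up for this example, and the result is only a starting point, not a guarantee.

```python
def rule_of_10_estimate(n_parameters, factor=10):
    """Suggested starting number of training instances for a model
    with n_parameters parameters, using the 'rule of 10' heuristic."""
    return factor * n_parameters

# Logistic regression with N = 20 features, counted as in the rule above.
n_features = 20
print(rule_of_10_estimate(n_features))
```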
The practical way is to plot curves showing the relationship between training set size on a logarithmic scale and generalization (test) error. This plot can show how much additional data would be needed.