machine-learning classification sample-data

How can I know whether my training data is enough for machine learning?


For example, if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a method for measuring this?


Solution

  • It is not easy to know in advance how many samples you need to collect. However, you can follow these steps.

    For solving a typical ML problem:

    1. Build a dataset with a few samples. How many depends on the kind of problem you have; don't spend a lot of time on this at first.
    2. Split your dataset into training, cross-validation, and test sets, and build your model.
    3. Now that you've built the ML model, evaluate how good it is by calculating your test error.
    4. If your test error is worse than you expected (that is, too high), collect more data and repeat steps 1-3 until you reach a test error rate you are comfortable with.
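
    The loop in steps 1-4 can be sketched roughly as follows. This is only an illustration under several assumptions that are not in the original answer: `collect_samples` is a hypothetical stand-in for real data collection (it draws synthetic 2-D points), a toy nearest-centroid classifier replaces the SVM so the example needs nothing beyond the standard library, the 60/20/20 split is one common choice, and `target_error` is whatever error rate you decide you can live with.

```python
import random

def collect_samples(n, rng):
    """Hypothetical stand-in for data collection: two noisy 2-D classes."""
    samples = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are always present
        center = 2.0 if label == 1 else 0.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        samples.append((point, label))
    return samples

def fit_centroids(train):
    """Toy classifier (not an SVM): remember the mean point of each class."""
    centroids = {}
    for label in {y for _, y in train}:
        points = [x for x, y in train if y == label]
        centroids[label] = tuple(sum(coord) / len(points) for coord in zip(*points))
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))

def error_rate(centroids, data):
    """Fraction of samples in data that the model misclassifies."""
    return sum(predict(centroids, x) != y for x, y in data) / len(data)

def collect_until_good(target_error=0.12, batch=50, max_samples=2000, seed=0):
    """Grow the dataset batch by batch until the test error is acceptable."""
    rng = random.Random(seed)
    data, history = [], []
    while len(data) < max_samples:
        data.extend(collect_samples(batch, rng))   # step 1 (and "collect more" in step 4)
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_train = int(0.6 * len(shuffled))
        n_cv = int(0.2 * len(shuffled))
        train = shuffled[:n_train]                 # step 2: assumed 60/20/20 split
        cv = shuffled[n_train:n_train + n_cv]      # kept for tuning; this toy model has nothing to tune
        test = shuffled[n_train + n_cv:]
        model = fit_centroids(train)
        test_error = error_rate(model, test)       # step 3
        history.append((len(data), test_error))
        if test_error <= target_error:             # step 4: stop once the error is acceptable
            break
    return history

if __name__ == "__main__":
    for n_samples, test_error in collect_until_good():
        print(f"{n_samples:5d} samples -> test error {test_error:.3f}")
```

    The `max_samples` cap is there so the loop also ends when the target is never reached, which is itself a hint that more data alone may not be the fix.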

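    This method works only if your model is not suffering from "high bias" (underfitting). A high-bias model is too simple to capture the pattern in the data, so its error stays high no matter how many samples you add.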
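
    One way to check for high bias is to compare training error and cross-validation error as the training set grows: if both stay high and close to each other, more samples are unlikely to help. The sketch below is an assumption-laden toy, not part of the original answer: `xor_samples` is a hypothetical generator for an XOR-like pattern that a linear decision boundary cannot fit, and the nearest-centroid classifier (which draws such a boundary) is chosen on purpose so that it underfits.

```python
import random

def xor_samples(n, rng):
    """Hypothetical data: label is 1 when x and y share a sign (XOR-like pattern)."""
    samples = []
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        samples.append(((x, y), 1 if x * y > 0 else 0))
    return samples

def fit_centroids(train):
    """Toy linear classifier: remember the mean point of each class present."""
    centroids = {}
    for label in {y for _, y in train}:
        points = [x for x, y in train if y == label]
        centroids[label] = tuple(sum(coord) / len(points) for coord in zip(*points))
    return centroids

def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))

def error_rate(centroids, data):
    """Fraction of samples in data that the model misclassifies."""
    return sum(predict(centroids, x) != y for x, y in data) / len(data)

def bias_check(sizes=(50, 200, 800, 1600), n_cv=400, seed=0):
    """Return (training size, training error, cross-validation error) rows."""
    rng = random.Random(seed)
    cv = xor_samples(n_cv, rng)          # fixed cross-validation set
    rows = []
    for n in sizes:
        train = xor_samples(n, rng)      # a fresh, larger training set each time
        model = fit_centroids(train)
        rows.append((n, error_rate(model, train), error_rate(model, cv)))
    return rows

if __name__ == "__main__":
    for n, train_error, cv_error in bias_check():
        print(f"{n:5d} training samples: train error {train_error:.3f}, "
              f"cv error {cv_error:.3f}")
```

    If the training error itself does not come down as samples are added, as in this toy, the usual remedy is a more expressive model or better features rather than more data. For real estimators, scikit-learn's `learning_curve` function in `sklearn.model_selection` computes curves of this kind.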

    This video from Coursera's Machine Learning course explains it.