machine-learning, data-science, data-science-experience

Can the training dataset and testing dataset be separate instead of split?


Can we use separate datasets for training and testing? I'm working on a project to pick effective test cases. As part of this, I analyse the bug database, identify the triggers that have yielded bugs, and arrive at a model, so the bug database forms my training set. The test cases that I have written are my test data, and I have to supply them to the model so it can say whether each test case is effective or not. So in this case, instead of splitting one dataset into training and test data, I have two different datasets: training data (from the bug database) and test data (test cases written manually). Is this doable using machine learning? Kindly let me know.


Solution

  • Yes, the training dataset and testing dataset can be separate files. In real-world cases, the testing data is generally a separate dataset that the model has not seen.

    The main principle to follow is that the model must not see the test data during training: some data has to be kept separate (a hold-out set) for testing. That data can be provided separately in different files or databases, or generated by splitting a single dataset. This is done to avoid data leakage (when testing data is somehow used for training the model).
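
As a rough illustration of the setup in the question, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the trigger names, the records, the `fit` and `predict` helpers, and the toy "bug rate per trigger" model, which only stands in for whatever model you actually train. The point it shows is that the model is fitted on the bug-database records alone and is then applied to a separate set of test cases that it never saw.

```python
from collections import Counter

# Hypothetical training data: bug-database records as (triggers, found_bug).
bug_records = [
    ({"null_input", "boundary"}, 1),
    ({"boundary"}, 1),
    ({"happy_path"}, 0),
    ({"null_input"}, 1),
    ({"happy_path", "logging"}, 0),
]

# Hypothetical separate test data: hand-written test cases.
# These are never passed to fit(), so nothing leaks into training.
test_cases = {
    "tc_empty_string": {"null_input", "boundary"},
    "tc_default_login": {"happy_path"},
    "tc_new_feature": {"unseen_trigger"},
}


def fit(records):
    """Estimate, per trigger, the fraction of training records that yielded a bug."""
    seen, bugs = Counter(), Counter()
    for triggers, found_bug in records:
        for trigger in triggers:
            seen[trigger] += 1
            bugs[trigger] += found_bug
    return {trigger: bugs[trigger] / seen[trigger] for trigger in seen}


def predict(model, triggers, threshold=0.5):
    """Call a test case effective if its known triggers average at or above threshold."""
    known = [model[t] for t in triggers if t in model]
    if not known:
        return False  # none of its triggers appeared in the training data
    return sum(known) / len(known) >= threshold


model = fit(bug_records)  # training uses the bug database only
predictions = {name: predict(model, trig) for name, trig in test_cases.items()}

for name, effective in predictions.items():
    print(name, "effective" if effective else "not effective")
```

Note that the test cases have to be described with the same features as the training records (here, the same trigger vocabulary). A trigger that never appears in the bug database, like `unseen_trigger` above, gives the model nothing to go on.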