I want to divide my dataset into a training set (70%) and a test set (30%). I used the unsupervised Resample filter for this. The steps I followed for the partition are as follows:
Select the unsupervised -> instance -> Resample filter from the WEKA Preprocess tab.
Set sampleSizePercent to 70 in the property window of the Resample filter.
Apply and save the dataset.
Undo after saving the dataset.
Set invertSelection to true and sampleSizePercent to 30 in the property window of the Resample filter.
Apply and save the dataset.
Now I am not sure whether I have actually partitioned my data into a training and a test set this way. Is this the right way to partition? I am skeptical because I got higher classification accuracy than I did with stratified filter partitioning.
By default, the Resample filter uses sampling with replacement. You probably ended up with duplicate instances, which will skew the results.
You need to disable replacement of instances by using the -no-replacement command-line option or by setting the noReplacement property to true in the GUI.
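To see why replacement matters, here is a minimal plain-Java sketch that does not use WEKA at all. It only mimics the idea on row indices: drawing with replacement can pick the same row several times, while drawing without replacement and then taking the complement gives two disjoint sets. The class and method names (ResampleSketch, sampleWithReplacement, sampleWithoutReplacement, complement) are made up for illustration and are not part of the WEKA API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration class, not WEKA code.
class ResampleSketch {

    // Draw k indices from 0..n-1 WITH replacement: the same index can be picked repeatedly.
    static List<Integer> sampleWithReplacement(int n, int k, long seed) {
        Random rng = new Random(seed);
        List<Integer> picked = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            picked.add(rng.nextInt(n));
        }
        return picked;
    }

    // Draw k distinct indices from 0..n-1 WITHOUT replacement (shuffle, keep the first k).
    static List<Integer> sampleWithoutReplacement(int n, int k, long seed) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            all.add(i);
        }
        Collections.shuffle(all, new Random(seed));
        return new ArrayList<>(all.subList(0, k));
    }

    // "Inverted selection": every index that was NOT picked.
    static List<Integer> complement(int n, List<Integer> picked) {
        Set<Integer> chosen = new HashSet<>(picked);
        List<Integer> rest = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (!chosen.contains(i)) {
                rest.add(i);
            }
        }
        return rest;
    }

    public static void main(String[] args) {
        int n = 100;
        int k = 70;
        List<Integer> withRepl = sampleWithReplacement(n, k, 1L);
        List<Integer> train = sampleWithoutReplacement(n, k, 1L);
        List<Integer> test = complement(n, train);
        System.out.println("distinct rows, with replacement: "
                + new HashSet<>(withRepl).size() + " of " + k);
        System.out.println("distinct rows, without replacement: "
                + new HashSet<>(train).size() + " of " + k);
        System.out.println("test rows (complement of train): " + test.size());
    }
}
```

With replacement, the 70 draws usually cover fewer than 70 distinct rows. If you then draw a second sample from the same data to use as a test set, it can share rows with the first, which is one plausible explanation for the inflated accuracy.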
An alternative approach would be using a MultiFilter setup with the following sub-filters:
Since you are removing data, use 30% for generating the training data and then invert the selection (-V) to obtain the test data.
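The remove-and-invert idea can be sketched in plain Java as follows. This is only an illustration under assumptions: the list of sub-filters is not shown above, so the sketch assumes a shuffle step followed by a remove-percentage step (WEKA ships Randomize and RemovePercentage filters that would fit this description). The class and method names here (RemovePercentageSketch, shuffled, removePercentage) are hypothetical and do not call WEKA.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class, not WEKA code.
class RemovePercentageSketch {

    // Shuffle a copy of the rows with a fixed seed so both splits see the same order.
    static <T> List<T> shuffled(List<T> rows, long seed) {
        List<T> copy = new ArrayList<>(rows);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    // Remove the first `percent` of rows; with invert = true, keep only those rows instead.
    static <T> List<T> removePercentage(List<T> rows, double percent, boolean invert) {
        int cut = (int) Math.round(rows.size() * percent / 100.0);
        return invert
                ? new ArrayList<>(rows.subList(0, cut))
                : new ArrayList<>(rows.subList(cut, rows.size()));
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(i);
        }
        // The same seed must be used for both passes, otherwise the splits can overlap.
        List<Integer> order = shuffled(rows, 42L);
        List<Integer> train = removePercentage(order, 30.0, false); // remaining 70%
        List<Integer> test = removePercentage(order, 30.0, true);   // the removed 30%
        System.out.println("train=" + train.size() + " test=" + test.size());
    }
}
```

The key design point is that both passes work on the same shuffled order, so removing 30% and then inverting that removal yields two sets that never share a row.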