
Weka unsupervised Resample filter for data partitioning


I want to divide my dataset into a training set (70%) and a test set (30%). I used the unsupervised Resample filter for this. The steps I followed for the partition are as follows:

  1. Select the unsupervised -> instance -> Resample filter in the WEKA Preprocess tab.

  2. Set sampleSizePercent to 70 in the property window of the Resample filter.

  3. Apply the filter and save the dataset.

  4. Undo after saving the dataset.

  5. Set invertSelection to true and sampleSizePercent to 30 in the property window of the Resample filter.

  6. Apply the filter and save the dataset.

Now I am not sure whether I have actually partitioned my data into a training set and a test set this way. Is this the right way of partitioning? I am skeptical because I got higher classification accuracy than I did with stratified filter partitioning.


Solution

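  • By default, the Resample filter uses sampling with replacement. You probably ended up with duplicate instances, and because the two files were drawn independently, the test set can contain instances that are also in the training set. That overlap inflates the measured accuracy.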

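    You need to disable replacement of instances, either with the -no-replacement command-line option or by setting the noReplacement property to true in the GUI. The invertSelection option (-V) only takes effect when sampling without replacement. With replacement disabled, keep the same seed and sampleSizePercent (70) for both runs; with invertSelection set to true the filter then outputs the remaining 30%.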

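    An alternative approach would be using a MultiFilter setup with the following sub-filters:

      • Randomize, to shuffle the data
      • RemovePercentage, to extract the training or test set

    Since RemovePercentage removes data, set the percentage to 30 to generate the training data (the remaining 70%), and then invert the selection (-V) with the same percentage to obtain the test data (the removed 30%).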
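
The difference between the two sampling modes can be illustrated with a minimal plain-Java sketch. It does not call Weka at all; the class name SplitSketch and its methods are hypothetical names made up for this illustration, and the row count of 1000 is an arbitrary assumption. It draws row indices with replacement (as Resample does by default) and compares that with a shuffle-then-cut split, which is roughly what a randomize-then-remove-a-percentage setup does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Illustration only: plain Java on row indices, no Weka classes involved.
class SplitSketch {

    // Draw k row indices from 0..n-1 WITH replacement, so repeats are possible.
    static List<Integer> sampleWithReplacement(int n, int k, Random rng) {
        List<Integer> picked = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            picked.add(rng.nextInt(n));
        }
        return picked;
    }

    // Shuffle 0..n-1 once, then cut the list in two.
    // Returns [train, test]; the two parts cannot share a row.
    static List<List<Integer>> shuffleSplit(int n, double trainFraction, Random rng) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, rng);
        int cut = (int) Math.round(n * trainFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, cut)));
        parts.add(new ArrayList<>(indices.subList(cut, n)));
        return parts;
    }

    // Number of distinct row indices that appear in both lists.
    static int overlap(List<Integer> a, List<Integer> b) {
        Set<Integer> shared = new HashSet<>(a);
        shared.retainAll(new HashSet<>(b));
        return shared.size();
    }

    public static void main(String[] args) {
        int n = 1000; // assumed dataset size, for illustration

        // Two independent draws with replacement, like two separate Resample runs.
        Random rng = new Random(42);
        List<Integer> train = sampleWithReplacement(n, 700, rng);
        List<Integer> test = sampleWithReplacement(n, 300, rng);
        System.out.println("with replacement: distinct training rows = "
                + new HashSet<>(train).size()
                + ", rows shared with test = " + overlap(train, test));

        // One shuffle, one cut: a proper 70/30 partition.
        List<List<Integer>> parts = shuffleSplit(n, 0.7, new Random(42));
        System.out.println("shuffle and cut: train = " + parts.get(0).size()
                + ", test = " + parts.get(1).size()
                + ", rows shared = " + overlap(parts.get(0), parts.get(1)));
    }
}
```

In the first case the training sample contains repeated rows and shares rows with the test sample, which is the kind of leakage that can make accuracy look better than it is. In the second case every row lands in exactly one of the two sets.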