machine-learning

Selection of training data for SVM


I know similar questions have been posed a few times here, but I have another point that is not clear to me.

I have 1098 images that I am trying to classify. As a general rule (from what I have read), the split for the data is

80/20 - Train/Test

of the 80% training data,

80/20 or 90/10 for 5-fold or 10-fold cross-validation.

Now the problem I am facing is that the original 80/20 split of the data is done randomly. If I repeat the random sampling of the data (into train/test sets) a hundred times and perform cross-validation on each sample, I find that the optimal SVM parameters change from run to run.

So basically, I am confused about how I should split my data: when I do it randomly, I don't get repeatable results from one sample to the next. What should I do?

I am using libsvm with an RBF kernel. As an example, sampling the data 30 times gives me the following.

The text is not formatted properly here, so I am attaching a link to a text file containing the information. The values in brackets are [C gamma].

http://goo.gl/jd0DNT

How do I choose the best training set, and how do I choose the best parameters? Is there an intelligent way of doing this?
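For reference, the experiment described above can be sketched with scikit-learn's SVC (a wrapper around libsvm) on synthetic data in place of the images; the parameter grid, number of repeats, and dataset size below are illustrative assumptions, not the values actually used:

```python
# Sketch: repeat the random 80/20 split several times and run a grid search
# over (C, gamma) each time; the selected parameters can vary between splits.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the 1098 images (assumption for the sketch)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]}

best_params = []
for seed in range(5):  # repeat the random 80/20 train/test split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_tr, y_tr)
    best_params.append((search.best_params_["C"], search.best_params_["gamma"]))

print(best_params)  # the chosen (C, gamma) often differs between splits
```

Running this and comparing the entries of `best_params` reproduces the instability in miniature: each random split can lead the grid search to a different optimum.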


Solution

  • A general solution for similar reproducibility problems with random functions is to fix the random seed, so that the same split is drawn on every run.

    That said, I think you are trying to outsmart cross-validation with your first split.