
How do I format the Iris data set for input to the SVM-Light library?


I'm trying to use the SVM-Light library for training and classification on the Iris dataset, through a Python wrapper. I'm following the example on the wrapper's page, but I'm not sure how to format the Iris data correctly for input. A sample row in the Iris dataset looks like 5.0,3.6,1.4,0.2,Iris-setosa.


Solution

  • I don't know your library, but I highly recommend scikit-learn, a powerful general-purpose ML library. I assume you have a good reason to use svmlight; otherwise, libsvm- or liblinear-based SVMs are much easier to use through sklearn (fully automatic, not file-based, with automatic multi-class handling and so on).

    Here is a simple example. Keep in mind that, as far as I know, svmlight only supports binary targets; if you need multi-class learning, you would use sklearn's multiclass tools.

    Code to load and prepare Iris

    from sklearn.datasets import load_iris, dump_svmlight_file
    
    iris = load_iris()
    X = iris.data
    y = iris.target
    
    # Keep only the first two classes (binary problem)
    indices = y <= 1
    X = X[indices]
    y = y[indices]
    
    # Map targets to -1 / +1 (class 0 -> -1, class 1 stays +1)
    y[y == 0] = -1
    
    # svmlight feature indices start at 1, so write 1-based indices
    dump_svmlight_file(X, y, 'my_dataset', zero_based=False)
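
To see what the file format looks like without scikit-learn, the same conversion can be sketched by hand. SVM-light expects one example per line in the form <target> <index>:<value> ..., with feature indices starting at 1. This is only a minimal illustration: the function name csv_row_to_svmlight and the label_map argument are hypothetical and not part of SVM-light or any wrapper, and the label mapping simply mirrors the one used above (setosa -> -1, versicolor -> +1).

```python
def csv_row_to_svmlight(row, label_map):
    """Turn 'f1,f2,...,label' into an SVM-light line '<target> 1:f1 2:f2 ...'.

    label_map is a hypothetical helper argument mapping class names to -1 / +1.
    """
    # Everything before the last comma is a feature value, the last field is the label
    *values, label = row.strip().split(",")
    target = label_map[label]
    # SVM-light feature indices are 1-based and must be in increasing order
    features = " ".join(f"{i}:{v.strip()}" for i, v in enumerate(values, start=1))
    return f"{target:+d} {features}"


# Assumed mapping for a two-class subset of Iris
label_map = {"Iris-setosa": -1, "Iris-versicolor": 1}
line = csv_row_to_svmlight("5.0,3.6,1.4,0.2,Iris-setosa", label_map)
print(line)
```

Writing one such line per example to a text file should give a file that svm_learn can read in the same way as the one produced by dump_svmlight_file.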
    

    svmlight call

    ./svm_learn my_dataset my_output -v3
    Scanning examples...done
    Reading examples into memory...100..OK. (100 examples read)
    Setting default regularization parameter C=0.0199
    Optimizing...............done. (16 iterations)
    Optimization finished (0 misclassified, maxdiff=0.00057).
    Runtime in cpu-seconds: 0.00
    Number of SV: 32 (including 28 at upper bound)
    L1 loss: loss=4.89469
    Norm of weight vector: |w|=0.69732
    Norm of longest example vector: |x|=9.13674
    Estimated VCdim of classifier: VCdim<=31.50739
    Computing XiAlpha-estimates...done
    Runtime for XiAlpha-estimates in cpu-seconds: 0.00
    XiAlpha-estimate of the error: error<=30.00% (rho=1.00,depth=0)
    XiAlpha-estimate of the recall: recall=>70.00% (rho=1.00,depth=0)
    XiAlpha-estimate of the precision: precision=>70.00% (rho=1.00,depth=0)
    Number of kernel evaluations: 1291
    Writing model file...done
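
For the multi-class case mentioned above, one option that stays file-based is a hand-rolled one-vs-rest split: write one binary training file per class, with that class as +1 and all other classes as -1, and train one svmlight model on each. This is a sketch of that idea and not sklearn's multiclass tools; the function name write_one_vs_rest_files and the output file names are hypothetical, and the three rows are only illustrative input in the same shape as the sample row from the question.

```python
import os
import tempfile


def write_one_vs_rest_files(rows, out_dir):
    """Write one SVM-light training file per class; return {label: path}."""
    parsed = []
    for row in rows:
        # Last field is the class name, the rest are feature values
        *values, label = row.strip().split(",")
        features = " ".join(f"{i}:{v}" for i, v in enumerate(values, start=1))
        parsed.append((label, features))

    paths = {}
    for positive in sorted({label for label, _ in parsed}):
        # Hypothetical naming scheme: <class>_vs_rest.dat
        path = os.path.join(out_dir, f"{positive}_vs_rest.dat")
        with open(path, "w") as fh:
            for label, features in parsed:
                target = "+1" if label == positive else "-1"
                fh.write(f"{target} {features}\n")
        paths[positive] = path
    return paths


# Illustrative rows, one per class
rows = [
    "5.0,3.6,1.4,0.2,Iris-setosa",
    "7.0,3.2,4.7,1.4,Iris-versicolor",
    "6.3,3.3,6.0,2.5,Iris-virginica",
]
out_dir = tempfile.mkdtemp()
paths = write_one_vs_rest_files(rows, out_dir)
```

Each resulting file can then be passed to svm_learn like my_dataset above. At prediction time, the usual one-vs-rest rule is to pick the class whose model gives the largest decision value.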