Load svmlight format error


When I try to use the svmlight Python package with data I have already converted to svmlight format, I get an error. It should be pretty basic, but I don't understand what's happening. Here's the code:

import svmlight
training_data = open('thedata', "w")
model=svmlight.learn(training_data, type='classification', verbosity=0)

I've also tried:

training_data = numpy.load('thedata')

and

training_data = __import__('thedata')

Solution

  • One obvious problem is that you are truncating your data file when you open it, because you are specifying write mode "w". That leaves no data to read; use "r" (the default) to open a file for reading.
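
The truncation is easy to confirm without svmlight. Here is a minimal sketch, assuming a throwaway temporary file; the two sample lines written to it are made up for the demonstration:

```python
import os
import tempfile

# Create a throwaway file with two sample lines (contents are made up).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "thedata")
with open(path, "w") as f:
    f.write("+1 1:0.5 3:1.2\n-1 2:0.7\n")

size_before = os.path.getsize(path)

# Opening an existing file in "w" mode empties it immediately,
# even if nothing is ever written through the handle.
open(path, "w").close()
size_after = os.path.getsize(path)

print(size_before, size_after)
```

The second number printed is 0: the file still exists, but its contents are gone.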

    In any case, you don't need to read the file like that if your data file is like the one in the svmlight example. That file is a Python module, so you import it. This should work:

    import svmlight
    from data import train0 as training_data    # assuming your data file is named data.py
    # or you could use __import__()
    #training_data = __import__('data').train0
    
    model = svmlight.learn(training_data, type='classification', verbosity=0)
    

    You might want to compare your data against that of the example.

    Edit after data file format clarified

    The input file needs to be parsed into a list of tuples like this:

    [(target, [(feature_1, value_1), (feature_2, value_2), ... (feature_n, value_n)]),
     (target, [(feature_1, value_1), (feature_2, value_2), ... (feature_n, value_n)]),
     ...
    ]
    

    The svmlight package does not appear to support reading from a file in the SVM file format, and there aren't any parsing functions, so it will have to be implemented in Python. SVM files look like this:

    <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
    

    so here is a parser that converts from the file format to that required by the svmlight package:

    def svm_parse(filename):
        """Yield (target, [(feature, value), ...]) for each data line of the file."""
    
        def _convert(t):
            """Convert feature and value to appropriate types"""
            return (int(t[0]), float(t[1]))
    
        with open(filename) as f:
            for line in f:
                line = line.split('#')[0].strip()  # remove any comment
                if not line:
                    continue  # skip blank and comment-only lines
                data = line.split()
                target = float(data[0])
                features = [_convert(feature.split(':')) for feature in data[1:]]
                yield (target, features)
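
To make the conversion concrete, here are the same steps applied by hand to a single line. The sample line is invented for illustration; it is not taken from your data.

```python
# One made-up line in the SVM file format, with a trailing comment.
line = "+1 1:0.5 3:1.2 # first example"

body = line.split('#')[0].strip()   # drop the trailing comment
fields = body.split()               # ['+1', '1:0.5', '3:1.2']
target = float(fields[0])
features = [(int(k), float(v)) for k, v in (f.split(':') for f in fields[1:])]

print((target, features))  # (1.0, [(1, 0.5), (3, 1.2)])
```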
    

    You can use the parser like this:

    import svmlight
    
    training_data = list(svm_parse('thedata'))
    model = svmlight.learn(training_data, type='classification', verbosity=0)
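
If training still fails, it may be worth sanity-checking the parsed data before handing it to svmlight. The SVM-light file format documentation asks for feature numbers that are positive integers in increasing order within each line. Whether the Python wrapper enforces this is an assumption I have not verified, so treat the following as a rough sketch. The helper name `check_training_data` is hypothetical and not part of the svmlight package.

```python
def check_training_data(data):
    """Return a list of problems found in list-of-tuples training data.

    Assumes the SVM-light convention that feature numbers are positive
    integers in strictly increasing order within each example.
    """
    problems = []
    for i, (target, features) in enumerate(data):
        previous = 0
        for number, value in features:
            if not isinstance(number, int) or number < 1:
                problems.append("example %d: bad feature number %r" % (i, number))
                break
            if number <= previous:
                problems.append("example %d: feature %d out of order" % (i, number))
                break
            previous = number
    return problems


# Made-up sample: the second example lists feature 2 after feature 4.
sample = [
    (1.0, [(1, 0.5), (3, 1.2)]),
    (-1.0, [(4, 0.1), (2, 0.7)]),
]
print(check_training_data(sample))  # ['example 1: feature 2 out of order']
```

An empty list means nothing suspicious was found under those assumptions.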