python csv scikit-learn arff

How to import CSV or ARFF data into scikit-learn?


I have two datasets, in CSV and ARFF format, which I have been using with classification models in Weka. I was wondering whether these formats can be used in scikit-learn to try other classification methods in Python.

This is what my dataset looks like (the header row followed by one data row):

    ASSAY_CHEMBLID...MDEN.23...MA,TARGET_TYPE...No...MA,TARGET_TYPE...apol...MA,TARGET_TYPE...ATSm5...MA,TARGET_TYPE...SCH.6...MA,TARGET_TYPE...SPC.6...MA,TARGET_TYPE...SP.3...MA,TARGET_TYPE...MDEN.12...MA,TARGET_TYPE...MDEN.22...MA,TARGET_TYPE...MLogP...MA,TARGET_TYPE...R...MA,TARGET_TYPE...G...MA,TARGET_TYPE...I...MA,ORGANISM...No...MA,ORGANISM...C2SP1...MA,ORGANISM...VC.6...MA,ORGANISM...ECCEN...MA,ORGANISM...khs.aasC...MA,ORGANISM...MDEC.12...MA,ORGANISM...MDEC.13...MA,ORGANISM...MDEC.23...MA,ORGANISM...MDEC.33...MA,ORGANISM...MDEO.11...MA,ORGANISM...MDEN.22...MA,ORGANISM...topoShape...MA,ORGANISM...WPATH...MA,ORGANISM...P...MA,Lij
    0.202796,0.426972,0.117596,0.143818,0.072542,0.158172,0.136301,0.007245,0.016986,0.488281,0.300438,0.541931,0.644161,0.048149,0.02002,0,0.503415,0.153457,0.288099,0.186024,0.216833,0.184642,0,0.011592,0.00089,0,0.209406,0

where Lij is my class label (0 or 1). I was wondering whether the data needs to be transformed with numpy first.


Solution

  • To read ARFF files, you'll need to install liac-arff (see the project page for installation details). Once it is installed, use the following code to read the ARFF file:

    import arff  # provided by the liac-arff package
    import numpy as np
    # read ARFF data
    with open("file.arff") as f:
        # arff.load returns a dictionary; the rows are stored
        # as a list of lists under the key "data"
        dataDictionary = arff.load(f)
    # no explicit f.close() is needed: the with block closes the file
    # extract data and convert to numpy array
    arffData = np.array(dataDictionary['data'])
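
As an alternative to liac-arff (this is not part of the original answer), SciPy also ships an ARFF reader, scipy.io.arff.loadarff. The sketch below is a minimal example that assumes every attribute, including the class column, is numeric. The in-memory file and the attribute names f1 and f2 are made up for illustration, and SciPy's reader may not support every ARFF feature.

```python
import io

import numpy as np
from scipy.io import arff

# Hypothetical ARFF content; only the class name "Lij" comes from the question.
ARFF_TEXT = """@relation demo
@attribute f1 numeric
@attribute f2 numeric
@attribute Lij numeric
@data
0.20,0.43,0
0.12,0.14,1
0.07,0.16,0
"""

# loadarff accepts a file name or a file-like object and returns
# a NumPy record array plus a metadata object describing the attributes.
records, meta = arff.loadarff(io.StringIO(ARFF_TEXT))

# Attribute names in file order.
names = meta.names()

# Convert the record array into a plain 2-D float array.
# This only works here because every attribute is numeric (an assumption).
arff_data = np.array(records.tolist(), dtype=float)

print(names)
print(arff_data.shape)
```

If the class attribute is declared as nominal, for example {0,1}, loadarff returns byte strings for that column, so it would need to be converted separately before building a float array.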
    

    There are several ways to read CSV data; I found the easiest to be the read_csv function from the pandas library. See the pandas documentation for installation details. The code for reading a CSV file is below:

    # read CSV data and convert the DataFrame to a numpy array
    import pandas as pd
    csvData = pd.read_csv("filename.csv", sep=",").to_numpy()
    

    In either case, you'll have a numpy array with your data. Since the last column holds the class labels (also called the target or ground truth), you'll need to split the data into a feature array X and a target vector y, e.g.:

    X = arffData[:, :-1]
    y = arffData[:, -1]
    

    where X contains every column of arffData except the last, and y contains the last column. For the CSV data, use csvData in place of arffData.
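
When the data comes from a CSV file with a header row, the same split can also be done by column name instead of by position. The following is a minimal sketch, assuming the label column is called Lij as in the question; the feature names f1 to f3 and the in-memory file are placeholders for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; only "Lij" comes from the question.
CSV_TEXT = """f1,f2,f3,Lij
0.202796,0.426972,0.117596,0
0.143818,0.072542,0.158172,1
0.136301,0.007245,0.016986,0
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Target vector: the label column, selected by name.
y = df["Lij"].to_numpy()

# Feature matrix: every column except the label.
X = df.drop(columns="Lij").to_numpy()

print(X.shape, y.shape)
```

Selecting by name keeps the split correct even if the label column is not the last one in the file.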

    Now you can use any supervised learning binary classifier from scikit-learn.
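
As a rough illustration of that last step, the sketch below trains one possible binary classifier on a held-out split. The choice of LogisticRegression, the split ratio, and the synthetic X and y are all assumptions made so the example runs on its own; with real data, X and y would come from the arrays built above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 5 features in [0, 1), binary labels.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Hold out 25% of the rows for evaluation, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the classifier on the training rows only.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out rows.
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Any other scikit-learn classifier with fit and predict methods could be swapped in for LogisticRegression without changing the rest of the code.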