How to import csv or arff to scikit?

I have two datasets, in CSV and ARFF format, which I have been using in classification models in Weka. I was wondering whether these formats can be used in scikit to try other classification methods in Python.

This is what my dataset looks like:

ASSAY_CHEMBLID...MDEN.23...MA,TARGET_TYPE...No...MA,TARGET_TYPE...apol...MA,TARGET_TYPE...ATSm5...MA,TARGET_TYPE...SCH.6...MA,TARGET_TYPE...SPC.6...MA,TARGET_TYPE...SP.3...MA,TARGET_TYPE...MDEN.12...MA,TARGET_TYPE...MDEN.22...MA,TARGET_TYPE...MLogP...MA,TARGET_TYPE...R...MA,TARGET_TYPE...G...MA,TARGET_TYPE...I...MA,ORGANISM...No...MA,ORGANISM...C2SP1...MA,ORGANISM...VC.6...MA,ORGANISM...ECCEN...MA,ORGANISM...khs.aasC...MA,ORGANISM...MDEC.12...MA,ORGANISM...MDEC.13...MA,ORGANISM...MDEC.23...MA,ORGANISM...MDEC.33...MA,ORGANISM...MDEO.11...MA,ORGANISM...MDEN.22...MA,ORGANISM...topoShape...MA,ORGANISM...WPATH...MA,ORGANISM...P...MA,Lij
0.202796,0.426972,0.117596,0.143818,0.072542,0.158172,0.136301,0.007245,0.016986,0.488281,0.300438,0.541931,0.644161,0.048149,0.02002,0,0.503415,0.153457,0.288099,0.186024,0.216833,0.184642,0,0.011592,0.00089,0,0.209406,0

where Lij is my class label (0 or 1). I was wondering whether a prior transformation with numpy is needed.

Solution

To read ARFF files, you'll need to install liac-arff; see its documentation for details. Once you have it installed, use the following code to read the ARFF file:

import arff
import numpy as np

# read arff data
with open("file.arff") as f:
    # load reads the arff db as a dictionary with
    # the data as a list of lists at key "data"
    # (the with block closes the file, so no f.close() is needed)
    dataDictionary = arff.load(f)

# extract data and convert to numpy array
arffData = np.array(dataDictionary['data'])

There are several ways to read CSV data; I found the easiest is the read_csv function from the pandas library. See the pandas documentation for installation details. The code for reading a CSV data file is below:

# read csv data
import pandas as pd
csvData = pd.read_csv("filename.csv",sep=',').values
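
To see what read_csv does with a header row like the one in the question, here is a minimal sketch on a tiny in-memory table. The column names feat_a and feat_b and all values are made up for illustration; only Lij comes from the question.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for filename.csv.
# feat_a and feat_b are hypothetical column names; Lij is the label column.
csv_text = """feat_a,feat_b,Lij
0.20,0.43,0
0.12,0.14,1
0.07,0.16,0
"""

# By default the first row is used as the column names, not as data.
df = pd.read_csv(io.StringIO(csv_text))

# .values returns a plain NumPy array that no longer contains the header row.
csv_data = df.values

print(df.columns.tolist())
print(csv_data.shape, csv_data.dtype)
```

Since the header is consumed as column names, the resulting array holds only the numeric rows.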

In either case, you'll have a numpy array with your data. Since the last column holds the class labels (also called the target or ground truth), you'll need to split the data into a feature array X and a target vector y, e.g.

X = arffData[:, :-1]
y = arffData[:, -1]

where X contains all the data in arffData except for the last column, and y contains the last column of arffData. The same slicing applies to csvData.
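
One thing to check after the split is the array dtype. If the class attribute is declared as nominal in the ARFF file (as Weka classifiers expect), the labels will likely come back as strings, and a mixed list of floats and strings does not convert to a numeric array on its own. The sketch below uses a hand-written list of rows as a stand-in for dataDictionary['data'] (an assumption about its shape, with made-up values) and casts both parts explicitly.

```python
import numpy as np

# Hand-written stand-in for dataDictionary['data'] (assumed shape):
# two numeric features followed by a nominal class label stored as a string.
rows = [
    [0.20, 0.43, "0"],
    [0.12, 0.14, "1"],
    [0.07, 0.16, "0"],
]

# dtype=object keeps each cell as-is instead of turning everything into strings.
raw = np.array(rows, dtype=object)

X = raw[:, :-1].astype(float)  # feature matrix as floats
y = raw[:, -1].astype(int)     # class labels as integers 0/1

print(X.dtype, X.shape)
print(y.dtype, y.shape)
```

If every attribute in the file is numeric, the casts are harmless and the result is the same as the plain slicing.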

Now you can use any supervised binary classifier from scikit-learn.
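
As a rough illustration of that last step, here is a minimal sketch that fits one such classifier. LogisticRegression is just one possible choice, and the X and y below are randomly generated stand-ins with 27 feature columns (an assumption matching the header in the question), not the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 27 features, binary label.
# Replace X and y with the arrays obtained from the ARFF or CSV file.
rng = np.random.default_rng(0)
X = rng.random((200, 27))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Hold out a quarter of the rows to estimate performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the classifier on the training part and score it on the held-out part.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

print(f"held-out accuracy: {accuracy:.2f}")
```

Other scikit-learn classifiers follow the same fit / predict / score pattern, so swapping in a different model only changes the line that constructs clf.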