python shogun

Loading data with shogun toolbox


I'm trying to use the Shogun toolbox to classify people in this dataset as drowned or not.

I would like to use Shogun classes like CFile, LibSVMFile, SparseRealFeatures, etc., as mentioned in the Shogun introduction, but I'm getting stuck.

First of all, that introduction loads a LibSVMFile directly in that format, but the author doesn't mention how the data file was generated from CSV (which is the original format of the dataset used there)...
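For reference, the LibSVM text format stores one example per line as a label followed by sparse index:value pairs with 1-based indices. The conversion from CSV could be sketched roughly as below, using only the standard library. This is a minimal sketch and not how the introduction's author produced the file: the function name csv_to_libsvm and the column names in the usage example are made up for illustration, and it assumes every column is already numeric.

```python
import csv
import io


def csv_to_libsvm(csv_text, label_column):
    """Convert numeric CSV text (with a header row) into LibSVM-format lines.

    Hypothetical helper for illustration; assumes all columns are numeric.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        # The label is written first and is not counted as a feature.
        label = row.pop(label_column)
        pairs = []
        # Feature indices are 1-based and follow the remaining column order.
        for index, value in enumerate(row.values(), start=1):
            number = float(value)
            if number != 0.0:  # zero entries are omitted in the sparse format
                pairs.append(f"{index}:{number:g}")
        lines.append(" ".join([label] + pairs))
    return lines


# Usage with made-up column names:
example = "label,a,b,c\n1,0.5,0,3\n0,0,2,0\n"
for line in csv_to_libsvm(example, "label"):
    print(line)
```

Categorical columns would have to be turned into numbers first, since float() fails on text values.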

As I don't have a dataset in the required format, I have tried to load my dataset with the CFile class or, even better, with the CCSVFile class, but I got:

NameError: name 'CFile' is not defined

and

NameError: name 'CCSVFile' is not defined

(I'm using Shogun through Python 3, compiled from source on Ubuntu 17.10, and I'm importing everything with "from shogun import *".)

Nevertheless, when I use

data_file=LibSVMFile(os.path.join(SHOGUN_DATA_DIR, 'train.csv'))

as in the example, there is no error about undefined classes but, as expected for a CSV file, it crashes with:

[1]    8870 segmentation fault (core dumped)  python3 titanic.py

I would like to know the correct way to use these Shogun classes to load datasets...

Other Shogun notebooks don't use them and just load the dataset with other libraries, and I'm starting to think that is the best way.


Solution

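  • the C prefix of the C++ class names is not exposed in Shogun's Python interface, which is why LibSVMFile resolves while CCSVFile raises a NameError; the CSV reader is CSVFile. To read a CSV file you should run the following (in Python):

    import shogun as sg
    train_csv = sg.CSVFile("train.csv")

    but note that the file contains a lot of categorical columns that need encoding, so you should first do some data munging before actually trying to use it in Shogun models.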
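The munging step can be sketched with the standard library alone. The example below one-hot encodes the categorical columns so that every value in the output is numeric. It is only an illustration under stated assumptions: the helper name one_hot_encode and the column names (survived, sex, age) are hypothetical, and it assumes there are no missing values, which a real dataset would need handled first.

```python
import csv
import io


def one_hot_encode(csv_text, categorical_columns):
    """Return (header, rows) with categorical columns expanded to 0/1 columns.

    Hypothetical helper for illustration; assumes no missing values and that
    every non-categorical column parses as a float.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fieldnames = reader.fieldnames or []
    # Collect the sorted set of values observed in each categorical column.
    categories = {
        name: sorted({row[name] for row in rows}) for name in categorical_columns
    }
    # Each categorical column becomes one "name=value" column per category.
    header = []
    for name in fieldnames:
        if name in categories:
            header.extend(f"{name}={value}" for value in categories[name])
        else:
            header.append(name)
    encoded = []
    for row in rows:
        out = []
        for name in fieldnames:
            if name in categories:
                out.extend(
                    1.0 if row[name] == value else 0.0
                    for value in categories[name]
                )
            else:
                out.append(float(row[name]))
        encoded.append(out)
    return header, encoded


# Usage with made-up column names and values:
example = "survived,sex,age\n1,female,29\n0,male,40\n"
header, encoded = one_hot_encode(example, ["sex"])
print(header)
print(encoded)
```

The numeric rows could then be written to a new file with csv.writer and loaded from there, instead of pointing the loader at the raw train.csv.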