python csv dataset txt arff

Getting "ParserError" when I try to read a .txt file using pd.read_csv()


I am trying to convert this dataset: COCOMO81 to arff.

Before converting to .arff, I am trying to convert it to .csv

I am following this LINK to do this.

I got the dataset from the PROMISE repository site. I copied the entire page into Notepad, saved it as cocomo81.txt, and am now trying to convert that file to .csv using Python. (I intend to convert the .csv file to .arff later using Weka.)

However, when I run

import pandas as pd
read_file = pd.read_csv(r"cocomo81.txt")

I get THIS ParserError.

To fix this, I followed this solution and modified my command to

read_file = pd.read_csv(r"cocomo81.txt",on_bad_lines='warn')

I got a bunch of warnings - you can see what it looks like here

and then I ran read_file.to_csv(r'.\cocomo81csv.csv',index=None)

But it seems that the fix for ParserError didn't work in my case because my cocomo81csv.csv file looks like THIS in Excel.

Can someone please help me understand where I am going wrong, and how I can use datasets from the PROMISE repository in .arff format?


Solution

  • It looks like a CSV file whose first lines are not data. Lines starting with % are comments, and the lines starting with @ appear to be ARFF header declarations (@relation, @attribute, @data), so the file may already be in ARFF format. The actual CSV data starts at line 230.

    To read it with pandas, skip the first 229 rows and set the column names manually. Try something like this:

    import pandas as pd

    # set column names manually (the header lines are skipped, so pandas cannot infer them)
    col_names = ["rely", "data", "cplx", "time", "stor", "virt", "turn", "acap", "aexp", "pcap", "vexp", "lexp", "modp", "tool", "sced", "loc", "actual"]
    filename = "cocomo81.arff.txt"

    # read csv data: skip the 229 header lines so parsing starts at line 230
    df = pd.read_csv(filename, skiprows=229, sep=',', decimal='.', header=None, names=col_names)

    print(df)
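
If you would rather not hard-code skiprows=229, the header can be scanned for its end instead. The block below is only a sketch, and it rests on assumptions not confirmed by the question: that attribute lines have the form "@attribute name type" with no quoted names containing spaces, and that the data section begins right after a line starting with "@data". The helper find_arff_data_start and the sample text are hypothetical, made up for illustration, and are not the real COCOMO81 file.

```python
import io


def find_arff_data_start(lines):
    """Return (column_names, n_header_lines) for ARFF-style text.

    Assumption: attribute lines look like '@attribute name type' and the
    data rows start on the line after the one beginning with '@data'.
    """
    names = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@attribute"):
            # the second whitespace-separated token is the attribute name
            names.append(stripped.split()[1])
        elif lowered.startswith("@data"):
            # everything up to and including this line is header
            return names, i + 1
    raise ValueError("no @data line found")


# Tiny made-up sample, NOT the real dataset
sample = """% a comment
@relation demo
@attribute rely numeric
@attribute loc numeric
@data
0.88,113
1.15,293
"""

col_names, skip = find_arff_data_start(io.StringIO(sample))
print(col_names, skip)
```

With a real file you would pass an open file handle instead of the sample, then hand the results to pandas, for example pd.read_csv(filename, skiprows=skip, header=None, names=col_names). If comment lines also appear after @data, adding comment='%' to that call may help.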