I have a DataFrame df in which each row is a string of 13 comma-separated values. I want df2 to be a DataFrame of LabeledPoint objects: the first value is the label and the other twelve are the features. I use split and select to turn the 13-value string into an array of 13 values, then map to create each LabeledPoint. The error occurs when I call toDF() to convert the RDD to a DataFrame:
df2 = df.select(split(df[0], ',')).map(lambda x: LabeledPoint(float(x[0]),x[-12:])).toDF()
org.apache.spark.SparkException: Job aborted due to stage failure:
When I look at the stack trace, I find:
IndexError: tuple index out of range.
To test this, I ran:
display(df.select(split(df[0], ',')))
I obtain my 13 values in an array for each row:
["2001.0","0.884123733793","0.610454259079","0.600498416968","0.474669212493","0.247232680947","0.357306088914","0.344136412234","0.339641227335","0.600858840135","0.425704689024","0.60491501652","0.419193351817"]
Any idea?
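The error comes from the indexing: x[0] should be replaced by x[0][0]. Each element passed to map is a Row with a single column holding the array, so x[0] is the whole array and x[0][0] is its first value (the label). So: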
df2 = df.select(split(df[0], ',')).map(lambda x: LabeledPoint(float(x[0][0]), x[0][-12:])).toDF()
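To see why the extra index is needed, the shape of one row can be modelled in plain Python without Spark. This is only a sketch: a one-element tuple stands in for the single-column Row (an assumption made for illustration), and parse_row is a hypothetical helper name, not part of any Spark API. It also uses a shortened 13-value line rather than the exact data above.

```python
# Plain-Python sketch of the row shape; no Spark required.
# Assumption: a one-element tuple stands in for the single-column Row
# that select(split(...)) produces.

raw = "2001.0,0.88,0.61,0.60,0.47,0.24,0.35,0.34,0.33,0.60,0.42,0.60,0.41"

# The row has ONE column, and that column holds the 13-element array.
row = (raw.split(","),)


def parse_row(x):
    """Hypothetical helper: extract (label, features) from a one-column row."""
    values = x[0]                      # the array of 13 strings
    label = float(values[0])           # same as float(x[0][0])
    features = [float(v) for v in values[-12:]]  # same slice as x[0][-12:]
    return label, features


label, features = parse_row(row)

print(len(row))       # number of columns in the row
print(label)          # first value of the array
print(len(features))  # remaining values
```

Indexing the row once (x[0]) returns the whole array, not the label, which is why the original lambda failed and the second index is required.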