I'm new to using Spark's MLlib Python API. I have my data in CSV format like so:
Label 0 1 2 3 4 5 6 7 8 9 ... 758 759 760 761 762 763 764 765 766 767
0 -0.168307 -0.277797 -0.248202 -0.069546 0.176131 -0.152401 0.12664 -0.401460 0.125926 0.279061 ... -0.289871 0.207264 -0.140448 -0.426980 -0.328994 0.328007 0.486793 0.222587 0.650064 -0.513640
3 -0.313138 -0.045043 0.279587 -0.402598 -0.165238 -0.464669 0.09019 0.008703 0.074541 0.142638 ... -0.094025 0.036567 -0.059926 -0.492336 -0.006370 0.108954 0.350182 -0.144818 0.306949 -0.216190
2 -0.379293 -0.340999 0.319142 0.024552 0.142129 0.042989 -0.60938 0.052103 -0.293400 0.162741 ... 0.108854 -0.025618 0.149078 -0.917385 0.110629 0.146427
Can I use this as is if I load it like this?
df = spark.read.format("csv").option("header", "true").load("file.csv")
I'm attempting to train a Random Forest model. I've tried researching it, but there doesn't seem to be much written about it. I don't want to attempt it without being sure it will work, because the cluster I use has long queue times.
Yes! You'll want to infer the schema too; otherwise every column is read as a string.
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("file.csv")
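One caveat on "as is": the load itself is fine, but the DataFrame-based RandomForestClassifier expects all features in a single vector column, so the numeric columns have to be assembled first. Below is a minimal sketch, assuming you use the pyspark.ml API and that the label column is called Label as in your header. The helper names and numTrees=100 are placeholders of mine, not anything from your setup.

```python
def feature_columns(columns, label_col="Label"):
    """Return every column name except the label, preserving order."""
    return [c for c in columns if c != label_col]


def train_random_forest(df, label_col="Label", num_trees=100):
    """Assemble the feature columns into one vector column and fit a forest.

    Assumes df was loaded with inferSchema (or an explicit schema), so the
    feature columns are numeric rather than strings.
    """
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import VectorAssembler

    # VectorAssembler packs the separate numeric columns into "features".
    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns, label_col),
        outputCol="features",
    )
    forest = RandomForestClassifier(
        labelCol=label_col, featuresCol="features", numTrees=num_trees
    )
    return Pipeline(stages=[assembler, forest]).fit(df)
```

With the DataFrame from above, model = train_random_forest(df) returns a fitted pipeline, and model.transform(df) adds the prediction columns. It is worth trying this on a small sample locally before submitting to the cluster.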
If you have many files with the same column names and data types, save the schema and reuse it, since inferring it costs an extra pass over the data.
schema = df.schema
The next time you read a CSV file with the same columns, pass the saved schema to the reader with .schema() (it is a reader method, not an option) instead of inferring it again:
df = spark.read.format("csv").option("header", "true").schema(schema).load("file.csv")
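If the column layout is known up front, you can also skip inference on the first read by spelling the schema out yourself. The sketch below builds a DDL schema string, which the reader's .schema() method accepts in Spark 2.3 and later. It assumes an integer Label column followed by 768 double columns named 0 to 767, as the header in the question suggests. The function names are mine, for illustration only.

```python
def build_ddl_schema(n_features=768, label_col="Label"):
    """Build a DDL schema string: an integer label plus n_features doubles.

    Backticks are needed because the feature columns have numeric names.
    """
    fields = [f"`{label_col}` INT"]
    fields += [f"`{i}` DOUBLE" for i in range(n_features)]
    return ", ".join(fields)


def read_with_schema(spark, path, ddl):
    """Read a headered CSV using an explicit schema instead of inferSchema."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .schema(ddl)
        .load(path)
    )
```

Usage would look like df = read_with_schema(spark, "file.csv", build_ddl_schema()). Adjust n_features if your files have a different number of columns.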