csv pyspark data-science-experience

Read simple csv with PySpark


This is probably a silly issue, but I don't get it. I'm working in a Jupyter Notebook with Python 3.6 and Spark 2.4, hosted by IBM Watson Studio.

I have a simple csv file:

num,label
0,0
1,0
2,0
3,0

To read it, I use the following command:

labels = spark.read.csv(url, sep=',', header=True)

But when I check whether labels was read correctly, using labels.head(), I get Row(PAR1Љ��L�Q�� ='\x08\x00]')

What am I missing?


Solution

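  • This looks like an encoding issue.

    Try passing an explicit encoding as a reader option; also try UTF-8. Note that option() has to be called on the reader, before csv(), because csv() already returns a DataFrame:

    labels = spark.read.option("encoding", "ISO-8859-1").csv(url, sep=',', header=True)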
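If changing the encoding does not help, it may be worth checking whether the file is really plain-text CSV. The garbled column name in the question starts with PAR1, which is the 4-byte marker that Parquet files begin with. The following is only a minimal sketch, assuming the file can be opened as a local path; the helper name looks_like_parquet is hypothetical and not part of PySpark:

```python
# Hypothetical helper (not a PySpark API): checks the first bytes of a file.
# Assumption: `path` is a local file path that Python can open directly.

PARQUET_MAGIC = b"PAR1"  # Parquet files start with these 4 bytes


def looks_like_parquet(path):
    """Return True if the file at `path` starts with the Parquet magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC
```

If this returns True for the file behind url, then reading it with spark.read.parquet(url) instead of spark.read.csv(...) might be the thing to try.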