I have the following CSV, published by a third party, in which the values of a specific column contain a comma (for some inexplicable reason). The values of that column are either absent or enclosed in square brackets and double quotes, as each represents a range.
Here are a few such records from the CSV:
A,B
xxxxxxxxx,"['05-01', '06-30']"
yyyyyyyyy,"['04-01', '04-30']"
zzzzzzzzz,
The culprit is obviously the second column. Is there a way to correctly parse this CSV in Apache Spark (Scala) so as to obtain the following dataframe:
+----------+--------------------+
|A         |B                   |
+----------+--------------------+
|xxxxxxxxx |"['05-01', '06-30']"|
|yyyyyyyyy |"['04-01', '04-30']"|
|zzzzzzzzz |null                |
+----------+--------------------+
The default values of the delimiter and quote options allow you to parse the given CSV correctly:
scala> scala.io.Source.fromFile("source.csv").mkString
res2: String =
"A,B
xxxxxxxxx,"['05-01', '06-30']"
yyyyyyyyy,"['04-01', '04-30']"
zzzzzzzzz,
"
scala> val df = spark.read.option("header", "true").csv("source.csv")
df: org.apache.spark.sql.DataFrame = [A: string, B: string]
scala> df.show()
+---------+------------------+
| A| B|
+---------+------------------+
|xxxxxxxxx|['05-01', '06-30']|
|yyyyyyyyy|['04-01', '04-30']|
|zzzzzzzzz| NULL|
+---------+------------------+
scala>
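For clarity, here is the same read with those defaults spelled out explicitly. This is just a sketch; the values shown are Spark's defaults, so setting them changes nothing:

val df = spark.read
  .option("header", "true")
  .option("delimiter", ",")  // default field separator
  .option("quote", "\"")     // default quote character
  .csv("source.csv")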
NOTE that the values in column B do not have the double quotes around them, which is the correct interpretation of the given CSV content per the CSV format (RFC 4180):
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx