scala csv apache-spark

How to correctly read a CSV file while escaping delimiter comma placed within square brackets using Apache Spark and Scala?


I have the following CSV, published by a third party, in which the values for a specific column contain a comma (for some unexplainable reason). The value for that column is either absent or enclosed in square brackets and double quotes, as it represents a range.

Following is an excerpt of such records from the CSV:

A,B
xxxxxxxxx,"['05-01', '06-30']"
yyyyyyyyy,"['04-01', '04-30']"
zzzzzzzzz,

The culprit is obviously the second column. Is there a way to correctly parse this CSV in Apache Spark (Scala) so as to obtain the following dataframe:

+---------+--------------------+
|A        |B                   |
+---------+--------------------+
|xxxxxxxxx|"['05-01', '06-30']"|
|yyyyyyyyy|"['04-01', '04-30']"|
|zzzzzzzzz|null                |
+---------+--------------------+

Solution

  • The default values of the delimiter and quote options allow Spark to parse the given CSV correctly:

    scala> scala.io.Source.fromFile("source.csv").mkString
    res2: String =
    "A,B
    xxxxxxxxx,"['05-01', '06-30']"
    yyyyyyyyy,"['04-01', '04-30']"
    zzzzzzzzz,
    "
    
    scala> val df = spark.read.option("header", "true").csv("source.csv")
    df: org.apache.spark.sql.DataFrame = [A: string, B: string]
    
    scala> df.show()
    +---------+------------------+
    |        A|                 B|
    +---------+------------------+
    |xxxxxxxxx|['05-01', '06-30']|
    |yyyyyyyyy|['04-01', '04-30']|
    |zzzzzzzzz|              NULL|
    +---------+------------------+
    
    scala>
    

    NOTE that the values for B do not have the surrounding double quotes, which is the correct interpretation of the given CSV content per the CSV format:

    1. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double quotes. For example:

      "aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx