Hope everyone is doing well.
While going through the spark csv datasource options for the question I am quite confused on the difference between the various quote related options available.
Used the example mentioned in the linked question to understand the differences, but still a little confused. Thank you for all the help.
Let's run through a few examples.
val df = Seq(
(1, "aaa"),
(2, "b,c"),
(3, "d$e"),
(4, "f%g")
).toDF("x", "y")
df.show
+---+---+
| x| y|
+---+---+
| 1|aaa|
| 2|b,c|
| 3|d$e|
| 4|f%g|
+---+---+
Whenever you have commas in values that would be indistinguishable from field delimiters (like 2,b,c
) Spark can quote such values. It's double quotes by default (like 2,"b,c"
), but you can customize it using quote option.
df.coalesce(1).write.mode("overwrite")
.option("quote", "$")
.csv("x")
1,aaa
2,$b,c$
3,$d\$e$
4,f%g
We used dollar sign here. Not only b,c
was quoted, but since d$e
contained a dollar, it was quoted as well and the dollar itself was escaped.
I don't know why you would like to do that, but you can ask Spark not to escape fields containing quote character using escapeQuotes option.
df.coalesce(1).write.mode("overwrite")
.option("quote", "$")
.option("escapeQuotes", false)
.csv("x")
1,aaa
2,$b,c$
3,d$e
4,f%g
Here, d$e
remained unquoted.
You can also force it to quote every field using quoteAll.
df.coalesce(1).write.mode("overwrite")
.option("quote", "$")
.option("quoteAll", true)
.csv("x1")
$1$,$aaa$
$2$,$b,c$
$3$,$d\$e$
$4$,$f%g$
Note that quoteAll = true makes escapeQuotes irrelevant.
You can customize the escape character for the quote character inside quote using escape, if you don't like the default backslash.
df.coalesce(1).write.mode("overwrite")
.option("quote", "$")
.option("escape", "%")
.csv("x1")
1,aaa
2,$b,c$
3,$d%$e$
4,$f%g$
Finally, note that there's a standalone escape character in f%g
, so if you would like to escape that, use charToEscapeQuoteEscaping.
df.coalesce(1).write.mode("overwrite")
.option("quote", "$")
.option("escape", "%")
.option("quoteAll", true)
.option("charToEscapeQuoteEscaping", "^")
.csv("x1")
$1$,$aaa$
$2$,$b,c$
$3$,$d%$e$
$4$,$f^%g$
No, you can't go deeper and escape ^
...