Tags: scala, apache-spark, pyspark

PySpark write option for double quotes in CSV file not working properly


I am trying to write to a CSV file where I want the fields to be enclosed in double quotes, with | as the separator, but it is not working correctly. The problem is that a few of my values already contain double quotes, for example "Nordenham " (quote characters included in the value), and I want to write them to the CSV file as they are.

For example, below is my current output:

|"""Nordenham """|"E"|"W"

The expected output in the CSV file should be:

|"Nordenham "|"E"|"W"

Below is my code:

df_repartitioned.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "|") \
        .option("ignoreLeadingWhiteSpace", "true") \
        .option("ignoreTrailingWhiteSpace", "true") \
        .option("treatEmptyValuesAsNulls", "true") \
        .option("nullValue", "null") \
        .option("emptyValue", "null") \
        .option("quoteAll", "true") \
        .option("escape", "\"") \
        .save(output_path)

In some cases the value inside the double quotes may also have leading or trailing spaces, for example "Nordenham " or " Nordenham ". I am not sure how to handle these and write them to the CSV file as they are.


Solution

  • CSV is designed so that whatever you put in, you get exactly the same thing back. So if you put "x" (quote characters included in the value) into it, you can expect to get back exactly that, quote characters intact.

    Imagine you have two strings: "x" and x. If both of them were represented as "x" in the CSV, how would you determine which string to read back after deserialization? That is why the writer escapes the embedded quotes, producing """x""" with quoteAll enabled.

    So if you want to strip the " characters from your values, you need to do it yourself before writing them to CSV, as in the sketch below.
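
    A minimal sketch of that approach in PySpark, assuming the quoted values live in a column named "city" (a hypothetical name, since the question does not show the schema). regexp_replace strips the embedded quote characters, so quoteAll then adds exactly one pair of quotes around each field:

    from pyspark.sql import functions as F

    # Strip the double-quote characters embedded in the data itself, so that
    # quoteAll adds exactly one pair of quotes per field on write.
    # "city" is a hypothetical column name; repeat for every affected column.
    df_clean = df_repartitioned.withColumn(
        "city", F.regexp_replace(F.col("city"), '"', "")
    )

    df_clean.write.format("csv") \
            .option("header", "true") \
            .option("delimiter", "|") \
            .option("quoteAll", "true") \
            .option("escape", "\"") \
            .option("ignoreLeadingWhiteSpace", "false") \
            .option("ignoreTrailingWhiteSpace", "false") \
            .save(output_path)

    Note that on write, ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace default to true, which would trim the surrounding spaces in values such as "Nordenham " or " Nordenham " once the quotes are removed; setting them to "false", as above, keeps those spaces intact, so the row is written as |"Nordenham "|"E"|"W".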