apache-spark

How to Preserve Double Quotes in Input Data When Reading with Spark DataFrame?


I have a field in the input file: "The Good, the Bad and the Ugly (From ""The Good, the Bad and the Ugly"")" and I need it to remain exactly as "The Good, the Bad and the Ugly (From ""The Good, the Bad and the Ugly"")" after reading it into a Spark DataFrame. However, depending on the options I use, I get two different results:

When I use the following code:

# Read with both quote and escape set to the double-quote character
df = spark.read \
  .option("header", "false") \
  .option("quote", '"') \
  .option("escape", '"') \
  .csv(output_path)

The result is: The Good, the Bad and the Ugly (From "The Good, the Bad and the Ugly")

When I use this code:

# Same read, but without the escape option
df = spark.read \
  .option("header", "false") \
  .option("quote", '"') \
  .csv(output_path)

The result is: "The Good, the Bad and the Ugly (From ""The Good| the Bad and the Ugly"")"

Can anyone suggest a workaround to ensure that the field remains exactly as "The Good, the Bad and the Ugly (From ""The Good, the Bad and the Ugly"")" at the reading step, and that the field doesn't get split at the comma? This has to be handled during reading, because at the writing step the data will be written out as a text file.


Solution

  • Maybe you can use withColumn with the concat() function after reading, as shown in the answer to this question (see the sketch below):
    add double quotes at the start and end of each string of column pyspark
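
A minimal sketch of that approach, assuming a headerless single-column file (Spark names the column _c0 by default) and the same output_path as in the question. The regexp_replace step is an addition here beyond the plain concat() in the linked answer: it re-doubles the embedded quotes so the "" escaping is restored exactly.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read with quote and escape both set to '"': the field parses as one
# column, with the outer quotes stripped and "" collapsed to ".
df = spark.read \
  .option("header", "false") \
  .option("quote", '"') \
  .option("escape", '"') \
  .csv(output_path)

# Rebuild the original literal: re-double any embedded quotes,
# then wrap the whole value in double quotes again.
df = df.withColumn(
    "_c0",
    F.concat(
        F.lit('"'),
        F.regexp_replace(F.col("_c0"), '"', '""'),
        F.lit('"'),
    ),
)

Since the reconstructed column holds the exact original string, writing it out later as a plain text file (e.g. with df.write.text, which accepts a single string column) should reproduce the input field unchanged.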