python, apache-spark, pyspark, aws-glue, apache-spark-3.0

pyspark trimming all fields by default while writing to csv in python


I am trying to write a dataset to a CSV file with Python code on Spark 3.3 (Scala 2), and by default it trims all String fields. For example, for the column values below:

" Text123"," jacob "

the output in csv is:

"Text123","jacob"

I don't want any String fields to be trimmed.
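Here is a minimal, Glue-free sketch of the behavior (a local SparkSession; the column names and output path are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

df = spark.createDataFrame([(" Text123", " jacob ")], ["col1", "col2"])

# With the writer's default options the padded values come out trimmed,
# i.e. the data row is written as: Text123|jacob
df.write.format("csv") \
    .option("header", "true") \
    .option("delimiter", "|") \
    .save("/tmp/trim_demo")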

Below is my actual Glue job code:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['target_BucketName', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Convert DynamicFrame to DataFrame 
df_app = AWSGlueDataCatalog_node.toDF()

# Repartition the DataFrame to control output files APP
df_repartitioned_app = df_app.repartition(10)  

# Check for empty partitions and write only if data is present
if not df_repartitioned_app.rdd.isEmpty():
    df_repartitioned_app.write.format("csv") \
        .option("compression", "gzip") \
        .option("header", "true") \
        .option("delimiter", "|") \
        .save(output_path_app)

Solution

  • Set the ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options to false. Spark's CSV writer strips leading and trailing whitespace by default (both options default to true on write):

        df_repartitioned_app.write.format("csv") \
            .option("compression", "gzip") \
            .option("header", "true") \
            .option("delimiter", "|") \
            .option("ignoreLeadingWhiteSpace", "false") \
            .option("ignoreTrailingWhiteSpace", "false") \
            .save(output_path_app)
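
    As a sanity check, a quick read-back shows the padding surviving the round trip (the column names shown are illustrative; the CSV reader's defaults for these two options are already false, so no extra options are needed on read):

        back = spark.read.format("csv") \
            .option("header", "true") \
            .option("delimiter", "|") \
            .load(output_path_app)
        back.show(truncate=False)
        # Expected:
        # +--------+-------+
        # |col1    |col2   |
        # +--------+-------+
        # | Text123| jacob |
        # +--------+-------+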