Tags: pyspark, pyspark-schema

Spark merge schema, correcting data types (timestamp vs. string)


I was reading a Spark DataFrame with the options below:

testDF = spark.read.format("parquet").option("header", "true") \
        .option("mergeSchema", "true").option("inferSchema", "true").load("folderPath/*/*")

However, this fails because one of the columns (Date) is of type timestamp in some source files and of type string in others. I don't have control over the data producers, so I'd like to know how I can handle this while processing.

The challenge is that the column is randomly either timestamp or string across files.
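
For reference, here is a minimal sketch to confirm which files carry which type (assuming Spark 3.1+ for inputFiles(); the per-file reads are just for diagnosis, not part of my pipeline):

base = spark.read.format("parquet").load("folderPath/*/*")  # no mergeSchema, so this read succeeds
for path in base.inputFiles():  # list the concrete files behind the glob
    per_file = spark.read.parquet(path)  # schema comes from this one file's footer
    if "Date" in per_file.columns:
        print(path, per_file.schema["Date"].dataType)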

Thanks.


Solution

  • Since you're using Spark to read Parquet files, one of the advantages is the schema-on-read approach: you can declare the schema yourself when you read the data instead of relying on inference (note that the "header" and "inferSchema" options only apply to CSV and are ignored for Parquet). For example:

    from pyspark.sql import types

    schema = types.StructType([
        types.StructField('date', types.TimestampType()),
        ...  # declaration of other columns
    ])
    
    testDF = spark.read.format("parquet")\
        .option('mergeSchema', 'true')\
        .schema(schema)\
        .load("folderPath/*/*")