pythonapache-sparkapache-spark-dataset

Spark DataFrame casting string to date results in null values


I get null when I attempt to cast string date in Spark DataFrame to date type.

# Create a list of data
data = [(1, "20230517"), (2, "20230518"), (3, "20230519"), (4, "null")]

# Create a DataFrame from the list of data
df = spark.createDataFrame(data, ("id", "date"))

df.show()


df.printSchema()

root
 |-- id: long (nullable = true)
 |-- date: string (nullable = true)


# Convert the SaleDate column to datetime format
df1 = df.withColumn("date", df.date.cast('date'))
df1.select('date').show()

+--------+
|date    |
+--------+
|    null|
|    null|
|    null|
|    null|

Solution

  • For this operation you should use F.to_date() and specify the format which you want to parse (yyyyMMdd in your case):

    F.to_date('date', format='yyyyMMdd')
    

    Full code I used:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    
    spark = SparkSession.builder.appName('spark_session').getOrCreate()
    
    # Create a list of data
    data = [(1, "20230517"), (2, "20230518"), (3, "20230519"), (4, "null")]
    
    # Create a DataFrame from the list of data
    df = spark.createDataFrame(data, ("id", "date"))
    
    # Convert the SaleDate column to datetime format
    df1 = df.withColumn("date", F.to_date('date', format='yyyyMMdd'))
    df1.select('date').show()
    
    +----------+
    |      date|
    +----------+
    |2023-05-17|
    |2023-05-18|
    |2023-05-19|
    |      null|
    +----------+