I get null
when I attempt to cast string date in Spark DataFrame to date
type.
# Create a list of data
data = [(1, "20230517"), (2, "20230518"), (3, "20230519"), (4, "null")]
# Create a DataFrame from the list of data
df = spark.createDataFrame(data, ("id", "date"))
df.show()
df.printSchema()
root
|-- id: long (nullable = true)
|-- date: string (nullable = true)
# Convert the SaleDate column to datetime format
df1 = df.withColumn("date", df.date.cast('date'))
df1.select('date').show()
+--------+
|date |
+--------+
| null|
| null|
| null|
| null|
For this operation you should use F.to_date()
and specify the format which you want to parse (yyyyMMdd
in your case):
F.to_date('date', format='yyyyMMdd')
Full code I used:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.appName('spark_session').getOrCreate()
# Create a list of data
data = [(1, "20230517"), (2, "20230518"), (3, "20230519"), (4, "null")]
# Create a DataFrame from the list of data
df = spark.createDataFrame(data, ("id", "date"))
# Convert the SaleDate column to datetime format
df1 = df.withColumn("date", F.to_date('date', format='yyyyMMdd'))
df1.select('date').show()
+----------+
| date|
+----------+
|2023-05-17|
|2023-05-18|
|2023-05-19|
| null|
+----------+