I create my PySpark DataFrame:
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

input_schema = StructType([
    StructField("key", StringType()),
    StructField("headers", ArrayType(
        StructType([
            StructField("key", StringType()),
            StructField("value", StringType())
        ])
    )),
    StructField("timestamp", TimestampType())
])

input_data = [
    ("key1", [{"key": "header1", "value": "value1"}], datetime(2023, 1, 1, 0, 0, 0)),
    ("key2", [{"key": "header2", "value": "value2"}], datetime(2023, 1, 1, 0, 0, 0)),
    ("key3", [{"key": "header3", "value": "value3"}], datetime(2023, 1, 1, 0, 0, 0))
]

df = spark.createDataFrame(input_data, input_schema)
I want to use Pandas' assert_frame_equal(), so I need to convert my DataFrame to a Pandas DataFrame. However, calling
df.toPandas()
throws: TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.
How can I convert the "timestamp" column without losing detail of the datetime value? The values need to stay as 2023-01-01 00:00:00 and not be truncated to 2023-01-01.
I found the solution:
from pyspark.sql.functions import date_format
df = df.withColumn("timestamp", date_format("timestamp", "yyyy-MM-dd HH:mm:ss")).toPandas()
Note that date_format turns the "timestamp" column into strings, so the comparison is done on the formatted string values, but the full 2023-01-01 00:00:00 value is preserved. Now I was able to use
assert_frame_equal(df, test_df)
successfully. It did not lose precision.
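For completeness, a minimal sketch of how that comparison can look. Here expected is a hypothetical stand-in for test_df, the timestamp values are supplied as strings (since date_format produces strings), and only the flat columns are compared to side-step how toPandas() materialises the nested "headers" column.

import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical expected frame for illustration; after date_format the
# "timestamp" column holds plain strings, so the expected values are strings too.
expected = pd.DataFrame({
    "key": ["key1", "key2", "key3"],
    "timestamp": ["2023-01-01 00:00:00"] * 3,
})

# Compare only the columns of interest from the converted pandas DataFrame.
assert_frame_equal(
    df[["key", "timestamp"]].reset_index(drop=True),
    expected,
)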