python, pandas, dataframe, apache-spark, pyspark

Convert PySpark Dataframe to Pandas Dataframe fails on timestamp column


I create my pyspark dataframe:

    from datetime import datetime

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, ArrayType, StringType, TimestampType

    spark = SparkSession.builder.getOrCreate()

    input_schema = StructType([
        StructField("key", StringType()),
        StructField("headers", ArrayType(
            StructType([
                StructField("key", StringType()),
                StructField("value", StringType())
            ])
        )),
        StructField("timestamp", TimestampType())
    ])

    input_data = [
        ("key1", [{"key": "header1", "value": "value1"}], datetime(2023, 1, 1, 0, 0, 0)),
        ("key2", [{"key": "header2", "value": "value2"}], datetime(2023, 1, 1, 0, 0, 0)),
        ("key3", [{"key": "header3", "value": "value3"}], datetime(2023, 1, 1, 0, 0, 0))
    ]

    df = spark.createDataFrame(input_data, input_schema)

I want to use pandas' assert_frame_equal(), so I need to convert my dataframe to a Pandas dataframe.

Calling df.toPandas() throws:

    TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.

How can I convert the "timestamp" column without losing detail of the datetime values? I need them to remain 2023-01-01 00:00:00 and not be truncated to 2023-01-01.
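
For context, this TypeError comes from pandas 2.x, which dropped support for casting to the unit-less datetime64 dtype. The same failure can be reproduced in pandas alone (a minimal sketch, assuming pandas >= 2.0):

    import pandas as pd

    s = pd.Series([pd.Timestamp("2023-01-01 00:00:00")])

    # In pandas 2.x this raises:
    # TypeError: Casting to unit-less dtype 'datetime64' is not supported.
    # s.astype("datetime64")

    # An explicit unit is accepted:
    s = s.astype("datetime64[ns]")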


Solution

  • I found the solution:

        from pyspark.sql.functions import date_format

        # date_format returns a string column, so toPandas() no longer has to
        # cast a timestamp and the full "yyyy-MM-dd HH:mm:ss" value is kept
        df = df.withColumn("timestamp", date_format("timestamp", "yyyy-MM-dd HH:mm:ss")).toPandas()
    

    Now I was able to use

        from pandas.testing import assert_frame_equal

        assert_frame_equal(df, test_df)
    

    successfully. The timestamps kept their full precision.
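
    Note that date_format produces a string column, so "timestamp" arrives in pandas with dtype object, and test_df must hold the same string values for the comparison to pass. If you would rather keep a true datetime column, one possible alternative is Spark's Arrow-based conversion (a sketch, assuming Spark 3.x; whether it avoids the TypeError depends on your Spark and pandas versions):

        # Arrow-based toPandas() maps TimestampType to datetime64[ns] directly;
        # df here means the Spark dataframe from the question, before date_format
        spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

        pdf = df.toPandas()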