python, apache-spark, pyspark, python-unittest

assertDataFrameEqual doesn't throw an error with a None DataFrame in PySpark


When I try to assert DataFrame equality using the PySpark testing API, and one of the DataFrames is None, I do not get an assertion error; instead, the method returns False. Is this a bug, or should I handle my test verification differently?

from pyspark.sql import SparkSession
from pyspark.testing.utils import assertDataFrameEqual

spark = SparkSession.builder.getOrCreate()

# Case 1: `expected` is None -- no assertion error, the call returns False
assertDataFrameEqual(spark.createDataFrame([("v1", "v3")]), None)
# False

# Case 2: two differing DataFrames -- the assertion error is raised as expected
assertDataFrameEqual(spark.createDataFrame([("v1", "v3")]), spark.createDataFrame([("v1", "v2")]))
# PySparkAssertionError

Solution

  • When I run your code, I do get an error:

    [INVALID_TYPE_DF_EQUALITY_ARG] Expected type Union[DataFrame, ps.DataFrame, List[Row]] for `expected` but got type None.
    

    I do not see why this would be different for you: every version containing this function starts with the same check for None values, as found in the source:

        if actual is None and expected is None:
            return True
        elif actual is None:
            raise PySparkAssertionError(
                error_class="INVALID_TYPE_DF_EQUALITY_ARG",
                message_parameters={
                    "expected_type": "Union[DataFrame, ps.DataFrame, List[Row]]",
                    "arg_name": "actual",
                    "actual_type": None,
                },
            )
        elif expected is None:
            raise PySparkAssertionError(
                error_class="INVALID_TYPE_DF_EQUALITY_ARG",
                message_parameters={
                    "expected_type": "Union[DataFrame, ps.DataFrame, List[Row]]",
                    "arg_name": "expected",
                    "actual_type": None,
                },
            )
    

    This means that in your case it should raise the error I got, coming from the last elif branch.
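
    Incidentally, if the goal is to have the test fail whenever `expected` ends up as None, the exception can be asserted explicitly. A minimal sketch using unittest (the class and test names are illustrative, not from your code):

        import unittest

        from pyspark.errors import PySparkAssertionError
        from pyspark.sql import SparkSession
        from pyspark.testing.utils import assertDataFrameEqual


        class DataFrameEqualityTest(unittest.TestCase):
            def test_none_expected_raises(self):
                spark = SparkSession.builder.getOrCreate()
                actual = spark.createDataFrame([("v1", "v3")])
                # Per the source above, a None `expected` should raise
                # INVALID_TYPE_DF_EQUALITY_ARG rather than return False.
                with self.assertRaises(PySparkAssertionError):
                    assertDataFrameEqual(actual, None)


        if __name__ == "__main__":
            unittest.main()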

    Although it is unclear why your behaviour differs, a comparison that at least satisfies the type check is to wrap the None in a list, as sketched below.
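
    For example, a minimal sketch (reusing the `spark` session from your snippet; I assume the mismatched rows then fail the comparison):

        # [None] is a list, so it passes the initial type check; the row
        # comparison is then expected to raise a PySparkAssertionError
        # instead of silently returning False.
        assertDataFrameEqual(spark.createDataFrame([("v1", "v3")]), [None])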

    Edit: Note that the code above was run on Azure Databricks DBR 14.3 LTS with PySpark 3.5.0.