When I try to assert dataframe equality using the PySpark testing API and one of the dataframes is None, I do not get an assertion error; instead, the method returns False. Is this a bug, or should I handle my test verification differently?
from pyspark.testing.utils import assertDataFrameEqual
assertDataFrameEqual(spark.createDataFrame([("v1", "v3")]), None)
# False
from pyspark.testing.utils import assertDataFrameEqual
assertDataFrameEqual(spark.createDataFrame([("v1", "v3")]), spark.createDataFrame([("v1", "v2")]))
# PySparkAssertionError
When I run your code, I do get an error:
[INVALID_TYPE_DF_EQUALITY_ARG] Expected type Union[DataFrame, ps.DataFrame, List[Row]] for `expected` but got type None.
I do not see why this would be different for you: every version that contains this function starts with the same check for None values, as found in the source:
if actual is None and expected is None:
    return True
elif actual is None:
    raise PySparkAssertionError(
        error_class="INVALID_TYPE_DF_EQUALITY_ARG",
        message_parameters={
            "expected_type": "Union[DataFrame, ps.DataFrame, List[Row]]",
            "arg_name": "actual",
            "actual_type": None,
        },
    )
elif expected is None:
    raise PySparkAssertionError(
        error_class="INVALID_TYPE_DF_EQUALITY_ARG",
        message_parameters={
            "expected_type": "Union[DataFrame, ps.DataFrame, List[Row]]",
            "arg_name": "expected",
            "actual_type": None,
        },
    )
This means that in your case the last elif should raise the same error I got.
Although it is unclear why your behaviour differs, a comparison that is valid with respect to the accepted types would be to wrap the None in a list, as sketched below.
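A minimal sketch of that workaround, assuming the spark session and dataframe from your example: the wrapped None is accepted as a List[Row] argument, so the call should fail with a PySparkAssertionError about differing rows rather than silently returning False (the exact error class may vary by PySpark version):
from pyspark.testing.utils import assertDataFrameEqual
# Wrapping None in a list satisfies the Union[DataFrame, ps.DataFrame, List[Row]] check,
# so the call no longer hits the None type validation shown above.
actual_df = spark.createDataFrame([("v1", "v3")])
assertDataFrameEqual(actual_df, [None])
# Expected: PySparkAssertionError (rows differ), instead of a silent False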
Edit: Note that the code was run on Azure Databricks DBR 14.3 LTS, PySpark 3.5.0.