I'm testing Great Expectations to get the invalid records when they violate the defined rules. The documentation says we can specify include_unexpected_rows or return_unexpected_index_query in the result format, but neither of them works for me. I'm applying the expectation to a Spark DataFrame; below is my code:
import great_expectations as ge
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
df = spark.read.table("data_quality_test")
df_ge = SparkDFDataset(df)
result_format = {
    "result_format": "COMPLETE",
    "include_unexpected_rows": True,
}
result = df_ge.expect_column_values_to_be_in_type_list("page_title", ["DateType"], result_format=result_format)
print(result)
Could anyone please help me figure out what the problem is?
I think there are two things going on in your example:

1. expect_column_values_to_be_in_type_list in Spark will just check the type of the whole column, so there are no per-row unexpected values for include_unexpected_rows to return.
2. expect_column_values_to_be_in_set is evaluated per value, so it will check row-wise and can report the offending rows.

Here is an example with expect_column_values_to_be_in_set, using the get_context / validator API:

import great_expectations as gx
context = gx.get_context()

# Register a Spark datasource and a dataframe asset, then build a validator for the dataframe
asset = context.sources.add_spark("spark").add_dataframe_asset("data_quality_test")
df = spark.read.table("data_quality_test")
validator = context.get_validator(batch_request=asset.build_batch_request(dataframe=df))

result_format = {
    "result_format": "COMPLETE",
    "include_unexpected_rows": True,
}

result = validator.expect_column_values_to_be_in_set("page_title", ["foo"], result_format=result_format)
print(result)
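Once that expectation has run, the row-level details are in the returned validation result. A minimal sketch of how you might read them, assuming a recent Great Expectations release (the exact keys, in particular unexpected_rows, depend on the version and on whether the execution engine honours include_unexpected_rows):

# Row-level details sit in the "result" dictionary of the validation result
details = result.result
print(details.get("unexpected_count"))          # how many values fell outside the set
print(details.get("partial_unexpected_list"))   # a sample of the offending values
# Only populated when include_unexpected_rows is supported by the backend
print(details.get("unexpected_rows"))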