Tags: pyspark, great-expectations

Great Expectations: get invalid records


I'm testing Great Expectations to get the invalid records when they violate the defined rules. The documentation says we can specify include_unexpected_rows or return_unexpected_index_query in the result format. However, neither of them works for me. I'm applying the expectation to a Spark DataFrame; below is my code:

import great_expectations as ge
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset

df = spark.read.table("data_quality_test")
df_ge = SparkDFDataset(df)
result_format = {
    "result_format": "COMPLETE",
    "include_unexpected_rows": True,
}
result = df_ge.expect_column_values_to_be_in_type_list("page_title", ["DateType"], result_format=result_format)
print(result)

Could anyone please help me figure out what the problem is?


Solution

  • I think there are two things going on in your example:

    1. To get the complete rows back, you need an expectation that evaluates individual rows, but expect_column_values_to_be_in_type_list in Spark just checks the type of the whole column.
    2. You have to use the newer GX datasource API to get complete rows back. It's a bit more verbose (I know that's being fixed in the new 1.0 API coming shortly), but it would look like this (notice I switched to expect_column_values_to_be_in_set so the check runs row-wise):
    import great_expectations as gx

    context = gx.get_context()

    # Register a Spark datasource and a dataframe asset for the table
    asset = context.sources.add_spark("spark").add_dataframe_asset("data_quality_test")
    df = spark.read.table("data_quality_test")

    # Build a validator for this dataframe batch
    validator = context.get_validator(batch_request=asset.build_batch_request(dataframe=df))

    result_format = {
        "result_format": "COMPLETE",
        "include_unexpected_rows": True,
    }
    result = validator.expect_column_values_to_be_in_set("page_title", ["foo"], result_format=result_format)
    print(result)
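
    Once that runs, the failing values (and, when include_unexpected_rows is honored, the full rows) are reported in the result payload. Here is a minimal sketch of how you might read them out, assuming the key names documented for the COMPLETE result format (unexpected_list, unexpected_rows); exact keys can vary by GX version:

    # Sketch only: key names assumed from the COMPLETE result format docs
    details = result.result

    print(details.get("unexpected_count"))  # how many values failed the expectation
    print(details.get("unexpected_list"))   # the failing values themselves
    print(details.get("unexpected_rows"))   # full rows, if include_unexpected_rows took effect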