pysparkaws-gluedata-quality

How to get rows that failed CustomSql data quality check in AWS Glue


According to this documentation page, AWS Glue can now detect rows that failed a CustomSql data quality check.

I tried it and I am not seeing the rows that failed but only a % of failed data. Here is my code:

from pyspark.sql import SparkSession
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.sql import SparkSession, Row
from pyspark.sql.dataframe import DataFrame


spark = SparkSession.builder.getOrCreate() 
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

glue_ctx = GlueContext(spark.sparkContext)

df = spark.createDataFrame([Row(campagne='dzdzd', num=5),Row(campagne='["campagne1", "campagne2"]', num=5),Row(campagne='["campagne1", "campagne2"]', num=1)])
dynamic_df = DynamicFrame.fromDF(df, glue_ctx)

EvaluateDataQuality.apply(
        frame=dynamic_df,
        ruleset="""
    Rules = [
        CustomSql "select campagne from primary where num = 5"
    ]
    """,
        publishing_options={}).toDF().show(truncate=False)

This returning the following DataFrame:

+------------------------------------------------------+-------+---------------------------------------------------+------------------------------------------------------+------------------------------------------------------+
|Rule                                                  |Outcome|FailureReason                                      |EvaluatedMetrics                                      |EvaluatedRule                                         |
+------------------------------------------------------+-------+---------------------------------------------------+------------------------------------------------------+------------------------------------------------------+
|CustomSql "select campagne from primary where num = 5"|Failed |Custom SQL response failed to satisfy the threshold|{Dataset.*.CustomSQL.Compliance -> 0.6666666666666666}|CustomSql "select campagne from primary where num = 5"|
+------------------------------------------------------+-------+---------------------------------------------------+------------------------------------------------------+------------------------------------------------------+

Obviously, I would like to have the rows that failed and only a %.


Solution

  • For future reference:

    This is the answer for my question from AWS documentation:

    EvaluateDataQualityMultiframe = EvaluateDataQuality().process_rows(
        frame= dynamic_df,
        ruleset="""
        Rules = [
            CustomSql "select campagne from primary where num = 5"
        ]
        """,,
        publishing_options={
            "dataQualityEvaluationContext": "EvaluateDataQualityMultiframe",
            "enableDataQualityCloudWatchMetrics": False,
            "enableDataQualityResultsPublishing": False,
        },
        additional_options={"performanceTuning.caching": "CACHE_NOTHING"},
    )    
    
    RowLevelOutcomes = SelectFromCollection.apply(
    dfc=EvaluateDataQualityMultiframe,
    key="rowLevelOutcomes",
    transformation_ctx="rowLevelOutcomes",
    ).toDF().show(truncate=False)