According to this documentation page, AWS Glue can now detect rows that failed a CustomSql data quality check.
I tried it, but I only get a percentage of failed data, not the failing rows themselves. Here is my code:
from pyspark.sql import SparkSession, Row
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsgluedq.transforms import EvaluateDataQuality

spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
glue_ctx = GlueContext(spark.sparkContext)

# Sample data: two rows satisfy num = 5, one does not
df = spark.createDataFrame([
    Row(campagne='dzdzd', num=5),
    Row(campagne='["campagne1", "campagne2"]', num=5),
    Row(campagne='["campagne1", "campagne2"]', num=1),
])
dynamic_df = DynamicFrame.fromDF(df, glue_ctx, "dynamic_df")
EvaluateDataQuality.apply(
    frame=dynamic_df,
    ruleset="""
        Rules = [
            CustomSql "select campagne from primary where num = 5"
        ]
    """,
    publishing_options={},
).toDF().show(truncate=False)
This returns the following DataFrame:
+------------------------------------------------------+-------+---------------------------------------------------+------------------------------------------------------+------------------------------------------------------+
|Rule |Outcome|FailureReason |EvaluatedMetrics |EvaluatedRule |
+------------------------------------------------------+-------+---------------------------------------------------+------------------------------------------------------+------------------------------------------------------+
|CustomSql "select campagne from primary where num = 5"|Failed |Custom SQL response failed to satisfy the threshold|{Dataset.*.CustomSQL.Compliance -> 0.6666666666666666}|CustomSql "select campagne from primary where num = 5"|
+------------------------------------------------------+-------+---------------------------------------------------+------------------------------------------------------+------------------------------------------------------+
I would like to get the actual rows that failed, not only a percentage.
For future reference:
Here is the answer to my question, based on the AWS documentation:
EvaluateDataQualityMultiframe = EvaluateDataQuality().process_rows(
    frame=dynamic_df,
    ruleset="""
        Rules = [
            CustomSql "select campagne from primary where num = 5"
        ]
    """,
    publishing_options={
        "dataQualityEvaluationContext": "EvaluateDataQualityMultiframe",
        "enableDataQualityCloudWatchMetrics": False,
        "enableDataQualityResultsPublishing": False,
    },
    additional_options={"performanceTuning.caching": "CACHE_NOTHING"},
)
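process_rows returns a DynamicFrameCollection rather than a single frame. A quick way to check which frames it contains (a minimal sketch, assuming the collection exposes keys(); I expect entries like "rowLevelOutcomes" and "ruleOutcomes"):
# Inspect the frames returned by process_rows
print(list(EvaluateDataQualityMultiframe.keys()))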
from awsglue.transforms import SelectFromCollection

RowLevelOutcomes = SelectFromCollection.apply(
    dfc=EvaluateDataQualityMultiframe,
    key="rowLevelOutcomes",
    transformation_ctx="rowLevelOutcomes",
).toDF()
RowLevelOutcomes.show(truncate=False)
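The row-level DataFrame keeps the original columns and adds per-row data quality columns. As a minimal sketch (assuming the DataQualityEvaluationResult and DataQualityRulesFail columns that process_rows adds), you can filter it down to just the failing rows:
# Keep only the rows that failed the ruleset
failed_rows = RowLevelOutcomes.filter("DataQualityEvaluationResult = 'Failed'")
failed_rows.select("campagne", "num", "DataQualityRulesFail").show(truncate=False)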