scala dataframe apache-spark amazon-deequ

What do the result dataframe's columns of a Deequ check signify?


I ran a simple Deequ check in Spark that went something like this:

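import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}

// dataset (a DataFrame) and count_date (the expected row count) are defined elsewhere
val verificationResult: VerificationResult = VerificationSuite()
  .onData(dataset)
  .addCheck(
    Check(CheckLevel.Error, "Review Check")
      .isComplete("col1")
      .isUnique("col2")
      .hasSize(_ == count_date)
      .satisfies(
        "abs(col4 - col5) <= 0.20 * col5",
        "value(col4) lies between value(col5)-20% and value(col5)+20%"
      )
  )
  .run()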

val result1 = checkResultsAsDataFrame(spark, verificationResult)

Now, my result1 dataframe looks something like this:

+------------+-----------+------------+--------------------+-----------------+--------------------+
|       check|check_level|check_status|          constraint|constraint_status|  constraint_message|
+------------+-----------+------------+--------------------+-----------------+--------------------+
|Review Check|      Error|       Error|CompletenessConst...|          Success|                    |
|Review Check|      Error|       Error|UniquenessConstra...|          Failure|Value: 7.62664794...|
|Review Check|      Error|       Error|SizeConstraint(Si...|          Success|                    |
|Review Check|      Error|     Success|ComplianceConstra...|          Success|                    |
+------------+-----------+------------+--------------------+-----------------+--------------------+

I'm confused about the difference between the check_status and constraint_status columns. The results of my individual checks should be in the latter, right? Then what does the former imply?

I couldn't find anything that clarifies this in the Deequ blog either. Could someone please explain?


Solution

  • check_status is the overall status of the Check group you ran. It depends on the CheckLevel and on the status of the constraints. If you look at the code:

    val anyFailures = constraintResults.exists { _.status == ConstraintStatus.Failure }
    
    val checkStatus = (anyFailures, level) match {
      case (true, CheckLevel.Error) => CheckStatus.Error
      case (true, CheckLevel.Warning) => CheckStatus.Warning
      case (_, _) => CheckStatus.Success
    }
    

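    If any constraint in the Check fails, check_status takes the value of the Check's CheckLevel (Error or Warning); otherwise it is Success. constraint_status, by contrast, is the result of each individual constraint.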
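
To make the rule concrete, here is a minimal, self-contained sketch of the same logic. The types in it (CheckLevel, ConstraintStatus, CheckStatus) are local stand-ins that only mirror the names in the snippet above; they are not Deequ's own classes. The wrapper object CheckStatusSketch and the helper checkStatus are hypothetical names used for illustration.

```scala
// Local stand-ins mirroring the names used in the Deequ snippet; NOT Deequ's classes.
object CheckStatusSketch {
  sealed trait CheckLevel
  object CheckLevel {
    case object Error extends CheckLevel
    case object Warning extends CheckLevel
  }

  sealed trait ConstraintStatus
  object ConstraintStatus {
    case object Success extends ConstraintStatus
    case object Failure extends ConstraintStatus
  }

  sealed trait CheckStatus
  object CheckStatus {
    case object Success extends CheckStatus
    case object Warning extends CheckStatus
    case object Error extends CheckStatus
  }

  // Hypothetical helper: derives one overall status from the per-constraint results.
  def checkStatus(level: CheckLevel, constraints: Seq[ConstraintStatus]): CheckStatus = {
    val anyFailures = constraints.contains(ConstraintStatus.Failure)
    (anyFailures, level) match {
      case (true, CheckLevel.Error)   => CheckStatus.Error   // a failure escalates to the check's level
      case (true, CheckLevel.Warning) => CheckStatus.Warning
      case _                          => CheckStatus.Success // no failures: level is irrelevant
    }
  }
}
```

With the four constraints from the question (completeness, uniqueness, size, compliance), where only the uniqueness constraint fails, a Check declared with CheckLevel.Error yields an overall status of Error under this rule, while the same results under CheckLevel.Warning would yield Warning.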