So, I ran a simple Deequ check in Spark that went something like this:
val verificationResult: VerificationResult = { VerificationSuite()
Check(CheckLevel.Error, "Review Check")
.hasSize(_ == count_date)
"abs(col4 - col5) <= 0.20 * col5",
"value(col4) lies between value(col5)-20% and value(col5)+20%"
val result1 = checkResultsAsDataFrame(spark, verificationResult)
Now, my result1 dataframe looks something like this:
| check|check_level|check_status| constraint|constraint_status| constraint_message|
|Review Check| Error| Error|CompletenessConst...| Success| |
|Review Check| Error| Error|UniquenessConstra...| Failure|Value: 7.62664794...|
|Review Check| Error| Error|SizeConstraint(Si...| Success| |
|Review Check| Error| Success|ComplianceConstra...| Success| |
I'm confused between the columns check_status and constraint_status. How are they different? The results of my checks should be in the latter one, right? Then what does the former imply?
I couldn't find any clarity on this in the deequ blog either. Could someone please explain?
check_status is the overall status for the Check group you run. It depends on the CheckLevel and the constraint status. If you look at the code:
val anyFailures = constraintResults.exists { _.status == ConstraintStatus.Failure }
val checkStatus = (anyFailures, level) match {
  case (true, CheckLevel.Error) => CheckStatus.Error
  case (true, CheckLevel.Warning) => CheckStatus.Warning
  case (_, _) => CheckStatus.Success
}
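If any constraint in the Check fails, then check_status takes the value of the CheckLevel (Error or Warning). Otherwise it is Success. So constraint_status tells you the outcome of each individual constraint, while check_status is the same for every constraint belonging to one Check.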
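To make the rule concrete, here is a minimal, self-contained sketch of the same decision logic. It does not use Deequ itself: the three enumerations are stand-ins I defined to mirror the names in the snippet above, and the object name CheckStatusSketch and the function checkStatus are hypothetical, chosen only for illustration.

```scala
object CheckStatusSketch {
  // Stand-in types mirroring the Deequ names quoted above.
  // They are illustrative only and are NOT the library's own classes.
  object ConstraintStatus extends Enumeration { val Success, Failure = Value }
  object CheckLevel extends Enumeration { val Error, Warning = Value }
  object CheckStatus extends Enumeration { val Success, Warning, Error = Value }

  // Same rule as the quoted code: a single failed constraint is enough to
  // set the status of the whole check to the check's level.
  def checkStatus(
      constraintStatuses: Seq[ConstraintStatus.Value],
      level: CheckLevel.Value): CheckStatus.Value = {
    val anyFailures = constraintStatuses.contains(ConstraintStatus.Failure)
    (anyFailures, level) match {
      case (true, CheckLevel.Error)   => CheckStatus.Error
      case (true, CheckLevel.Warning) => CheckStatus.Warning
      case _                          => CheckStatus.Success
    }
  }

  def main(args: Array[String]): Unit = {
    // One failing constraint among three, as with the uniqueness row above.
    val mixed = Seq(
      ConstraintStatus.Success,
      ConstraintStatus.Failure,
      ConstraintStatus.Success)
    val allGood = Seq(ConstraintStatus.Success, ConstraintStatus.Success)

    println(checkStatus(mixed, CheckLevel.Error))
    println(checkStatus(mixed, CheckLevel.Warning))
    println(checkStatus(allGood, CheckLevel.Error))
  }
}
```

With one failing constraint, the result follows the level the Check was created with; with no failures, the level is ignored and the result is Success. This is why a row can show constraint_status Success next to check_status Error: a different constraint in the same Check failed.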