I ran a simple Deequ check in Spark that went something like this:
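import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel}

val verificationResult: VerificationResult = {
  VerificationSuite()
    .onData(dataset)
    .addCheck(
      // One check at Error level, holding four constraints
      Check(CheckLevel.Error, "Review Check")
        .isComplete("col1")
        .isUnique("col2")
        .hasSize(_ == count_date)
        .satisfies(
          "abs(col4 - col5) <= 0.20 * col5",
          "value(col4) lies between value(col5)-20% and value(col5)+20%"
        )
    )
    .run()
}

// One row per constraint, with the check-level status repeated on each row
val result1 = checkResultsAsDataFrame(spark, verificationResult)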
Now, my result1 dataframe looks something like this:
+------------+-----------+------------+--------------------+-----------------+--------------------+
| check|check_level|check_status| constraint|constraint_status| constraint_message|
+------------+-----------+------------+--------------------+-----------------+--------------------+
|Review Check| Error| Error|CompletenessConst...| Success| |
|Review Check| Error| Error|UniquenessConstra...| Failure|Value: 7.62664794...|
|Review Check| Error| Error|SizeConstraint(Si...| Success| |
|Review Check| Error| Success|ComplianceConstra...| Success| |
+------------+-----------+------------+--------------------+-----------------+--------------------+
I'm confused about the columns check_status and constraint_status. How are they different? The results of my checks should be in the latter, right? Then what does the former imply?
I couldn't find any clarity on this in the deequ blog either. Could someone please explain?
check_status is the overall status for the Check group you run. It depends on the CheckLevel and the statuses of the individual constraints. If you look at the code:
val anyFailures = constraintResults.exists { _.status == ConstraintStatus.Failure }
val checkStatus = (anyFailures, level) match {
  case (true, CheckLevel.Error) => CheckStatus.Error
  case (true, CheckLevel.Warning) => CheckStatus.Warning
  case (_, _) => CheckStatus.Success
}
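If any constraint in the check fails, check_status takes the value of the check's CheckLevel (Error or Warning). Otherwise it is Success. constraint_status, on the other hand, is the result of each individual constraint.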
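To make the rule concrete, here is a minimal standalone sketch of the same logic that runs without Spark or Deequ. The names CheckStatusSketch, Level and checkStatus are hypothetical stand-ins for illustration, not Deequ's actual classes, and statuses are modelled as plain strings and booleans.

```scala
// Simplified stand-ins for illustration only; these are NOT Deequ's classes.
object CheckStatusSketch {
  sealed trait Level
  case object Error extends Level
  case object Warning extends Level

  // Derive the check-level status from the individual constraint outcomes.
  // `constraintPassed` holds one Boolean per constraint (true = Success).
  def checkStatus(level: Level, constraintPassed: Seq[Boolean]): String = {
    val anyFailures = constraintPassed.contains(false)
    (anyFailures, level) match {
      case (true, Error)   => "Error"   // a failure in an Error-level check
      case (true, Warning) => "Warning" // a failure in a Warning-level check
      case _               => "Success" // no constraint failed
    }
  }
}
```

With four constraints where only the second one fails (as with the uniqueness constraint in the question), CheckStatusSketch.checkStatus(CheckStatusSketch.Error, Seq(true, false, true, true)) gives "Error", while the same input at Warning level gives "Warning". So a single failed constraint is enough to mark the whole check, even though the other constraints individually report Success.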