I followed https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/ and got the checks and verification running.
But I am not able to find out on which rows exactly my data is failing. That part is very important: I need the rows that failed the check.
I tried following https://github.com/awslabs/deequ/blob/master/src/test/scala/com/amazon/deequ/schema/RowLevelSchemaValidatorTest.scala, but I am getting errors in Databricks when running the code from this link:
error: object SparkContextSpec is not a member of package com.amazon.deequ
import com.amazon.deequ.SparkContextSpec
^
command-4342528364312961:24: error: not found: type SparkContextSpec
class RowLevelSchemaValidatorTest extends WordSpec with SparkContextSpec {
^
command-4342528364312961:28: error: not found: value withSparkSession
"correctly enforce null constraints" in withSparkSession { sparkSession =>
^
command-4342528364312961:39: error: not found: value RowLevelSchema
val schema = RowLevelSchema()
^
command-4342528364312961:40: error: not found: value isNullable
.withIntColumn("id", isNullable = false)
And the list of errors goes on.
Please help.
Thanks
The problems you are encountering are likely due to an incorrect project setup. `SparkContextSpec` and `withSparkSession` are test utilities that live under the repository's test sources (src/test/scala), so they are not part of the published deequ jar, which is why Databricks cannot resolve them. Are you running the tests from your IDE? If not, I would recommend making sure the code compiles in an IDE such as IntelliJ; the unit tests should then be executable from there.
IntelliJ comes with a Maven plugin that lets you import the project from its pom.xml.
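If the goal is just to see which rows fail, note that `RowLevelSchema` and `RowLevelSchemaValidator` themselves live in the main sources (com.amazon.deequ.schema) and ship with the deequ artifact, so you should be able to call them directly from a notebook without the test harness. Below is a minimal sketch, assuming the deequ jar is attached to your cluster and that `spark` is the notebook's SparkSession; the sample DataFrame, its column names, and its values are made up for illustration.

import com.amazon.deequ.schema.{RowLevelSchema, RowLevelSchemaValidator}
import spark.implicits._

// Toy input standing in for your data; values are hypothetical and
// kept as strings, as they would be when read from a raw CSV.
val data = Seq(
  ("1", "Alice"),
  (null, "Bob"),             // null id violates isNullable = false
  ("not-a-number", "Carol")  // cannot be cast to an Int
).toDF("id", "name")

// Same kind of schema definition as in the linked test
val schema = RowLevelSchema()
  .withIntColumn("id", isNullable = false)
  .withStringColumn("name")

val result = RowLevelSchemaValidator.validate(data, schema)

// The rows that failed the schema checks
result.invalidRows.show()
println(s"${result.numInvalidRows} rows failed, ${result.numValidRows} passed")

result.invalidRows and result.validRows are ordinary DataFrames, so the failing rows can be displayed, written out, or joined back against the source data as needed.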