I'm trying to run the sample code for the pattern check "hasPattern()" with PyDeequ, and it fails with an exception.
The code:
import pydeequ
from pyspark.sql import SparkSession, Row
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.sparkContext.parallelize([
    Row(a="foo", creditCard="5130566665286573", email="foo@example.com", ssn="123-45-6789",
        URL="http://userid@example.com:8080"),
    Row(a="bar", creditCard="4532677117740914", email="bar@example.com", ssn="123456789",
        URL="http://example.com/(something)?after=parens"),
    Row(a="baz", creditCard="3401453245217421", email="foobar@baz.com", ssn="000-00-0000",
        URL="http://userid@example.com:8080")]).toDF()
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Error, "Integrity checks")
checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.hasPattern(column='email',
                         pattern=r".*@baz.com",
                         assertion=lambda x: x == 1 / 3)) \
    .run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
After running it, I receive:
AttributeError: 'NoneType' object has no attribute '_Check'
on this line:
check.hasPattern(column='email',
                 pattern=r".*@baz.com",
                 assertion=lambda x: x == 1 / 3)
PyDeequ version: 1.0.1; Python version: 3.7.9
At the moment, the code in the pydeequ repository doesn't actually implement hasPattern. The method has a docstring that describes the intended behavior, but no accompanying code to do the actual work.
Since the docstring is the entire body, calling the method always returns None
(the default return value for Python functions). That None is then handed to addCheck(), which tries to access ._Check on it and raises the AttributeError shown above.
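You can see the same failure mode with a trivial stub:

def stub():
    """A docstring with no body: calling this returns None."""

print(stub())  # prints: None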
The expected behavior for the check methods in pydeequ is to return the check
object itself (the self parameter), so that the user can chain multiple checks in a fluent sequence, as shown below.
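For example, with methods that are fully implemented, the chaining looks like this (isComplete and isUnique are existing pydeequ check methods):

check = Check(spark, CheckLevel.Error, "Integrity checks")

# Each call mutates the underlying JVM check and returns self,
# so the calls can be strung together.
check.isComplete("email") \
     .isUnique("creditCard") \
     .containsCreditCardNumber("creditCard", lambda x: x == 1.0)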
For comparison, here is the hasPattern method (which contains only the docstring) next to the containsCreditCardNumber
method, which appears to be fully implemented.
def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    """
    Checks for pattern compliance. Given a column name and a regular expression, defines a
    Check on the average compliance of the column's values to the regular expression.
    :param str column: Column in DataFrame to be checked
    :param Regex pattern: A name that summarizes the current check and the
        metrics for the analysis being done.
    :param lambda assertion: A function with an int or float parameter.
    :param str name: A name for the pattern constraint.
    :param str hint: A hint that states why a constraint could have failed.
    :return: hasPattern self: A Check object that runs the condition on the column.
    """
def containsCreditCardNumber(self, column, assertion=None, hint=None):
    """
    Check to run against the compliance of a column against a Credit Card pattern.
    :param str column: Column in DataFrame to be checked. The column is expected to be a string type.
    :param lambda assertion: A function with an int or float parameter.
    :param hint hint: A hint that states why a constraint could have failed.
    :return: containsCreditCardNumber self: A Check object that runs the compliance on the column.
    """
    assertion = (
        ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
        if assertion
        else getattr(self._Check, "containsCreditCardNumber$default$2")()
    )
    hint = self._jvm.scala.Option.apply(hint)
    self._Check = self._Check.containsCreditCardNumber(column, assertion, hint)
    return self
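Following that template, a working hasPattern would presumably look something like the sketch below. This is untested and modeled directly on containsCreditCardNumber: in particular, the Scala default-argument accessor name (hasPattern$default$3, assuming assertion is the third parameter of Deequ's Scala method) and the wrapping of the Python pattern string in a scala.util.matching.Regex via the py4j gateway are my assumptions about Deequ's JVM API, not confirmed pydeequ code.

def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
    assertion = (
        ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
        if assertion
        # Assumption: assertion is the third parameter of the Scala method,
        # so its generated default accessor would be hasPattern$default$3.
        else getattr(self._Check, "hasPattern$default$3")()
    )
    name = self._jvm.scala.Option.apply(name)
    hint = self._jvm.scala.Option.apply(hint)
    # Assumption: the Scala side expects a scala.util.matching.Regex, so the
    # Python string is wrapped before crossing the py4j bridge (the second
    # constructor argument is the group-name varargs, passed as null here).
    pattern_regex = self._jvm.scala.util.matching.Regex(pattern, None)
    self._Check = self._Check.hasPattern(column, pattern_regex, assertion, name, hint)
    return self

Until a fix is released upstream, monkey-patching something like this onto pydeequ's Check class should unblock the sample code.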