
PyDeequ hasPattern fails with 'PatternMatch' object has no attribute '_Check'


I'm trying to run the sample code for the pattern check hasPattern() with PyDeequ, and it fails with an exception.

The code:

import pydeequ

from pyspark.sql import SparkSession, Row

spark = (SparkSession
         .builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.sparkContext.parallelize([
    Row(a="foo", creditCard="5130566665286573", email="foo@example.com", ssn="123-45-6789",
        URL="http://userid@example.com:8080"),
    Row(a="bar", creditCard="4532677117740914", email="bar@example.com", ssn="123456789",
        URL="http://example.com/(something)?after=parens"),
    Row(a="baz", creditCard="3401453245217421", email="foobar@baz.com", ssn="000-00-0000",
        URL="http://userid@example.com:8080")]).toDF()

from pydeequ.checks import *
from pydeequ.verification import *

check = Check(spark, CheckLevel.Error, "Integrity checks")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
    check.hasPattern(column='email',
                     pattern=r".*@baz.com",
                     assertion=lambda x: x == 1 / 3)) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

After running it, I receive:

AttributeError: 'NoneType' object has no attribute '_Check'

on the line:

    check.hasPattern(column='email',
                     pattern=r".*@baz.com",
                     assertion=lambda x: x == 1 / 3)

PyDeequ version: 1.0.1
Python version: 3.7.9


Solution

  • As of this writing, the hasPattern method in the pydeequ repository appears to be a stub: it has a docstring describing the intended behavior, but no accompanying code that performs the check.

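    With no body beyond the docstring, the method always returns None, which is what any Python function returns when it has no explicit return statement. That None is then passed on to addCheck, which is presumably where the AttributeError about '_Check' is raised.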

    The check methods in pydeequ are expected to return the Check object itself (the self parameter), which lets the user chain multiple checks in sequence.

    For comparison, here is the hasPattern method, which contains only a docstring, followed by the containsCreditCardNumber method, which appears to be fully implemented.

    hasPattern

    def hasPattern(self, column, pattern, assertion=None, name=None, hint=None):
        """
        Checks for pattern compliance. Given a column name and a regular expression, defines a
        Check on the average compliance of the column's values to the regular expression.
        :param str column: Column in DataFrame to be checked
        :param Regex pattern: A name that summarizes the current check and the
                metrics for the analysis being done.
        :param lambda assertion: A function with an int or float parameter.
        :param str name: A name for the pattern constraint.
        :param str hint: A hint that states why a constraint could have failed.
        :return: hasPattern self: A Check object that runs the condition on the column.
        """
    

    containsCreditCardNumber

    def containsCreditCardNumber(self, column, assertion=None, hint=None):
        """
        Check to run against the compliance of a column against a Credit Card pattern.
        :param str column: Column in DataFrame to be checked. The column is expected to be a string type.
        :param lambda assertion: A function with an int or float parameter.
        :param hint hint: A hint that states why a constraint could have failed.
        :return: containsCreditCardNumber self: A Check object that runs the compliance on the column.
        """
        assertion = (
            ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
            if assertion
            else getattr(self._Check, "containsCreditCardNumber$default$2")()
        )
        hint = self._jvm.scala.Option.apply(hint)
        self._Check = self._Check.containsCreditCardNumber(column, assertion, hint)
        return self
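
The difference between the two methods can be reproduced without Spark. The following is only a minimal sketch: SketchCheck, has_pattern_stub, contains_credit_card_number and add_check are hypothetical stand-ins, not part of pydeequ, and add_check merely assumes that the real builder reads the _Check attribute of whatever it is given. It shows that a docstring-only method returns None, that passing this None along produces the same kind of AttributeError as in the question, and that a method ending in return self can be chained.

```python
class SketchCheck:
    """Hypothetical stand-in for pydeequ's Check; no Spark or JVM involved."""

    def __init__(self):
        # Plays the role of the wrapped JVM object (self._Check in pydeequ).
        self._Check = []

    def has_pattern_stub(self, column, pattern):
        """Docstring only, no body: Python implicitly returns None."""

    def contains_credit_card_number(self, column):
        # Record the constraint, then return self so calls can be chained.
        self._Check = self._Check + [("creditCard", column)]
        return self


def add_check(check):
    # Assumption: like the builder in the question, this reads check._Check.
    return check._Check


# The stub returns None instead of a check object.
stub_result = SketchCheck().has_pattern_stub("email", r".*@baz\.com")

# The implemented method returns self, so a second call can follow the first.
chained = (SketchCheck()
           .contains_credit_card_number("creditCard")
           .contains_credit_card_number("backupCard"))

try:
    add_check(stub_result)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(stub_result)
print(error_message)
print(add_check(chained))
```

If the sketch reflects what happens in pydeequ 1.0.1, the failure comes from hasPattern returning None, not from the regular expression or the assertion passed to it.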