
Alteryx regex_countmatches equivalent in PySpark?


I am working on migrating an Alteryx workflow to PySpark and, as part of that task, came across the following filter condition.

length([acc_id]) = 9
AND 
(REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 AND 
REGEX_CountMatches(left([acc_id],2),"[[:alpha:]]")=2)
OR
(REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 AND 
REGEX_CountMatches(left([acc_id],1),"[[:alpha:]]")=1 AND 
REGEX_CountMatches(right(left([acc_id],2),1), '9')=1 
)

Can someone help me rewrite this condition for a PySpark DataFrame?


Solution

  • You can use length with regexp_replace to get the equivalent of Alteryx's REGEX_CountMatches function:

    REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 
    

    Becomes:

    # remove every character that is NOT alphabetic, then count what remains
    F.length(F.regexp_replace(F.expr("right(acc_id, 7)"), '[^A-Za-z]', '')) == 0
    

    The right and left functions are only available on the SQL side; you can call them from the DataFrame API with expr.
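
    If you'd rather avoid expr, substring also accepts a negative start position, so the same slices can be built directly with DataFrame functions. A minimal sketch (the left_2 / right_7 names are just illustrative):

    from pyspark.sql import functions as F

    # left(acc_id, 2): first two characters (substring is 1-based)
    left_2 = F.substring("acc_id", 1, 2)
    # right(acc_id, 7): a negative start counts from the end of the string;
    # equivalent to right() here because the filter only keeps 9-character ids
    right_7 = F.substring("acc_id", -7, 7)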

    Full example:

    from pyspark.sql import Column  # needed for the type hints below
    from pyspark.sql import functions as F
    
    
    df = spark.createDataFrame([("AB1234567",), ("AD234XG1234TT5",)], ["acc_id"])
    
    def regex_count_matches(c: Column, regex: str) -> Column:
        """
        helper function equivalent to REGEX_CountMatches
        """
        return F.length(F.regexp_replace(c, regex, ''))
    
    
    # & binds tighter than |, so this reads as (length check AND group 1) OR group 2,
    # matching AND/OR precedence in the original Alteryx condition
    df.filter(
        (F.length("acc_id") == 9) &
        (
          # last 7 characters contain no letters and the first 2 are both letters
          (regex_count_matches(F.expr("right(acc_id, 7)"), '[^A-Za-z]') == 0)
          & (regex_count_matches(F.expr("left(acc_id, 2)"), '[^A-Za-z]') == 2)
        ) | (
          # last 7 characters contain no letters, the first is a letter and the second is '9'
          (regex_count_matches(F.expr("right(acc_id, 7)"), '[^A-Za-z]') == 0)
          & (regex_count_matches(F.expr("left(acc_id, 1)"), '[^A-Za-z]') == 1)
          & (regex_count_matches(F.expr("right(left(acc_id, 2), 1)"), '[^9]') == 1)
        )
    ).show()
    
    #+---------+
    #|   acc_id|
    #+---------+
    #|AB1234567|
    #+---------+
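
    On Spark 3.5 or later there is also a built-in regexp_count function, which counts pattern matches directly, so the regexp_replace + length workaround isn't needed. A sketch of the same filter under that assumption (Java regex has no POSIX classes, so [A-Za-z] stands in for [[:alpha:]]):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("AB1234567",), ("AD234XG1234TT5",)], ["acc_id"])

    # regexp_count expects the pattern as a Column, hence F.lit
    alpha = F.lit(r"[A-Za-z]")

    df.filter(
        (F.length("acc_id") == 9) &
        (
          (F.regexp_count(F.expr("right(acc_id, 7)"), alpha) == 0)
          & (F.regexp_count(F.expr("left(acc_id, 2)"), alpha) == 2)
        ) | (
          (F.regexp_count(F.expr("right(acc_id, 7)"), alpha) == 0)
          & (F.regexp_count(F.expr("left(acc_id, 1)"), alpha) == 1)
          & (F.regexp_count(F.expr("right(left(acc_id, 2), 1)"), F.lit("9")) == 1)
        )
    ).show()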