Tags: dataframe, apache-spark, pyspark, contains, regex-replace

How to use regexp_replace with contains?


I use this function to check whether a pattern occurs in the column and, if so, replace it with the corresponding replacement, but it does not work as expected.

Can someone tell me where I am making a mistake?

patterns = [
    '15/19',
    '14/11',
    'HTP',
    'VTP'
]

replacements = [
    'S15/19',
    'S11/14',
    'HTP',
    'VTP'
]
def formate_column(output_column, df, patterns, replacements):
    for p, r in zip(patterns, replacements):
        df = df.withColumn(
            output_column,
            F.when(
                F.col("column").contains(p),
                F.regexp_replace(F.col("column"), p, r)
            ).otherwise(F.col("column"))
        )
    return df

Solution

  • You could merge both lists into one dictionary; then the loop becomes simple and efficient. You also don't need when/otherwise: regexp_replace alone is enough, because if the pattern is not found, nothing in the string is replaced.

    Input:

    from pyspark.sql import functions as F
    df = spark.createDataFrame([("xx_15/19_",), ("_14/11111",), ("no",)], ["column"])
    df.show()
    # +---------+
    # |   column|
    # +---------+
    # |xx_15/19_|
    # |_14/11111|
    # |       no|
    # +---------+
    

    Script:

    patterns = {
        r'15/19': 'S15/19',
        r'14/11': 'S11/14',
        r'HTP': 'HTP',
        r'VTP': 'VTP',
    }
    
    for k, v in patterns.items():
        df = df.withColumn("column", F.regexp_replace("column", k, v))
    df.show()
    # +----------+
    # |    column|
    # +----------+
    # |xx_S15/19_|
    # |_S11/14111|
    # |        no|
    # +----------+
    
    df.explain()
    # == Physical Plan ==
    # *(1) Project [regexp_replace(regexp_replace(regexp_replace(regexp_replace(column#619, 15/19, S15/19, 1), 14/11, S11/14, 1), HTP, HTP, 1), VTP, VTP, 1) AS column#632]
    # +- *(1) Scan ExistingRDD[column#619]
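
    The "no match, no change" behavior that makes when/otherwise unnecessary can be sanity-checked locally with Python's re module, whose re.sub has the same semantics as Spark's regexp_replace for these patterns. This is only a local stand-in, not Spark code; note that both treat the pattern as a regex, so patterns containing metacharacters would need escaping (the patterns here contain none).

    ```python
    import re

    # Hypothetical helper mirroring the chained-replace loop from the answer.
    patterns = {
        r'15/19': 'S15/19',
        r'14/11': 'S11/14',
    }

    def apply_patterns(s, patterns):
        # Apply each replacement in turn; a pattern that does not match
        # leaves the string untouched, just like regexp_replace.
        for k, v in patterns.items():
            s = re.sub(k, v, s)
        return s

    print(apply_patterns("xx_15/19_", patterns))  # xx_S15/19_
    print(apply_patterns("no", patterns))         # no
    ```

    The same fold happens on the Spark side: the loop stacks one regexp_replace per dictionary entry, which the physical plan above shows collapsed into a single Project.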