pythonregexpysparkexpr

Replace parentheses in pyspark with replace_regex


+---+------------+
|  A|           B|
+---+------------+
| x1|        [s1]|
| x2|   [s2 (A2)]|
| x3|   [s3 (A3)]|
| x4|   [s4 (A4)]|
| x5|   [s5 (A5)]|
| x6|   [s6 (A6)]|
+---+------------+

The desired result:

+---+------------+-------+
|A  |B           |value  |
+---+------------+-------+
|x1 |[s1]        |[s1]   |
|x2 |[s2 (A2)]   |[s2]   |
|x3 |[s3 (A3)]   |[s3]   |
|x4 |[s4 (A4)]   |[s4]   |
|x5 |[s5 (A5)]   |[s5]   |
|x6 |[s6 (A6)]   |[s6]   |
+---+------------+-------+

When I applied each of the codes below, the parentheses and the whitespace before them were not replaced:

from pyspark.sql.functions import expr
df.withColumn("C",
               expr('''transform(B, x-> regexp_replace(x, ' \\(A.\\)', ''))''')).show(truncate=False)

or

df.withColumn("C",
               expr('''transform(B, x-> regexp_replace(x, ' \(A.\)', ''))''')).show(truncate=False)

The obtained result:

+---+------------+------------+
|A  |B           |value       |
+---+------------+------------+
|x1 |[s1]        |[s1]        |
|x2 |[s2 (A2)]   |[s2 ()]     |
|x3 |[s3 (A3)]   |[s3 ()]     |
|x4 |[s4 (A4)]   |[s4 ()]     |
|x5 |[s5 (A5)]   |[s5 ()]     |
|x6 |[s6 (A6)]   |[s6 ()]     |
+---+------------+------------+

Solution

  • You can split the array value and get only the first index from the array.

    Example:

    df.show()
    #+---+---------+
    #|  A|        B|
    #+---+---------+
    #| x1|     [s1]|
    #| x2|[s2 (A2)]|
    #+---+---------+
    
    df.printSchema()
    #root
    # |-- A: string (nullable = true)
    # |-- B: array (nullable = true)
    # |    |-- element: string (containsNull = true)
    
    df.withColumn("C",expr('''transform(B,x -> split(x,"\\\s+")[0])''')).show()
    
    #using regexp_replace function
    df.withColumn("C",expr('''transform(B,x -> regexp_replace(x,"(\\\s+.*)",""))''')).show()
    df.withColumn("C",expr('''transform(B,x -> regexp_replace(x,"(\\\s+\\\((?i)A.+\\\))",""))''')).show()
    #+---+---------+----+
    #|  A|        B|   C|
    #+---+---------+----+
    #| x1|     [s1]|[s1]|
    #| x2|[s2 (A2)]|[s2]|
    #+---+---------+----+