python, dataframe, apache-spark, pyspark

Pyspark filter on array of structs


I have a PySpark dataframe with the following schema:

root
 |-- channelGrouping: string (nullable = false)
 |-- clientId: string (nullable = false)
 |-- customDimensions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- index: long (nullable = true)
 |    |    |-- value: string (nullable = true)

I want to fetch all records where the customDimensions element with index 2 has a value matching the regular expression r'b'.

Sample input:

+--------+---------------+------------------------+
|clientId|channelGrouping|customDimensions        |
+--------+---------------+------------------------+
|123     |alpha          |[[1, a], [2, b]]        |
|456     |beta           |[[1, a], [2, b], [3, c]]|
|789     |gama           |[[1, a], [2, a], [3, c]]|
+--------+---------------+------------------------+

Expected output:

+--------+---------------+------------------------+
|clientId|channelGrouping|customDimensions        |
+--------+---------------+------------------------+
|123     |alpha          |[[1, a], [2, b]]        |
|456     |beta           |[[1, a], [2, b], [3, c]]|
+--------+---------------+------------------------+

In BigQuery, this can be achieved with the following query:

SELECT * FROM some_table WHERE REGEXP_CONTAINS((SELECT value FROM UNNEST(customDimensions) WHERE index = 2), r'b')
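
A near-literal PySpark translation of that pattern uses the filter higher-order function (Spark 2.4+). This is a rough sketch, assuming df is the dataframe above: filter() plays the role of UNNEST ... WHERE index = 2, and [0].value extracts the matched element's value (NULL when no element has index 2, just like the BigQuery subquery):

    from pyspark.sql.functions import expr

    # Keep the first element whose index is 2, take its value, and regex-match it.
    df1 = df.where(expr("filter(customDimensions, cd -> cd.index = 2)[0].value").rlike('b'))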


Solution

  • I managed to write a solution:

    from pyspark.sql.functions import col, element_at, map_from_entries

    df = spark.read.json("some_path")

    # map_from_entries turns each [(index, value), ...] array into a map keyed
    # by index; element_at(..., 2) looks up the value stored under index 2
    # (NULL if absent), and rlike('b') applies the regex. Requires Spark 2.4+.
    df1 = df.where(element_at(map_from_entries(col("customDimensions")), 2).rlike('b'))
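
As a sanity check, the following self-contained sketch rebuilds the sample data and applies the same filter (the SparkSession setup and the inline DDL schema string are illustrative; requires Spark 2.4+). Rows whose customDimensions lack an index-2 entry yield NULL from element_at, so rlike returns NULL and the row is dropped, matching the BigQuery semantics:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, element_at, map_from_entries

    spark = SparkSession.builder.getOrCreate()

    # Rebuild the sample input from the question.
    data = [
        ("123", "alpha", [(1, "a"), (2, "b")]),
        ("456", "beta",  [(1, "a"), (2, "b"), (3, "c")]),
        ("789", "gama",  [(1, "a"), (2, "a"), (3, "c")]),
    ]
    schema = ("clientId string, channelGrouping string, "
              "customDimensions array<struct<index:bigint,value:string>>")
    df = spark.createDataFrame(data, schema)

    # Keep rows where the value stored under index 2 matches the regex 'b'.
    df1 = df.where(element_at(map_from_entries(col("customDimensions")), 2).rlike("b"))
    df1.show(truncate=False)  # clientIds 123 and 456 remain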