pyspark, apache-spark-sql

Combine array of maps into single map in pyspark dataframe


Is there a function similar to collect_list or collect_set that aggregates a column of maps into a single map in a (grouped) pyspark dataframe? For example, this function might have the following behavior:

>>> df.show()

+--+---------------------------------+
|id|                             map |
+--+---------------------------------+
| 1|                    Map(k1 -> v1)|
| 1|                    Map(k2 -> v2)|
| 1|                    Map(k3 -> v3)|
| 2|                    Map(k5 -> v5)|
| 3|                    Map(k6 -> v6)|
| 3|                    Map(k7 -> v7)|
+--+---------------------------------+

>>> df.groupBy('id').agg(collect_map('map')).show()

+--+----------------------------------+
|id|                 collect_map(map) |
+--+----------------------------------+
| 1| Map(k1 -> v1, k2 -> v2, k3 -> v3)|
| 2|                     Map(k5 -> v5)|
| 3|           Map(k6 -> v6, k7 -> v7)|
+--+----------------------------------+

It probably wouldn't be too difficult to produce the desired result using one of the other collect_* aggregations and a UDF (a sketch of that route follows below), but it seems like something like this should already exist.
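For reference, a minimal sketch of that UDF route, where merge_maps is a hypothetical helper name rather than anything built in:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, {'k1': 'v1'}), (1, {'k2': 'v2'}), (1, {'k3': 'v3'}),
     (2, {'k5': 'v5'}), (3, {'k6': 'v6'}), (3, {'k7': 'v7'})],
    ['id', 'map'],
)

# Hypothetical merge: later entries win when keys collide.
@F.udf(MapType(StringType(), StringType()))
def merge_maps(maps):
    out = {}
    for m in maps:
        out.update(m)
    return out

df.groupBy('id').agg(merge_maps(F.collect_list('map')).alias('map')).show(truncate=False)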


Solution

  • It's map_concat, available in PySpark since version 2.4.
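Note that map_concat merges map columns within a single row, so getting the grouped behavior asked for above still takes a collect. A sketch of two options, reusing the df built in the question (the version notes are assumptions about when each API became available):

from pyspark.sql import functions as F

# Spark >= 2.4: turn each map into an array of (key, value) structs,
# collect and flatten those arrays per group, then rebuild one map.
result = df.groupBy('id').agg(
    F.map_from_entries(
        F.flatten(F.collect_list(F.map_entries('map')))
    ).alias('map')
)
result.show(truncate=False)

# Spark >= 3.1: fold the collected maps with map_concat itself,
# via the higher-order aggregate function.
result = df.groupBy('id').agg(
    F.aggregate(
        F.collect_list('map'),
        F.create_map().cast('map<string,string>'),  # empty accumulator map
        lambda acc, m: F.map_concat(acc, m),
    ).alias('map')
)

If the same key can appear in more than one row of a group, Spark 3.x raises an error under the default spark.sql.mapKeyDedupPolicy=EXCEPTION; setting it to LAST_WIN keeps the last value instead.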