I have a Dataframe with different columns where one of the columns is an array of structs:
+----------+---------+--------------------------------------+
|id |title | values|
+----------+---------+--------------------------------------+
| 1 | aaa | [{name1, id1}, {name2, id2},...]|
| 2 | bbb | [{name11, id11}, {name22, id22},...]|
My df schema for this column looks like this:
|-- values: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- id: long (nullable = true)
I want to extract each value from this array column like this:
+----------+---------+--------------+
|id |title |name | id|
+----------+---------+--------------+
| 1 | aaa | name1 | id1 |
| 1 | aaa | name2 | id2 |
| 2 | bbb | name11| id11 |
| 2 | bbb | name22| id22 |
I figured out how to extract the single item of the array:
df = df.withColumn("name", df["values"].getItem(0).name)\
.withColumn("_id", df["id"].getItem(0).id)
But I don't know the way how to apply it for the whole length of array. I probably should do something like:
for index in range(len(df.values)):
df = df.withColumn("name", df["values"].getItem(index).name)\
.withColumn("_id", df["id"].getItem(index).id)\
Just explode
and select
from pyspark.sql.functions import col, explode
df.withColumn("values", explode("values")).select(
"*", col("values")["name"].alias("name"), col("values")["id"].alias("id")
)