I have the following data in a dataframe df:
{
  "data": [
    {"id": "a", "val": 1},
    {"id": "b", "val": 2}
  ]
}
I would now like to add a new column "test" for the id 'b', containing its value 2.
I know I can do this with:
(
    df
    .withColumn(
        "test",
        F.expr("filter(data, x -> x.id == 'b')")[0]["val"]
    )
    .show()
)
Yielding the desired:
+----------------+----+
| data |test|
+----------------+----+
|[{a, 1}, {b, 2}]| 2 |
+----------------+----+
Could this be achieved in a more "native" way (i.e. without an SQL expression string)? I know that F.col("data")[1]["val"]
works if I go by index rather than by id, for example.
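For intuition, the higher-order filter above keeps the array elements whose id matches, then reads val from the first match; in plain Python terms (a sketch of the semantics only, not Spark code):

```python
# Plain-Python analogue of filter(data, x -> x.id == 'b')[0]['val'].
# The row's "data" column is modeled as a list of dicts.
data = [{"id": "a", "val": 1}, {"id": "b", "val": 2}]

# Keep only elements whose id is 'b', then take the first match's val.
matches = [x for x in data if x["id"] == "b"]
test = matches[0]["val"] if matches else None  # None mirrors Spark's null when nothing matches

print(test)  # 2
```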
You can use filter from the PySpark functions API (available since Spark 3.1), which takes a column and a Python lambda instead of an SQL expression string:

df.withColumn(
    "test",
    F.filter("data", lambda x: x.id == "b")[0].val
).show()