python, apache-spark, pyspark

Add column by accessing item in Array based on ID without SQL expression


I have the following data in a DataFrame df:

{
    "data":[
        {
            "id":"a",
            "val":1
        },
        {
            "id":"b",
            "val":2
        }
    ]
}
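
For reference, a DataFrame with this shape can be built roughly like so (a minimal sketch; the single row and the array<struct<id, val>> schema are assumed from the JSON above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# one row whose "data" column is an array of (id, val) structs
df = spark.createDataFrame(
    [([("a", 1), ("b", 2)],)],
    "data: array<struct<id: string, val: int>>",
)
df.show()
# +----------------+
# |            data|
# +----------------+
# |[{a, 1}, {b, 2}]|
# +----------------+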

I would now like to add a new column "test" containing the val (2) of the entry with id 'b'.

I know I can do this with:

(
    df
    .withColumn(
        "test",
        F.expr("filter(data,x->x.id=='b')")[0]["val"]
    )
    .show()
)

Yielding the desired:

+----------------+----+
|            data|test|
+----------------+----+
|[{a, 1}, {b, 2}]|   2|
+----------------+----+

Could this be achieved in a more "native" way, i.e. without an SQL expression? I know that, for example, F.col("data")[1]["val"] works if I go by index rather than by id.


Solution

  • You can use filter from the PySpark API (pyspark.sql.functions.filter, available since Spark 3.1):

    df.withColumn(
        "test",
        F.filter("data", lambda x: x.id == "b")[0].val
    )
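
    Put together, a full call might look like this (a sketch assuming the df built from the JSON above and Spark 3.1+; note that F.filter returns an array column, so [0] would typically yield null when no element matches):

    from pyspark.sql import functions as F

    (
        df
        .withColumn(
            "test",
            # keep only the structs whose id is 'b', then take the first match's val
            F.filter("data", lambda x: x.id == "b")[0]["val"]
        )
        .show()
    )
    # +----------------+----+
    # |            data|test|
    # +----------------+----+
    # |[{a, 1}, {b, 2}]|   2|
    # +----------------+----+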