apache-spark pyspark

PySpark DataFrame transform


I have a DataFrame as follows:

f1     |f2
=========
test   | [{"f3": 1, "f4": "f4_1" }, {"f3": 2, "f4": "f4_2" }] 

f2 is a list of objects.

I want to get a DataFrame like the one below:

f3|f4    | temp_col
=========================
1 |"f4_1"| {"f1": "test"}
2 |"f4_2"| {"f1": "test"}

temp_col is a name I provide.

How do I do that with PySpark?

I tried converting to a pandas DataFrame and using json_normalize, but it didn't work.


Solution

  • If you have already loaded your JSON into a Spark DataFrame, here is one way to do it:

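    from pyspark.sql.functions import col, explode, struct

    # explode() emits one row per element of the f2 array;
    # struct() wraps f1 in a single-field struct aliased as temp_col.
    result_df = df.withColumn("f2", explode(df.f2)).select(
        "f2.f3",
        "f2.f4",
        struct(col("f1")).alias("temp_col"),
    )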
    

    Output:

    f3  f4    temp_col
    1   f4_1  {"f1":"test"}
    2   f4_2  {"f1":"test"}