apache-spark, databricks

Combining JSON strings into one JSON array


I've got a dataframe that basically has a "key" column and a column that is a JSON string.

key jsonString
111 {"id" : "12345", "foo" : "stuff"}
111 {"id" : "23456", "bar" : "other stuff"}
111 {"id" : "34567", "baz" : "even other stuff"}

For each "key" value, I want to combine all the JSON strings into a JSON array, with a couple of other elements (with the ultimate goal of publishing to a Kafka topic). So I would end up with this:

{
    "type" : "combined",
    "values" :
    [
        {"id" : "12345", "foo" : "stuff"},
        {"id" : "23456", "bar" : "other stuff"},
        {"id" : "34567", "baz" : "even other stuff"}
    ]
}
    

I tried concatenating it all into a big string, but that went poorly. Is there a way to do this that doesn't involve having one giant schema for all my JSON strings? The real JSON is obviously much more complicated, and each string could follow one of four possible schemas.


Solution

  • One way to get the JSON strings into an array is to group by key and use the collect_list function (PySpark, Scala Spark). The collect_set function could also be used if you wanted to eliminate duplicates (a small variant is shown after the example output below).

    import org.apache.spark.sql.functions.collect_list

    df.show(false)
    val groupedJson = df.groupBy("key").agg(collect_list("jsonString").as("jsonArray"))
    groupedJson.show(false)
    
    key  jsonString
    1    {"id": 123, "foo": "stuff"}
    1    {"id": 456, "foo1": "stuff2"}
    1    {"id": 789, "foo2": "stuff3"}
    2    {"id": 000, "foo3": "stuff4"}

    key  jsonArray
    1    [{"id": 123, "foo": "stuff"}, {"id": 456, "foo1": "stuff2"}, {"id": 789, "foo2": "stuff3"}]
    2    [{"id": 000, "foo3": "stuff4"}]

    I'm not sure if you wanted to put the array back into a JSON object, but this would at least set you up to do that if needed. Here's another post that shows how you might do that: Create JSON column in Spark Scala. One rough way to do it without a full schema is sketched below.
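
    Since the collected strings are already valid JSON, the {"type" : "combined", "values" : [...]} envelope can be assembled as text rather than by parsing everything into one big schema. This is only a sketch, not a tested pipeline: the broker address and topic name are placeholders, and the Kafka write assumes the spark-sql-kafka connector is on the classpath.

    import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

    // Join the array elements with ", " and wrap them in the envelope text.
    // No schema is needed because the fragments are treated as opaque strings.
    val payload = groupedJson.select(
      col("key"),
      concat(
        lit("""{"type" : "combined", "values" : ["""),
        concat_ws(", ", col("jsonArray")),
        lit("]}")
      ).as("value")
    )

    // The Kafka sink expects string/binary "key" and "value" columns.
    payload
      .selectExpr("CAST(key AS STRING) AS key", "value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
      .option("topic", "combined-events")              // hypothetical topic name
      .save()

    If any of the source strings might not be valid JSON, parsing them with from_json against the four known schemas (as in the linked post) would be the safer route.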