apache-spark, databricks

Combining JSON strings into one JSON array


I've got a dataframe that basically has a "key" column and a column that is a JSON string.

key jsonString
111 {"id" : "12345", "foo" : "stuff"}
111 {"id" : "23456", "bar" : "other stuff"}
111 {"id" : "34567", "baz" : "even other stuff"}

For each "key" value, I want to combine all the JSON strings into a JSON array, with a couple of other elements (with the ultimate goal of publishing to a Kafka topic). So I would end up with this:

{
    "type" : "combined",
    "values" :
    [
        {"id" : "12345", "foo" : "stuff"},
        {"id" : "23456", "bar" : "other stuff"},
        {"id" : "34567", "baz" : "even other stuff"}
    ]
}
    

I tried concatenating it all into a big string, but that went poorly. Is there a way to do this that doesn't involve having one giant schema for all my JSON strings? The real JSON is obviously much more complicated, and each string could follow one of four possible schemas.


Solution

  • One way to get the JSON strings into an array is to group by key and use the collect_list function (PySpark, Scala Spark). The collect_set function could also be used if you wanted to eliminate duplicates (a small variant is shown after the example output below).

    import org.apache.spark.sql.functions.collect_list

    df.show(false)
    val groupedJson = df.groupBy("key").agg(collect_list("jsonString").as("jsonArray"))
    groupedJson.show(false)
    
    key  jsonString
    1    {"id": 123, "foo": "stuff"}
    1    {"id": 456, "foo1": "stuff2"}
    1    {"id": 789, "foo2": "stuff3"}
    2    {"id": 000, "foo3": "stuff4"}

    key  jsonArray
    1    [{"id": 123, "foo": "stuff"}, {"id": 456, "foo1": "stuff2"}, {"id": 789, "foo2": "stuff3"}]
    2    [{"id": 000, "foo3": "stuff4"}]

    I'm not sure if you wanted to put the array back into a JSON object, but this would at least set you up to do that if needed. Here's another post that shows how you might do that: Create JSON column in Spark Scala. One rough way to do it without a full schema is sketched below.
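
    Since the collected strings are already valid JSON, the {"type" : "combined", "values" : [...]} envelope can be assembled as text rather than by parsing everything into one big schema. This is only a sketch, not a tested pipeline: the broker address and topic name are placeholders, and the Kafka write assumes the spark-sql-kafka connector is on the classpath.

    import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

    // Join the array elements with ", " and wrap them in the envelope text.
    // No schema is needed because the fragments are treated as opaque strings.
    val payload = groupedJson.select(
      col("key"),
      concat(
        lit("""{"type" : "combined", "values" : ["""),
        concat_ws(", ", col("jsonArray")),
        lit("]}")
      ).as("value")
    )

    // The Kafka sink expects string/binary "key" and "value" columns.
    payload
      .selectExpr("CAST(key AS STRING) AS key", "value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
      .option("topic", "combined-events")              // hypothetical topic name
      .save()

    If any of the source strings might not be valid JSON, parsing them with from_json against the four known schemas (as in the linked post) would be the safer route.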