I've got a dataframe that basically has a "key" column and a column that is a JSON string.
key | jsonString |
---|---|
111 | {"id" : "12345", "foo" : "stuff"} |
111 | {"id" : "23456", "bar" : "other stuff"} |
111 | {"id" : "34567", "baz" : "even other stuff"} |
For each "key" value, I want to combine all the JSON strings into a JSON array, with a couple of other elements (with the ultimate goal of publishing to a Kafka topic). So I would end up with this:
{
"type" : "combined",
"values" :
[
{"id" : "12345", "foo" : "stuff"},
{"id" : "23456", "bar" : "other stuff"},
{"id" : "34567", "baz" : "even other stuff"}
]
}
I tried concatenating it all into one big string, but that went poorly. Is there a way to do this that doesn't involve having one giant schema for all my JSON strings? The real JSON is obviously much more complicated, and each string could follow one of four possible schemas.
One way to get the JSON strings into an array is to group by key and use the collect_list function (pyspark, scala spark). The collect_set function could also be used if you wanted to eliminate duplicates.
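To make the example below reproducible, the input DataFrame might be built along these lines (a minimal sketch; it assumes a SparkSession in scope named `spark`, and the keys and JSON strings are placeholders mirroring the tables further down):

```scala
import spark.implicits._

// Toy input: several rows share key 1, one row has key 2 (placeholder data)
val df = Seq(
  (1, """{"id": 123, "foo": "stuff"}"""),
  (1, """{"id": 456, "foo1": "stuff2"}"""),
  (1, """{"id": 789, "foo2": "stuff3"}"""),
  (2, """{"id": 000, "foo3": "stuff4"}""")
).toDF("key", "jsonString")
```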
import org.apache.spark.sql.functions.collect_list

df.show(false)
// Collect all jsonString values for each key into a single array column
val groupedJson = df.groupBy("key").agg(collect_list("jsonString").as("jsonArray"))
groupedJson.show(false)
Output of df.show(false):

key | jsonString |
---|---|
1 | {"id": 123, "foo": "stuff"} |
1 | {"id": 456, "foo1": "stuff2"} |
1 | {"id": 789, "foo2": "stuff3"} |
2 | {"id": 000, "foo3": "stuff4"} |
Output of groupedJson.show(false):

key | jsonArray |
---|---|
1 | [{"id": 123, "foo": "stuff"}, {"id": 456, "foo1": "stuff2"}, {"id": 789, "foo2": "stuff3"}] |
2 | [{"id": 000, "foo3": "stuff4"}] |
I'm not sure whether you wanted to wrap the array back up into a single JSON document, but this at least sets you up to do that if needed. Here's another post that shows how you might do that: Create JSON column in Spark Scala
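For example, here is a minimal sketch of how the final payload could be assembled without parsing the JSON strings at all (assuming each jsonString is already valid JSON; the Kafka settings are placeholders and the spark-sql-kafka connector would need to be on the classpath for the last step):

```scala
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

// Wrap the collected JSON strings in the target envelope as plain strings.
// Since each element of jsonArray is already valid JSON, joining the elements
// with commas inside [ ] produces a valid JSON array, no unified schema needed.
val finalJson = groupedJson.select(
  col("key").cast("string").as("key"),
  concat(
    lit("""{"type": "combined", "values": ["""),
    concat_ws(", ", col("jsonArray")),
    lit("]}")
  ).as("value")
)

// Optional: publish each row's key/value to a Kafka topic
// (bootstrap servers and topic name below are placeholders).
finalJson.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "my-topic")
  .save()
```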