
Generic way to Parse Spark DataFrame to JSON Object/Array Using Spray JSON


I'm trying to find a generic way (without using a concrete case class in Scala) to convert a Spark DataFrame to a JSON object/array using spray-json or any other library.

I have tried to approach this using spray-json, and my current code looks something like this:

import spray.json._
import spray.json.DefaultJsonProtocol._

val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF

list.show
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| a1| b1| c1| d1|
| a2| b2| c2| d2|
+---+---+---+---+

val json = list.toJSON.collect.toJson.prettyPrint

println(json)

Current Output:

["{\"_1\":\"a1\",\"_2\":\"b1\",\"_3\":\"c1\",\"_4\":\"d1\"}", "{\"_1\":\"a2\",\"_2\":\"b2\",\"_3\":\"c2\",\"_4\":\"d2\"}"]

Expected Output:

[{
    "_1": "a1",
    "_2": "b1",
    "_3": "c1",
    "_4": "d1"
}, {
    "_1": "a2",
    "_2": "b2",
    "_3": "c2",
    "_4": "d2"
}]

Please suggest how to get the expected output in the required format without using a concrete Scala case class, either with spray-json or with any other library.


Solution

  • After trying various approaches with several libraries, I finally settled on the simple approach below.

    val list = sc.parallelize(List(("a1","b1","c1","d1"),("a2","b2","c2","d2"))).toDF
    
    val jsonArray = list.toJSON.collect
    /*jsonArray: Array[String] = Array({"_1":"a1","_2":"b1","_3":"c1","_4":"d1"}, {"_1":"a2","_2":"b2","_3":"c2","_4":"d2"})*/
    
    val finalOutput = jsonArray.mkString("[", ",", "]")
    
    /*finalOutput: String = [{"_1":"a1","_2":"b1","_3":"c1","_4":"d1"},{"_1":"a2","_2":"b2","_3":"c2","_4":"d2"}]*/
    

    With this approach, there is no need to use spray-json or any other JSON library.

    Special thanks to @Aman Sehgal. His answer helped me to come up with this optimal solution.

    Note: I have yet to analyze the performance of this approach on a large DataFrame, but in some basic performance testing it looked about as fast as spray-json's ".toJson.prettyPrint".
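
If the indented layout from the question is needed, one option is to keep spray-json but parse each row string before building the array. The original attempt called toJson on an Array[String], so every row was serialized as a quoted JSON string rather than as an object. The following is only a minimal sketch, assuming a spray-json version that provides parseJson on String; the hard-coded jsonArray stands in for the result of list.toJSON.collect, and the name parsed is made up for illustration.

```scala
import spray.json._

// Stand-in for list.toJSON.collect: one JSON document per DataFrame row
val jsonArray: Array[String] = Array(
  """{"_1":"a1","_2":"b1","_3":"c1","_4":"d1"}""",
  """{"_1":"a2","_2":"b2","_3":"c2","_4":"d2"}"""
)

// Parse every row into a JsValue so it is embedded as an object,
// not as an escaped string, then wrap the rows in a JSON array
val parsed: JsArray = JsArray(jsonArray.map(_.parseJson).toVector)

// Indented output; the exact whitespace is decided by spray-json's pretty printer
println(parsed.prettyPrint)
```

This re-parses every row on the driver, so it does more work than the mkString version and is probably only worth it when the output has to be human-readable.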