apache-spark-sql

to_json with static name value spark


I have a dataframe with two array columns:

+---------+-----------------------+
|itemval  |fruit                  |
+---------+-----------------------+
|[1, 2, 3]|[apple, banana, orange]|
+---------+-----------------------+

I am trying to zip them and create a name value pair

+---------+-----------------------+--------------------------------------+
|itemval  |fruit                  |ziped                                 |
+---------+-----------------------+--------------------------------------+
|[1, 2, 3]|[apple, banana, orange]|[[1, apple], [2, banana], [3, orange]]|
+---------+-----------------------+--------------------------------------+

and then make it to JSON, the to_json output is formatted like this

+---------------------------------------------------------------------------+
|zipped                                                                     |
+---------------------------------------------------------------------------+
|[{"_1":"1","_2":"apple"},{"_1":"2","_2":"banana"},{"_1":"3","_2":"orange"}]|
+---------------------------------------------------------------------------+

The format, I am expecting is like this

 +---------------------------------------------------------------------------+
 |zipped                                                                     |
 +---------------------------------------------------------------------------+
 |[{"itemval":"1","name":"apple"},{"itemval":"2","name":"banana"},{"itemval":"3","name":"orange"}]|
 +---------------------------------------------------------------------------+

Here is my implementation:

val df1 = Seq((Array(1,2,3),Array("apple","banana","orange"))).toDF("itemval","fruit")
df1.show(false)
def zipper=udf((list1:Seq[String],list2:Seq[String]) => {
   val zipList = list2 zip list1  
 zipList
 
)
df1.withColumn("ziped",to_json(zipper($"fruit",$"itemval"))).drop("itemval","fruit").show(false)

Solution

  • This is the solution which worked for me. Create a schema with new value and cast it to the column

    val schema = ArrayType(
      StructType(
        Array(
          StructField("itemval",StringType),
          StructField("name",StringType)
        )
      )
    )
    
    val casted =zival.withColumn("result",$"ziped".cast(schema))
    casted.show(false)
    casted.select(to_json($"result")).show(false)
    

    and the out put will be

    casted:org.apache.spark.sql.DataFrame
    ziped:array
    element:struct
    _1:string
    _2:string
    result:array
    element:struct
    itemval:string
    name:string
    
    +-----------------------------------------------------------------+
    |structstojson(result)                                            |
    +-----------------------------------------------------------------+
    |[{"itemval":"3","name":"orange"},{"itemval":"2","name":"banana"}]|
    +-----------------------------------------------------------------+