I am having a following aggregation
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1")).alias("final_data1"),
collect_list(array($"b",$"c",$"data2")).alias("final_data2"))
Here I am doing some aggregation and collecting the result with collect_list
. Earlier we were using spark 1 and it was giving me below data types.
|-- final_data1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- final_data2: array (nullable = true)
| |-- element: string (containsNull = true)
Now we have to migrate to spark 2 but we are getting below schema.
|-- final_data1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- final_data1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
On getting first()
record below is the difference
spark 1.6
[2020-09-26, Ayush, 103.67] => datatype string
spark 2
WrappedArray(2020-09-26, Ayush, 103.67)
How can I keep the same data type?
Edit - Tried Using Concat
One way I got exact schema like Spark 1.6 is by using concat like this
val df_date_agg = df
.groupBy($"msisdn",$"event_date",$"network")
.agg(sum($"data_mou").alias("data_mou_dly"),sum($"voice_mou").alias("voice_mou_dly"))
.groupBy($"msisdn")
.agg(collect_list(concat(lit("["),lit($"event_date"),lit(","),lit($"network"),lit(","),lit($"data_mou_dly"),lit("]")))
Will it affect my code performance?? Is there a better way to do this?
Since you want a string representation of an array, how about casting the array into a string like this?
val df_date_agg = df
.groupBy($"a",$"b",$"c")
.agg(sum($"d").alias("data1"),sum($"e").alias("data2"))
.groupBy($"a")
.agg(collect_list(array($"b",$"c",$"data1") cast "string").alias("final_data1"),
collect_list(array($"b",$"c",$"data2") cast "string").alias("final_data2"))
It might simply be what your old version of spark was doing.
The solution you propose would probably work as well by the way but wrapping your column references with lit
is not necessary (lit($"event_date")
). $"event_date"
is enough.