scala apache-spark group-by

Forming a list of columns after groupByKey or groupBy


I have this input DataFrame

input_df:

C1 C2 C3
A 1 12/06/2012
A 2 13/06/2012
B 3 12/06/2012
B 4 17/06/2012
C 5 14/06/2012

After some transformations, I want to get a DataFrame like the one below: grouped by C1, with a new column C4 that holds the list of (C2, C3) pairs for each group.

output_df:

C1 C4
A (1, 12/06/2012), (2, 13/06/2012)
B (3, 12/06/2012), (4, 17/06/2012)
C (5, 14/06/2012)

I get close to that result when I try this:

val output_df = input_df.map(x => (x(0), (x(1), x(2))) ).groupByKey()

I obtain this result

(A,CompactBuffer((1, 12/06/2012), (2, 13/06/2012)))    
(B,CompactBuffer((3, 12/06/2012), (4, 17/06/2012)))   
(C,CompactBuffer((5, 14/06/2012)))

But I don't know how to convert this into a DataFrame, or whether this is a good way to do it.
Any advice is welcome, even if it uses a different approach.


Solution

  • Please try this:

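    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.Row

    // Local Spark context plus an SQLContext, whose implicits provide toDF
    val conf = new SparkConf().setAppName("groupBy").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._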
    
    val rdd = sc.parallelize(
      Seq(("A",1,"12/06/2012"),("A",2,"13/06/2012"),("B",3,"12/06/2012"),("B",4,"17/06/2012"),("C",5,"14/06/2012")) )
    
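    // Key each tuple by its first element, keeping the whole tuple as the value
    val v1 = rdd.map(x => (x._1, x))
    // Gather all tuples that share a key
    val v2 = v1.groupByKey()
    // Turn each group into an Array so that it can be stored in a DataFrame column
    val v3 = v2.mapValues(v => v.toArray)

    // One row per key; "theValues" is an array of structs
    val df2 = v3.toDF("aKey", "theValues")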
    df2.printSchema()
    
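    // Inspect the first grouped row. groupByKey does not guarantee key order,
    // so the values shown in the comments below may differ on your run.
    val first = df2.first
    println(first)

    // The grouping key
    println(first.getString(0))

    // The grouped tuples come back as a sequence of Row structs
    val values = first.getSeq[Row](1)

    val firstArray = values(0)

    println(firstArray.getString(0)) // B
    println(firstArray.getInt(1))    // 3
    println(firstArray.getString(2)) // 12/06/2012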
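
Since the question also asks for other approaches: the same grouping can probably be done without leaving the DataFrame API, by packing C2 and C3 into a struct and collecting the structs per key. This is only a minimal sketch, assuming Spark 2.x or later (where `SparkSession` replaces `SQLContext`). The names `spark`, `inputDf` and `outputDf` are mine for illustration, not from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Assumption: Spark 2.x or later, running locally.
val spark = SparkSession.builder()
  .appName("groupBy")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same sample rows as in the question.
val inputDf = Seq(
  ("A", 1, "12/06/2012"),
  ("A", 2, "13/06/2012"),
  ("B", 3, "12/06/2012"),
  ("B", 4, "17/06/2012"),
  ("C", 5, "14/06/2012")
).toDF("C1", "C2", "C3")

// Pack C2 and C3 into one struct per row, then collect the structs for each C1.
val outputDf = inputDf
  .groupBy("C1")
  .agg(collect_list(struct(col("C2"), col("C3"))).as("C4"))

outputDf.printSchema()
outputDf.show(false) // false = do not truncate long cell values
```

Here C4 should come out as an array of structs with fields C2 and C3, one row per C1 value. Note that `collect_list` does not promise any particular order of the pairs inside each list, so sort afterwards if the order matters.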