scala, apache-spark, apache-spark-sql, apache-spark-dataset, apache-spark-encoders

Convert scala list to DataFrame or DataSet


I am new to Scala. I am trying to convert a Scala list (which holds the results of some calculations on a source DataFrame) to a DataFrame or a Dataset. I have not found any direct method to do that. I have tried the following approaches to convert my list to a Dataset, but none of them work as desired. I describe the 3 situations below.

Can someone please give me some ray of hope on how to do this conversion? Thanks.

import org.apache.spark.sql.{DataFrame, Row, SQLContext, DataFrameReader}
import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection._

case class TestPerson(name: String, age: Long, salary: Double)
var tom = new TestPerson("Tom Hanks",37,35.5)
var sam = new TestPerson("Sam Smith",40,40.5)

val PersonList = mutable.MutableList[TestPerson]()

//Adding data in list
PersonList += tom
PersonList += sam

//Situation 1: Trying to create dataset from List of objects:- Result:Error
//Throwing error
var personDS = Seq(PersonList).toDS()
/*
ERROR:
error: Unable to find encoder for type stored in a Dataset.  Primitive types
   (Int, String, etc) and Product types (case classes) are supported by     
importing sqlContext.implicits._  Support for serializing other types will  
be added in future releases.
     var personDS = Seq(PersonList).toDS()

*/
//Situation 2: Trying to add data 1-by-1 :- Result: not working as desired.
//The last record overwrites any existing data in the DS.
var personDS = Seq(tom).toDS()
personDS = Seq(sam).toDS()

personDS += sam //not working. throwing error


//Situation 3: Working. However, I am having consolidated data in the list
//which I want to convert to DS; if I loop the results of the list into comma
//separated values and then pass them here, it will work, but it will create an
//extra loop in the code, which I want to avoid.
var personDS = Seq(tom,sam).toDS()
scala> personDS.show()
+---------+---+------+
|     name|age|salary|
+---------+---+------+
|Tom Hanks| 37|  35.5|
|Sam Smith| 40|  40.5|
+---------+---+------+

Solution

  • Try calling toDS() on the list directly, without wrapping it in Seq:

    import scala.collection.mutable
    import spark.implicits._ // required for toDS()/toDF(); use sqlContext.implicits._ before Spark 2.x
    
    case class TestPerson(name: String, age: Long, salary: Double)
    val tom = TestPerson("Tom Hanks",37,35.5)
    val sam = TestPerson("Sam Smith",40,40.5)
    val PersonList = mutable.MutableList[TestPerson]()
    PersonList += tom
    PersonList += sam
    
    val personDS = PersonList.toDS()
    println(personDS.getClass)
    personDS.show()
    
    val personDF = PersonList.toDF()
    println(personDF.getClass)
    personDF.show()
    personDF.select("name", "age").show()
    

    Output:

    class org.apache.spark.sql.Dataset
    
    +---------+---+------+
    |     name|age|salary|
    +---------+---+------+
    |Tom Hanks| 37|  35.5|
    |Sam Smith| 40|  40.5|
    +---------+---+------+
    
    class org.apache.spark.sql.DataFrame
    
    +---------+---+------+
    |     name|age|salary|
    +---------+---+------+
    |Tom Hanks| 37|  35.5|
    |Sam Smith| 40|  40.5|
    +---------+---+------+
    
    +---------+---+
    |     name|age|
    +---------+---+
    |Tom Hanks| 37|
    |Sam Smith| 40|
    +---------+---+
    
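    If you prefer not to rely on the implicit conversions, the same conversion can be done explicitly through the SparkSession API. A minimal sketch, assuming a SparkSession named `spark` is already in scope (as in spark-shell):

    ```scala
    import scala.collection.mutable
    import org.apache.spark.sql.{DataFrame, Dataset}

    // Assumes an existing SparkSession named `spark` (e.g. from spark-shell)
    import spark.implicits._ // brings the Encoder for the case class into scope

    case class TestPerson(name: String, age: Long, salary: Double)

    val people = mutable.MutableList(
      TestPerson("Tom Hanks", 37, 35.5),
      TestPerson("Sam Smith", 40, 40.5)
    )

    // createDataset accepts any Seq[T] as long as an Encoder[T] is in scope
    val personDS: Dataset[TestPerson] = spark.createDataset(people)

    // createDataFrame accepts a Seq of Products (case classes) directly
    val personDF: DataFrame = spark.createDataFrame(people)
    ```

    Both methods accept the mutable list as-is, because MutableList is a Seq.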

    Also, make sure to declare the case class TestPerson outside the scope of your object (and outside your main method); otherwise Spark cannot derive an encoder for it and you will hit the "Unable to find encoder" error from Situation 1.
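    Regarding Situation 2: a Dataset is immutable, so `+=` can never append to one. To combine records built separately, you create a new Dataset with `union` and reassign. A hedged sketch, assuming the TestPerson case class from above and `spark.implicits._` imported:

    ```scala
    import org.apache.spark.sql.Dataset

    // Assumes: case class TestPerson(name: String, age: Long, salary: Double)
    // and `import spark.implicits._` already in scope
    val tom = TestPerson("Tom Hanks", 37, 35.5)
    val sam = TestPerson("Sam Smith", 40, 40.5)

    var personDS: Dataset[TestPerson] = Seq(tom).toDS()
    // union returns a NEW Dataset with the rows of both; reassign the var
    personDS = personDS.union(Seq(sam).toDS())
    ```

    Note that building a Dataset row-by-row like this is expensive; it is usually better to collect the items in an ordinary Scala collection first and convert once at the end, as in the accepted approach above.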