
Expanding dataframe column into multiple columns


I have a DataFrame with a single column named value, where each row holds one comma-separated line and the first row holds the column names:

------------------------
| value                |
|----------------------|
| col1,col2,col3,col4  |
| v1,v2,v3,v4          |
| v1,v5,v9,v11         |
|----------------------|

I would like to turn it into a DataFrame in Spark (Scala) with one column per field, like the following:

-----------------------------
| col1 | col2 | col3 | col4 |
|---------------------------|
| v1   | v2   | v3   | v4   |
|---------------------------|
| v1   | v5   | v9   | v11  |
|---------------------------|

One way I can think of is to build a new DataFrame with withColumn(). However, I am wondering whether Spark has a better way of doing this.

PS: My initial attempt was to read a CSV file bundled inside an uber jar in a Spark environment. However, according to "Load CSV file as dataframe from resources within an Uber Jar", there seems to be no easy way to read a CSV from inside a jar.


Solution

  • One possibility is to read the lines of the CSV resource from the classpath, turn them into a Dataset[String], and pass that to spark.read.csv, which parses the header and splits the fields into columns:

    src/main/scala/App.scala

    import org.apache.spark.sql.SparkSession
    
    import scala.io.Source
    
    object App {
    
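      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("csv-from-resource").master("local").getOrCreate()
        import spark.implicits._
        // Read the resource from the classpath (this also works inside an uber jar)
        // and close it once all lines have been materialized.
        val source = Source.fromResource("data.csv")
        val lines = try source.getLines().toList finally source.close()
        // spark.read.csv accepts a Dataset[String], so no file path is needed.
        val csv = spark.createDataset(lines)
        spark.read.option("header", value = true).csv(csv).show()
        spark.stop()
      }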
    
    }
    

    src/main/resources/data.csv

    col1,col2,col3,col4
    v1,v2,v3,v4
    v1,v5,v9,v11
    

    build.sbt

    ThisBuild / version := "0.1.0-SNAPSHOT"
    
    ThisBuild / scalaVersion := "2.12.18"
    
    lazy val root = (project in file("."))
      .settings(
        name := "spark-playground",
        libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1",
      )
    

    stdout

    +----+----+----+----+
    |col1|col2|col3|col4|
    +----+----+----+----+
    |  v1|  v2|  v3|  v4|
    |  v1|  v5|  v9| v11|
    +----+----+----+----+
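
  • If the lines are already in a single-column DataFrame, as in the question, the same reader should work without going back to the file. The following is a minimal sketch, not a definitive implementation: the object name ExpandValueColumn, the helper expand, and the in-memory sample rows are made up for illustration, and it assumes the header line is the first row of the input.

    ```scala
    import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}

    object ExpandValueColumn {

      // Hypothetical helper: treats the single string column as raw CSV lines
      // and lets the CSV reader split them into columns.
      // Assumes the first row of df is the header line.
      def expand(spark: SparkSession, df: DataFrame): DataFrame =
        spark.read.option("header", "true").csv(df.as[String](Encoders.STRING))

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("expand-value-column").master("local[*]").getOrCreate()
        import spark.implicits._

        // Stand-in for the single-column DataFrame from the question.
        val raw = Seq("col1,col2,col3,col4", "v1,v2,v3,v4", "v1,v5,v9,v11").toDF("value")

        expand(spark, raw).show()
        spark.stop()
      }

    }
    ```

    This avoids a chain of withColumn() calls and leaves quoting and header handling to the CSV parser. Without a schema or the inferSchema option, every resulting column is a string.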