dataframe scala csv apache-spark databricks

Expanding dataframe column into multiple columns

The DataFrame has a single column, value, and looks like the following:

------------------------
| value                |
|----------------------|
| col1,col2,col3,col4  |
| v1,v2,v3,v4          |
| v1,v5,v9,v11         |
|----------------------|

I would like to generate a DataFrame in Spark (Scala) that looks like the following:

-----------------------------
| col1 | col2 | col3 | col4 |
|---------------------------|
| v1   | v2   | v3   | v4   |
|---------------------------|
| v1   | v5   | v9   | v11  |
|---------------------------|

One way I can think of is to build a new DataFrame using withColumn(). However, I am wondering whether Spark has a better way of doing this.

PS - My initial attempt was to read a CSV file bundled inside an uber jar in the Spark environment. However, it looks like there is no easy way to read a CSV inside a jar, as per "Load CSV file as dataframe from resources within an Uber Jar".

Solution

One possibility is to read the lines of the CSV file from the classpath, make a Dataset[String] out of them, and then feed it into spark.read.csv:

src/main/scala/App.scala

import org.apache.spark.sql.SparkSession

import scala.io.Source

object App {

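  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("csv-from-resource").master("local").getOrCreate()
    import spark.implicits._
    // Read the CSV lines from the classpath; this also works when the file is packaged inside a jar
    val source = Source.fromResource("data.csv")
    val lines = try source.getLines().toList finally source.close()
    // Let Spark's CSV reader parse the lines, treating the first one as the header
    val csv = spark.createDataset(lines)
    spark.read.option("header", value = true).csv(csv).show()
    spark.stop()
  }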

}

src/main/resources/data.csv

col1,col2,col3,col4
v1,v2,v3,v4
v1,v5,v9,v11

build.sbt

ThisBuild / version := "0.1.0-SNAPSHOT"

ThisBuild / scalaVersion := "2.12.18"

lazy val root = (project in file("."))
  .settings(
    name := "spark-playground",
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1",
  )

stdout

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  v1|  v2|  v3|  v4|
|  v1|  v5|  v9| v11|
+----+----+----+----+
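
If the lines are already sitting in a single-column DataFrame, as in the question, the same idea should apply without going back to the file: convert the DataFrame to a Dataset[String] and hand that to spark.read.csv. The following is a minimal sketch under that assumption. The object name ExpandValueColumn and the helper expand are hypothetical names made up for illustration, and the input DataFrame is built inline from the sample rows in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExpandValueColumn {

  // Hypothetical helper: parses a single string column of CSV lines into separate columns.
  def expand(df: DataFrame): DataFrame = {
    val spark = df.sparkSession
    import spark.implicits._
    // as[String] assumes the DataFrame has exactly one column of string type
    spark.read.option("header", value = true).csv(df.as[String])
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("expand-value-column").master("local").getOrCreate()
    import spark.implicits._
    // Sample data copied from the question; the first row holds the column names
    val df = Seq("col1,col2,col3,col4", "v1,v2,v3,v4", "v1,v5,v9,v11").toDF("value")
    expand(df).show()
    spark.stop()
  }

}
```

Compared with splitting the column by hand via withColumn(), this leaves header handling and quoting to Spark's CSV parser. It relies on the header line being the first row of the Dataset, which holds for the small local example here but is worth checking for data that has been repartitioned or shuffled.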