The data frame, with a single value column, looks like the following:
|---------------------|
| value               |
|---------------------|
| col1,col2,col3,col4 |
| v1,v2,v3,v4         |
| v1,v5,v9,v11        |
|---------------------|
I would like to generate a DataFrame like the following in Spark Scala:
|------|------|------|------|
| col1 | col2 | col3 | col4 |
|------|------|------|------|
| v1   | v2   | v3   | v4   |
| v1   | v5   | v9   | v11  |
|------|------|------|------|
One way I can think of is to generate a new DataFrame using withColumn(), roughly as in the sketch below. However, I am wondering if Spark has a better way of doing this.
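Something like this untested sketch is what I mean (it assumes the embedded header row can be dropped by matching its literal text):

import org.apache.spark.sql.functions.{col, split}

// Split each CSV string into an array and pull the fields out one by one.
val parts = split(col("value"), ",")
val result = df
  .filter(col("value") =!= "col1,col2,col3,col4") // drop the embedded header row
  .withColumn("col1", parts.getItem(0))
  .withColumn("col2", parts.getItem(1))
  .withColumn("col3", parts.getItem(2))
  .withColumn("col4", parts.getItem(3))
  .drop("value")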
PS - My initial attempt was to read a CSV from inside an uber jar in the Spark environment, but it looks like there is no easy way to read a CSV inside a jar, as per Load CSV file as dataframe from resources within an Uber Jar.
One possibility is to read the data out of the CSV file, make a Dataset[String] out of it, and then feed it into spark.read.csv:
src/main/scala/App.scala
import org.apache.spark.sql.SparkSession
import scala.io.Source

object App {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("csv-from-resource").master("local").getOrCreate()
    import spark.implicits._

    // Read the resource file line by line into a local Seq, then parallelize it
    val csv = spark.createDataset(Source.fromResource("data.csv").getLines().toSeq)

    // spark.read.csv accepts a Dataset[String] holding one CSV record per element
    spark.read.option("header", value = true).csv(csv).show()

    spark.stop()
  }
}
src/main/resources/data.csv
col1,col2,col3,col4
v1,v2,v3,v4
v1,v5,v9,v11
build.sbt
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / scalaVersion := "2.12.18"
lazy val root = (project in file("."))
  .settings(
    name := "spark-playground",
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1",
  )
stdout
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| v1| v2| v3| v4|
| v1| v5| v9| v11|
+----+----+----+----+
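If you already have the single-column DataFrame from the question (rather than a resource file), the same trick should apply, since a one-column DataFrame of strings can be converted to a Dataset[String] with .as[String]. A minimal sketch, assuming df is that DataFrame; note that Spark does not guarantee the header row stays first in a distributed DataFrame, so this relies on the input ordering being preserved:

import spark.implicits._

// Each row's value is treated as one raw CSV record; the first record becomes the header.
spark.read.option("header", value = true).csv(df.as[String]).show()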