The idea comes from this video: https://www.youtube.com/watch?v=BfaBeT0pRe0&t=526s, where they talk about implementing type safety through custom types.
A possible trivial implementation is
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

trait Col[Self] { self: Self => }

trait Id extends Col[Id]
object IdCol extends Id

trait Val extends Col[Val]
object ValCol extends Val

trait Comment extends Col[Comment]
object CommentCol extends Comment

case class DataSet[Schema >: Nothing](df: DataFrame) {
  def validate[T1 <: Col[T1], T2 <: Col[T2]](
      col1: (Col[T1], String),
      col2: (Col[T2], String)
  ): Option[DataSet[Schema with T1 with T2]] =
    if (df.columns
          .map(_.toLowerCase)
          .count(e => e == col1._2.toLowerCase || e == col2._2.toLowerCase) >= 1)
      Some(DataSet[Schema with T1 with T2](df))
    else None
}
object SchemaTypes extends App {
  lazy val spark: SparkSession = SparkSession
    .builder()
    .config(new SparkConf().setAppName(getClass.getName))
    .getOrCreate()

  import spark.implicits._

  val df = Seq(
    (1, "a", "first value"),
    (2, "b", "second value"),
    (3, "c", "third value")
  ).toDF("Id", "Val", "Comment")

  val myData =
    DataSet/*[Id with Val with Comment]*/(df)
      .validate(IdCol -> "Id", ValCol -> "Val")

  myData match {
    case None => throw new java.lang.Exception("Required columns missing")
    case _    =>
  }
}
The type of myData is Option[DataSet[Nothing with T1 with T2]]. That makes sense, since the constructor is called without any type parameter, but in the video they show the type as DataSet[T1 with T2]. Of course, passing an explicit type at the call site takes Nothing out, but specifying the type parameter is redundant, since the types are already carried by the argument list.
val myData =
  DataSet[Id with Val with Comment](df).validate(IdCol -> "Id", ValCol -> "Val")
The types Id and Val can be inferred because IdCol and ValCol appear inside .validate, but the type Comment can't be inferred. So try
val myData =
  DataSet[Comment](df)
    .validate(IdCol -> "Id", ValCol -> "Val")
println(shapeless.test.showType(SchemaTypes.myData))
//Option[App.DataSet[App.Comment with App.Id with App.Val]]
https://scastie.scala-lang.org/yj0HnpkyQfCreKq8ZV4D7A
Actually, if you specify DataSet[Id with Val with Comment](df), the type will be Option[DataSet[Id with Val with Comment with Id with Val]], which is equal (=:=) to Option[DataSet[Id with Val with Comment]].
OK, I watched the video up to that time-code. I guess the speakers were trying to explain their idea (combining F-bounded polymorphism T <: Col[T] with intersection types T with U), and you shouldn't take their slides literally; there can be inaccuracies in them.
First they show the slide
case class DataSet[Schema](df: DataFrame) {
  def validate[T <: Col[T]](
      col: (Col[T], String)
  ): Option[DataSet[Schema with T]] = ???
}
and this code can be illustrated with
val myDF: DataFrame = ???
val myData = DataSet[VideoId](myDF).validate(Country -> "country_code")
myData : Option[DataSet[VideoId with Country]]
Then they show the slide
val myData = DataSet(myDF).validate(
  VideoId -> "video_id",
  Country -> "country_code",
  ProfileId -> "profile_id",
  Score -> "score"
)

myData : DataSet[VideoId with Country with ProfileId with Score]
but this illustrating code doesn't correspond to the previous slide. You would have to define
// actually we don't use Schema here
case class DataSet[Schema](df: DataFrame) {
  def validate[T1 <: Col[T1], T2 <: Col[T2], T3 <: Col[T3], T4 <: Col[T4]](
      col1: (Col[T1], String),
      col2: (Col[T2], String),
      col3: (Col[T3], String),
      col4: (Col[T4], String)
  ): DataSet[T1 with T2 with T3 with T4] = ???
}
So take it as an idea, not literally.
You can have something similar with
case class DataSet[Schema](df: DataFrame) {
  def validate[T <: Col[T]](
      col: (Col[T], String)
  ): Option[DataSet[Schema with T]] = ???
}
val myDF: DataFrame = ???

val myData = DataSet[Any](myDF)
  .validate(VideoId -> "video_id")
  .flatMap(_.validate(Country -> "country_code"))
  .flatMap(_.validate(ProfileId -> "profile_id"))
  .flatMap(_.validate(Score -> "score"))

myData: Option[DataSet[VideoId with Country with ProfileId with Score]]
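For completeness, here is a self-contained sketch of that chaining that actually runs, with DataFrame stubbed as a plain case class holding column names (an assumption made only so the example works without a Spark dependency; the Schema type accumulates the same way):

```scala
object ChainedValidate {
  // Stub standing in for Spark's DataFrame (assumption: only the
  // column names are needed to demonstrate validation).
  case class DataFrame(columns: Seq[String])

  trait Col[Self] { self: Self => }
  trait VideoId extends Col[VideoId]
  object VideoId extends VideoId
  trait Country extends Col[Country]
  object Country extends Country

  case class DataSet[Schema](df: DataFrame) {
    // One column per call; each success widens Schema by T.
    def validate[T <: Col[T]](col: (Col[T], String)): Option[DataSet[Schema with T]] =
      if (df.columns.exists(_.equalsIgnoreCase(col._2)))
        Some(DataSet[Schema with T](df))
      else None
  }

  val df = DataFrame(Seq("video_id", "country_code"))

  // myData: Option[DataSet[Any with VideoId with Country]]
  val myData = DataSet[Any](df)
    .validate(VideoId -> "video_id")
    .flatMap(_.validate(Country -> "country_code"))

  // A missing column short-circuits the chain to None.
  val bad = DataSet[Any](df).validate(Country -> "profile_id")

  def main(args: Array[String]): Unit = {
    println(myData.isDefined) // true
    println(bad.isEmpty)      // true
  }
}
```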