scala, apache-spark, generics, types, f-bounded-polymorphism

How to take Nothing out of an inferred type


The idea comes from this video: https://www.youtube.com/watch?v=BfaBeT0pRe0&t=526s, where the speakers talk about achieving type safety by implementing custom types.

A possible trivial implementation is:

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

trait Col[Self] { self: Self =>
}

trait Id extends Col[Id]
object IdCol extends Id

trait Val extends Col[Val]
object ValCol extends Val

trait Comment extends Col[Comment]
object CommentCol extends Comment

case class DataSet[Schema >: Nothing](df: DataFrame) {

  // Succeeds only if both requested columns are present (case-insensitive),
  // widening the schema type with the validated column types.
  def validate[T1 <: Col[T1], T2 <: Col[T2]](
      col1: (Col[T1], String),
      col2: (Col[T2], String)
  ): Option[DataSet[Schema with T1 with T2]] = {
    val columns = df.columns.map(_.toLowerCase).toSet
    if (columns.contains(col1._2.toLowerCase) &&
        columns.contains(col2._2.toLowerCase))
      Some(DataSet[Schema with T1 with T2](df))
    else None
  }
}

object SchemaTypes extends App {

  lazy val spark: SparkSession = SparkSession
    .builder()
    .config(
      new SparkConf()
        .setAppName(
          getClass()
            .getName()
        )
    )
    .getOrCreate()

  import spark.implicits._

  val df = Seq(
    (1, "a", "first value"),
    (2, "b", "second value"),
    (3, "c", "third value")
  ).toDF("Id", "Val", "Comment")

  val myData =
    DataSet/*[Id with Val with Comment]*/(df)
      .validate(IdCol -> "Id", ValCol -> "Val")

  myData match {
    case None => throw new java.lang.Exception("Required columns missing")
    case _    =>
  }
}

The type of myData is Option[DataSet[Nothing with Id with Val]]. It makes sense, since the constructor is called without any type argument and Schema is therefore inferred as Nothing, but in the video they show the resulting type to be in line with DataSet[T1 with T2].
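
That Nothing comes from plain type inference, not from anything Spark-specific. As a minimal standalone sketch (Box is a hypothetical class, not part of the code above): when a type parameter appears nowhere in the argument list, the compiler has nothing to constrain it with, so it falls back to Nothing:

case class Box[A](value: Any)

// A does not occur in the argument list, so the compiler
// infers the bottom type Nothing for it:
val b = Box(42) // b: Box[Nothing]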

Of course, changing the invocation to pass an explicit type argument takes Nothing out, but it is redundant to specify the type parameter when the types are already carried by the argument list.

val myData =
  DataSet[Id with Val with Comment](df).validate(IdCol -> "Id", ValCol -> "Val")

Solution

  • Types Id and Val can be inferred because IdCol and ValCol appear among the arguments of .validate, but Comment can't be inferred. So try supplying only it:

    val myData =
      DataSet[Comment](df)
        .validate(IdCol -> "Id", ValCol -> "Val")
    
    println(shapeless.test.showType(SchemaTypes.myData)) 
    //Option[App.DataSet[App.Comment with App.Id with App.Val]]
    

    https://scastie.scala-lang.org/yj0HnpkyQfCreKq8ZV4D7A

    Actually, if you specify DataSet[Id with Val with Comment](df), the type will be Option[DataSet[Id with Val with Comment with Id with Val]], which is equal (=:=) to Option[DataSet[Id with Val with Comment]].
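
    One way to check this equality is to ask the compiler for an =:= instance. A minimal sketch, assuming the definitions from the question are in scope; it compiles precisely because Scala 2 treats intersection types as equal up to ordering and duplicates:

    implicitly[
      DataSet[Id with Val with Comment with Id with Val] =:=
        DataSet[Id with Val with Comment]
    ]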


    OK, I watched the video up to that timecode. I guess the speakers were trying to explain their idea (combining F-bounded polymorphism T <: Col[T] with intersection types T with U), so you shouldn't take their slides literally; there can be inaccuracies in them.

    First they show this slide:

    case class DataSet[Schema](df: DataFrame) {   
      def validate[T <: Col[T]](
        col: (Col[T], String)
      ): Option[DataSet[Schema with T]] = ??? 
    }
    

    and this code can be illustrated with

    val myDF: DataFrame = ???
    val myData = DataSet[VideoId](myDF).validate(Country -> "country_code")
    myData : Option[DataSet[VideoId with Country]]
    

    Then they show this slide:

    val myData = DataSet(myDF).validate(
      VideoId -> "video_id",
      Country -> "country_code",
      ProfileId -> "profile_id",
      Score -> "score"
    )
    
    myData : DataSet[VideoId with Country with ProfileId with Score]
    

    but this illustration doesn't correspond to the previous slide. To match it, you would have to define

    // actually we don't use Schema here
    case class DataSet[Schema](df: DataFrame) {
      def validate[T1 <: Col[T1], T2 <: Col[T2], T3 <: Col[T3], T4 <: Col[T4]](
        col1: (Col[T1], String),
        col2: (Col[T2], String),
        col3: (Col[T3], String),
        col4: (Col[T4], String)
      ): DataSet[T1 with T2 with T3 with T4] = ???
    }
    

    So take it as an idea, not literally.

    You can achieve something similar with

    case class DataSet[Schema](df: DataFrame) {
      def validate[T <: Col[T]](
        col: (Col[T], String)
      ): Option[DataSet[Schema with T]] = ???
    }
    
    val myDF: DataFrame = ???
    
    val myData = DataSet[Any](myDF).validate(VideoId -> "video_id").flatMap(
      _.validate(Country -> "country_code")
    ).flatMap(
      _.validate(ProfileId -> "profile_id")
    ).flatMap(
      _.validate(Score -> "score")
    )
    
    myData: Option[DataSet[VideoId with Country with ProfileId with Score]]
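
    Since each validate call returns an Option, the same chain can also be written as a for-comprehension; this is just the standard desugaring of the flatMap calls above, not a different technique:

    val myData = for {
      d1 <- DataSet[Any](myDF).validate(VideoId -> "video_id")
      d2 <- d1.validate(Country -> "country_code")
      d3 <- d2.validate(ProfileId -> "profile_id")
      d4 <- d3.validate(Score -> "score")
    } yield d4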