apache-spark, apache-spark-sql, apache-spark-dataset

Why do columns change to nullable in Apache Spark SQL?


Why is nullable = true used after some functions are executed, even though there are no NaN values in the DataFrame?

val myDf = Seq((2,"A"),(2,"B"),(1,"C"))
         .toDF("foo","bar")
         .withColumn("foo", 'foo.cast("Int"))

myDf.withColumn("foo_2", when($"foo" === 2 , 1).otherwise(0)).select("foo", "foo_2").show

When printSchema is called on this DataFrame, nullable will be false for both columns.
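
For reference, the printed schema should look roughly like this:

root
 |-- foo: integer (nullable = false)
 |-- foo_2: integer (nullable = false)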

val fooMap = Map(
  1 -> "small",
  2 -> "big"
)

val foo: (Int => String) = (t: Int) => {
  fooMap.get(t) match {
    case Some(tt) => tt
    case None => "notFound"
  }
}

val fooUDF = udf(foo)

myDf
    .withColumn("foo", fooUDF(col("foo")))
    .withColumn("foo_2", when($"foo" === 2, 1).otherwise(0))
    .select("foo", "foo_2")
    .printSchema
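
This prints roughly the following schema, where foo is now the String returned by the UDF:

root
 |-- foo: string (nullable = true)
 |-- foo_2: integer (nullable = false)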

However, nullable is now true for at least one column that was false before. How can this be explained?


Solution

  • When creating a Dataset from a statically typed structure (without depending on the schema argument), Spark uses a relatively simple set of rules to determine the nullable property.

    Since Scala String is java.lang.String, which can be null, the generated column is nullable. For the same reason, the bar column is nullable in the initial dataset:

    val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
    val df1 = data1.toDF("foo", "bar")
    df1.schema("bar").nullable
    
    Boolean = true
    

    but foo is not (scala.Int cannot be null).

    df1.schema("foo").nullable
    
    Boolean = false
    

    If we change the data definition to:

    val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))
    

    foo will be nullable (Integer is java.lang.Integer, and a boxed integer can be null):

    data2.toDF("foo", "bar").schema("foo").nullable
    
    Boolean = true
    

    See also: SPARK-20668 Modify ScalaUDF to handle nullability.
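
    How this applies to the UDF from the question, as a sketch (assuming Spark 2.4 or later, where UserDefinedFunction.asNonNullable is available, and reusing foo and myDf defined above): a Scala UDF that returns String yields a nullable column by default, but the flag can be dropped when the function is known never to return null.

    import org.apache.spark.sql.functions.{col, udf}

    // Nullable by default: the UDF returns java.lang.String, which can be null.
    val fooUDF = udf(foo)

    // Spark 2.4+: declare that the UDF never returns null.
    val fooUDFNonNull = udf(foo).asNonNullable()

    myDf.withColumn("foo", fooUDFNonNull(col("foo"))).schema("foo").nullable
    
    Boolean = false
    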