Tags: apache-spark, apache-spark-sql, scala-spark

Why is the behavior in Spark 3.2 different when mixed-case column names are used vs. same-case names?


I am running a simple query in Spark 3.2:

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
df2.select("id").show() 

The above query returns a result, but when I mix the casing it throws an exception:

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_diff_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_diff_case.head, op_cols_diff_case.tail: _*)
df2.select("id").show() 

In my test, spark.sql.caseSensitive was left at its default (false).
I expect either both queries to return a result, or both to fail.
Why does one fail and not the other?


Solution

  • Whether this is an issue or a non-issue depends on what seems logical to you. There is a long thread on this pull request, where some believe the behavior is correct while others think it's wrong.

    But the changes in that pull request do make the behavior consistent.
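    The intuition can be illustrated with a small, self-contained sketch (plain Scala, not Spark's actual analyzer code): exact duplicates in the projected column list collapse to a single name, but "id" and "ID" remain distinct entries, so a case-insensitive lookup then finds two candidates and the reference becomes ambiguous.

    ```scala
    // Hypothetical sketch of case-insensitive column resolution;
    // not Spark's real implementation.
    val opColsSameCase = List("id", "col2", "col3", "col4", "col5", "id")
    val opColsDiffCase = List("id", "col2", "col3", "col4", "col5", "ID")

    // Collapse exact duplicates, then match the requested name
    // case-insensitively (mimicking spark.sql.caseSensitive=false).
    def resolve(cols: List[String], name: String): List[String] =
      cols.distinct.filter(_.equalsIgnoreCase(name))

    resolve(opColsSameCase, "id") // single candidate: List("id")
    resolve(opColsDiffCase, "id") // two candidates: List("id", "ID") -> ambiguous
    ```

    In the same-case list the two "id" entries are indistinguishable, so resolution is unambiguous; in the mixed-case list both spellings survive and the lookup cannot pick one, which is analogous to the AnalysisException you see.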