I am running a simple query on two versions of Spark, 2.3 and 3.2. The code is as follows:
spark-shell --master yarn --deploy-mode client
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()
In Spark 2.3 it returns:
+----+
| id |
+----+
| 1 |
| 1 |
+----+
But in Spark 3.2 it fails with:
org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:97)
I was expecting both versions to return the same result, or at least a configuration option that makes the behavior consistent. Setting the following does not change the behavior:
spark.sql.analyzer.failAmbiguousSelfJoin=false
spark.sql.caseSensitive=False
On top of this, when both occurrences of the column use the same case, it works:
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()
Further analysis shows that this behavior was introduced in 2.4: the same query also fails in Spark 2.4.
The error was introduced in Spark 2.4, when new code was added under expression. In Spark 2.3, distinct was applied to the candidates before matching on them, but the later code matches on candidates/prunedCandidates without distinct. Once distinct is added back while resolving attributes for the plan, the behavior is the same as in 2.3.
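To illustrate why the missing distinct matters, here is a simplified toy model in plain Scala. It is not Spark's actual code: `Attr`, `candidates`, and `resolve` are hypothetical names, and the model assumes that both selected columns ("id" and "ID") point at the same underlying attribute and that matching candidates are normalized to the requested name. It only shows the effect of deduplicating the candidates. It does not explain why the same-case variant succeeds.

```scala
// Toy stand-in for a resolved attribute: a name plus an expression id.
// (Hypothetical type, not Spark's AttributeReference.)
final case class Attr(name: String, exprId: Long)

// Collect case-insensitive matches and normalize them to the requested name,
// so two references to the same attribute become equal values.
def candidates(output: Seq[Attr], wanted: String): Seq[Attr] =
  output.collect { case a if a.name.equalsIgnoreCase(wanted) => a.copy(name = wanted) }

// Resolve a name against the plan output, optionally deduplicating candidates.
def resolve(output: Seq[Attr], wanted: String, dedup: Boolean): Either[String, Attr] = {
  val found = candidates(output, wanted)
  val pruned = if (dedup) found.distinct else found
  pruned match {
    case Seq(a) => Right(a)
    case Seq()  => Left(s"cannot resolve '$wanted'")
    case many   => Left(s"Reference '$wanted' is ambiguous (${many.size} candidates)")
  }
}

// "id" and "ID" were both selected from the same source column, so they share exprId 1.
val output = Seq(Attr("id", 1L), Attr("col2", 2L), Attr("ID", 1L))

val withoutDistinct = resolve(output, "id", dedup = false) // Left: ambiguous
val withDistinct    = resolve(output, "id", dedup = true)  // Right(Attr("id", 1))
```

In this model the two candidates are the same attribute seen twice, so deduplicating them leaves exactly one match and the reference is no longer reported as ambiguous.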
The PR for this fix has been merged into the Spark 3.4 branch. See: https://github.com/apache/spark/pull/40258
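If you are stuck on a version without the fix, one possible workaround (a sketch, not something the PR prescribes) is to drop case-insensitive duplicates from the column list before calling select. `dedupIgnoreCase` is a hypothetical helper name. Note that this changes the shape of df2: it ends up with a single id column instead of both id and ID.

```scala
// Keep the first occurrence of each column name, comparing case-insensitively.
def dedupIgnoreCase(cols: Seq[String]): Seq[String] =
  cols.foldLeft(Vector.empty[String]) { (kept, c) =>
    if (kept.exists(_.equalsIgnoreCase(c))) kept else kept :+ c
  }

val op_cols = List("id", "col2", "col3", "col4", "col5", "ID")
val uniqueCols = dedupIgnoreCase(op_cols)

// In spark-shell, with df1 defined as in the question:
// val df2 = df1.select(uniqueCols.head, uniqueCols.tail: _*)
// df2.select("id").show()
```

The helper itself is plain Scala, so it can be checked in any REPL before wiring it into the Spark job.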