dataframe, pyspark, databricks

PySpark dataframe withColumn being referenced back on Databricks


Good morning,

I am noticing some odd behavior that I don't think happened previously. When building a dataframe, this works:

df1 = df1.withColumn("new1", func1(df1["old1"]))
df1 = df1.withColumn("new2", func2(df1["new1"]))

However, this throws an error:

df1 = df1.withColumn("new1", func1(df1["old1"])).withColumn("new2", func2(df1["new1"]))

The error is that new1 doesn't exist when it is referenced in the second withColumn. I feel like that didn't used to be the case, unless I just never noticed it. Is there anything I am missing here?


Solution

  • As samkart stated in their comment, the interim column new1 has not yet been created on df1 when you reference it: in the chained expression, df1 still refers to the original DataFrame, which does not contain new1.

    You should use col to reference the new column by name instead.

    Here is a complete reprex:

    %python
    
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import LongType
    
    data = [{"x_0": 1}, {"x_0": 4}]
    data = spark.createDataFrame(data)
    
    def add_one(x):
        return x + 1
    
    add_one_udf = udf(add_one, LongType())
    
    # col("x_1") is resolved against the DataFrame produced by the previous
    # step, so the second withColumn can reference the column created by the first
    data \
        .withColumn("x_1", add_one_udf(col("x_0"))) \
        .withColumn("x_2", add_one_udf(col("x_1"))) \
        .show()
    
    +---+---+---+
    |x_0|x_1|x_2|
    +---+---+---+
    |  1|  2|  3|
    |  4|  5|  6|
    +---+---+---+
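
    Applied back to the original snippet, the chained version works once the interim column is referenced by name with col rather than through df1. This is only a sketch, assuming func1 and func2 take column expressions as in the question:

    from pyspark.sql.functions import col

    # Sketch only: func1/func2 stand in for the functions from the question
    df1 = df1 \
        .withColumn("new1", func1(col("old1"))) \
        .withColumn("new2", func2(col("new1")))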