Good morning,
I am noticing some odd behavior that I don't think happened previously. When building a dataframe, this works:
df1 = df1.withColumn("new1",func1(df1["old1"])
df1 = df1.withColumn("new2",func2(df1["new1"])
However, this throws an error:
df1 = df1.withColumn("new1",func1(df1["old1"]).withColumn("new2",func2(df1["new1"])
The error says that new1 does not exist when it is referenced for the second column. I feel like that didn't use to be the case, unless I just never noticed it. Is there anything I am missing here?
As samkart stated in their comment, the interim column has not yet been created on df1 at the point where you reference it: within a single chained expression, df1["new1"] is resolved against the original df1, which never gains that column. You should use col for this purpose, because col("new1") is an unresolved reference that gets resolved against the DataFrame the expression is actually applied to.
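For example, the chained version from your question works once the interim reference goes through col. A minimal sketch, keeping your func1 and func2 and assuming they are column functions as in the question:

from pyspark.sql.functions import col

# col("new1") resolves against the DataFrame being built, so the column
# added by the first withColumn is visible to the second, even within a
# single chained call.
df1 = (
    df1
    .withColumn("new1", func1(col("old1")))
    .withColumn("new2", func2(col("new1")))
)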
Here is a complete reprex:
%python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

data = [{"x_0": 1}, {"x_0": 4}]
data = spark.createDataFrame(data)

# Wrap a plain Python function as a Spark UDF that returns a long.
def add_one(x):
    return x + 1

add_one_udf = udf(add_one, LongType())
data \
.withColumn("x_1", add_one_udf(col("x_0"))) \
.withColumn("x_2", add_one_udf(col("x_1"))) \
.show()
+---+---+---+
|x_0|x_1|x_2|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
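As an aside, if the interim column is only needed to feed the next step, you can skip the name entirely by nesting the calls. A minimal sketch using the same udf:

# x_2 is computed directly from x_0; no interim column is created.
data.withColumn("x_2", add_one_udf(add_one_udf(col("x_0")))).show()

Either way, the key point is that col("x_1") is resolved lazily against the DataFrame the expression is applied to, while data["x_1"] is bound eagerly to the original data and fails if that column does not exist yet.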