dataframe, pyspark, databricks

PySpark dataframe withColumn being referenced back on Databricks


Good morning,

I am noticing some odd behavior that I don't think happened previously. When building a dataframe, this works:

df1 = df1.withColumn("new1", func1(df1["old1"]))
df1 = df1.withColumn("new2", func2(df1["new1"]))

However, this throws an error:

df1 = df1.withColumn("new1", func1(df1["old1"])).withColumn("new2", func2(df1["new1"]))

The error is that new1 doesn't exist when it is referenced in the second withColumn. I feel like that didn't used to be the case, unless I just never noticed it. Is there anything I am missing here?


Solution

  • As samkart stated in their comment, the interim column new1 has not yet been created on df1 when you reference it: in the chained expression, df1 still refers to the original DataFrame, which does not contain new1.

    You should use col to reference the new column by name instead.

    Here is a complete reprex:

    %python
    
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import LongType
    
    data = [{"x_0": 1}, {"x_0": 4}]
    data = spark.createDataFrame(data)
    
    def add_one(x):
        return x + 1
    
    add_one_udf = udf(add_one, LongType())
    
    # col("x_1") is resolved against the DataFrame produced by the previous
    # step, so the second withColumn can reference the column created by the first
    data \
        .withColumn("x_1", add_one_udf(col("x_0"))) \
        .withColumn("x_2", add_one_udf(col("x_1"))) \
        .show()
    
    +---+---+---+
    |x_0|x_1|x_2|
    +---+---+---+
    |  1|  2|  3|
    |  4|  5|  6|
    +---+---+---+
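
    Applied back to the original snippet, the chained version works once the interim column is referenced by name with col rather than through df1. This is only a sketch, assuming func1 and func2 take column expressions as in the question:

    from pyspark.sql.functions import col

    # Sketch only: func1/func2 stand in for the functions from the question
    df1 = df1 \
        .withColumn("new1", func1(col("old1"))) \
        .withColumn("new2", func2(col("new1")))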