I'm noticing my code repo is warning me that using withColumn in a for/while loop is an antipattern. Why is this not recommended? Isn't this a normal use of the PySpark API?
We've noticed in practice that using withColumn inside a for/while loop leads to poor query planning performance, as discussed here. This is not obvious when writing code for the first time in Foundry, so we've built a feature to warn you about this behavior.
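If you want to see the effect yourself, here's a minimal sketch (assuming a local SparkSession and a toy DataFrame, both hypothetical) that compares the plans: the loop version accumulates one projection per withColumn call, while a single select produces just one.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

# Loop version: each withColumn adds another projection to the plan.
looped = df
for col_name in df.columns:
    looped = looped.withColumn(col_name + "_suffix", F.col(col_name))
looped.explain(extended=True)   # one Project node per withColumn call

# Single-select version: all the new columns land in one projection.
selected = df.select(
    "*",
    *[F.col(col_name).alias(col_name + "_suffix") for col_name in df.columns],
)
selected.explain(extended=True)  # a single Project node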
We'd recommend you follow the Scala docs recommendation:
withColumn(colName: String, col: Column): DataFrame
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
Since: 2.0.0
Note: this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.
i.e.
from pyspark.sql import functions as F

my_other_columns = [...]

df = df.select(
    # keep every column that isn't being suffixed
    *[col_name for col_name in df.columns if col_name not in my_other_columns],
    # add the suffixed copies in the same single projection
    *[F.col(col_name).alias(col_name + "_suffix") for col_name in my_other_columns],
)
is vastly preferred over
my_other_columns = [...]

for col_name in my_other_columns:
    # each withColumn call introduces another projection in the query plan
    df = df.withColumn(col_name + "_suffix", F.col(col_name))
While this may technically be a normal use of the PySpark API, it will result in poor query planning performance if withColumn is called too many times in your job, so we'd prefer you avoid this problem entirely.
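As an aside, if you're on Spark 3.3.0 or newer, DataFrame.withColumns accepts a dict of column names to expressions and adds them all in a single projection, so it avoids the plan blow-up without the list comprehensions (check which Spark version your Foundry environment runs before relying on it):

from pyspark.sql import functions as F

my_other_columns = [...]

# Spark 3.3+ only: one projection for all the new columns, equivalent to the select above.
df = df.withColumns(
    {col_name + "_suffix": F.col(col_name) for col_name in my_other_columns}
)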