If we just implement a simple function to update columns (in place) in PySpark, we can use either the `when` syntax, e.g.

```python
from pyspark.sql.functions import col, when

df = df.withColumn("col_name", when(col("reference") == 1, False).otherwise(col("col_name")))
```
or a `udf`, e.g.

```python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

def update_col(reference, col_name):
    if reference == 1:
        return False
    else:
        return col_name

update_udf = udf(update_col, BooleanType())
df = df.withColumn("col_name", update_udf(col("reference"), col("col_name")))
```
Assume `df` is pretty big, say a billion rows.

Which one should we use? Has anyone tried both approaches and compared their performance, e.g. speed and memory cost? Thanks!
In addition to samkart's comment to prefer Spark SQL wherever possible: UDFs, regardless of language, are a black box for Spark, so it cannot optimise their usage.
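You can see this for yourself by comparing the physical plans of the two approaches. The sketch below is only illustrative and uses a small toy DataFrame (not the one from the question): the `when` version is handled by Catalyst as a normal expression, while the Python UDF shows up as an opaque BatchEvalPython step.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, True), (2, False)], ["reference", "col_name"])

# Native expression: Catalyst sees the whole predicate and can optimise it.
df.withColumn(
    "col_name",
    F.when(F.col("reference") == 1, F.lit(False)).otherwise(F.col("col_name")),
).explain()

# Python UDF: the plan contains a BatchEvalPython step whose body Spark
# cannot inspect, push down, or fuse with other expressions.
update_udf = F.udf(lambda reference, col_name: False if reference == 1 else col_name, BooleanType())
df.withColumn("col_name", update_udf(F.col("reference"), F.col("col_name"))).explain()
```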
Moreover, do not use withColumn in loops or maps to modify columns; it is exceptionally expensive. Prefer select with the exact Columns you need. Later Spark APIs provide withColumns, which should also be preferred over withColumn; a sketch of both is below.
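A minimal sketch of what that could look like, assuming Spark 3.3+ for withColumns and the same reference/col_name columns as in the question (illustrative only, not a benchmarked recommendation):

```python
from pyspark.sql import functions as F

# Build every replacement expression up front so the update happens in one pass.
updates = {
    "col_name": F.when(F.col("reference") == 1, F.lit(False)).otherwise(F.col("col_name")),
}

# Spark 3.3+: apply all updates at once instead of chaining withColumn calls.
df_updated = df.withColumns(updates)

# Equivalent select-based form for older Spark versions, keeping column order.
df_selected = df.select(
    *[(updates[c].alias(c) if c in updates else F.col(c)) for c in df.columns]
)
```

Each withColumn call adds and re-analyses a new projection in the plan, which is what makes it so expensive in a loop; the forms above express all the updates as a single projection.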