apache-spark, pyspark

UDF or withColumn? Which is better for updating columns in PySpark?


If we want to update a column (in place) with a simple rule in PySpark, we can use either:

  1. the when syntax, e.g.

     from pyspark.sql.functions import when, col

     df.withColumn("col_name", when(col("reference") == 1, False).otherwise(col("col_name")))

  2. a udf, e.g.

     from pyspark.sql.functions import udf, col
     from pyspark.sql.types import BooleanType

     def update_col(reference, col_name):
         if reference == 1:
             return False
         else:
             return col_name

     update_udf = udf(update_col, BooleanType())
     df.withColumn("col_name", update_udf(col("reference"), col("col_name")))

Assume df is pretty big, e.g. billions of rows.

Which one should we use? Has anyone tried both approaches and compared their performance, e.g. speed and memory cost? Thanks!


Solution

  • In addition to samkart's comment to prefer Spark SQL wherever possible: UDFs, regardless of language, are a black box for Spark, so it cannot optimise their usage (a sketch comparing the two approaches follows this answer).

    Moreover, do not use withColumn in loops or maps to modify columns; it is exceptionally expensive. Prefer select with the exact Columns you need. Later Spark APIs provide withColumns, which should also be preferred over withColumn (see the second sketch below).
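
A minimal sketch of the native expression versus the UDF, using a toy DataFrame with the reference and col_name columns from the question (the sample data is made up). The when/otherwise version is a pure Column expression that Catalyst can optimise, while the Python UDF is evaluated row by row outside the JVM:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import when, col, udf
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.getOrCreate()

    # Toy data standing in for the billion-row DataFrame in the question.
    df = spark.createDataFrame([(1, True), (0, True), (1, False)],
                               ["reference", "col_name"])

    # Native expression: stays inside Catalyst/Tungsten, so Spark can
    # optimise and code-generate it.
    native = df.withColumn(
        "col_name",
        when(col("reference") == 1, False).otherwise(col("col_name")))

    # Python UDF: a black box that forces per-row shipping of data to Python.
    update_udf = udf(lambda reference, col_name: False if reference == 1 else col_name,
                     BooleanType())
    with_udf = df.withColumn("col_name", update_udf(col("reference"), col("col_name")))

    # Compare the physical plans: the UDF version contains a BatchEvalPython
    # step, the native version is a plain projection over the scan.
    native.explain()
    with_udf.explain()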
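
And a rough sketch of the select / withColumns point, with hypothetical column names. Each withColumn call adds another projection to the plan, whereas a single select (or withColumns, available since Spark 3.3) expresses all the changes at once:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

    # Anti-pattern: one projection per withColumn call; with many columns
    # the plan grows and analysis time blows up.
    looped = df
    for c in df.columns:
        looped = looped.withColumn(c, col(c) * 2)

    # Preferred: a single select with exactly the Columns you need.
    selected = df.select([(col(c) * 2).alias(c) for c in df.columns])

    # Spark 3.3+: withColumns takes a dict and applies everything in one pass.
    replaced = df.withColumns({c: col(c) * 2 for c in df.columns})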