apache-spark, pyspark, apache-spark-sql

pyspark -- best way to sum values in column of type Array(Integer())


Let's say this is my dataframe:

name | scores
Dan  | [10, 5, 2, 12]
Ann  | [12, 3, 5]
Jon  | []

The desired output is something like:

name | scores          | Total
Dan  | [10, 5, 2, 12]  | 29
Ann  | [12, 3, 5]      | 20
Jon  | []              | 0
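
For reference, the dataframe can be built along these lines (a minimal sketch; the schema follows the Array(Integer()) column type mentioned in the title):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('name', StringType()),
    StructField('scores', ArrayType(IntegerType())),
])

df = spark.createDataFrame(
    [('Dan', [10, 5, 2, 12]), ('Ann', [12, 3, 5]), ('Jon', [])],
    schema=schema,
)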

I made a UDF along the lines of:

sum_cols = udf(lambda arr: 0 if arr == [] else __builtins__.sum(arr), IntegerType())

df.withColumn('Total', sum_cols(col('scores'))).show()

However, I have learned that UDFs are relatively slow compared to native PySpark functions.

Is there any way to do the above in PySpark without a UDF?


Solution

  • You can use the higher-order SQL function AGGREGATE (known as reduce or fold in functional programming, available since Spark 2.4), like this:

    import pyspark.sql.functions as F
    df = df.select(
      'name',
      F.expr('AGGREGATE(scores, 0, (acc, x) -> acc + x)').alias('Total')
    )
    

    The first argument is the array column, the second is the initial value (it should have the same type as the values you are summing, so you may need to use "0.0" or "DOUBLE(0)" etc. if your inputs are not integers), and the third is a lambda function that adds each element of the array to an accumulator variable (which starts out at the initial value, 0 here).
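
    For example, if scores were an array of doubles instead of integers, the same call could look like this (a sketch; df_double stands for a hypothetical dataframe whose scores column is array<double>):

    df_double = df_double.select(
      'name',
      F.expr('AGGREGATE(scores, DOUBLE(0), (acc, x) -> acc + x)').alias('Total')
    )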

    The whole transformation runs in a single projection operator, so it is very efficient. Also, you do not need to know the size of the arrays in advance, and the arrays can have a different length on each row.
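
    If you are on Spark 3.1 or later, the same higher-order function is also available directly in the Python API as F.aggregate, which avoids the SQL expression string; here is a sketch that keeps the scores column by using withColumn:

    import pyspark.sql.functions as F

    # Fold the array into a single value: start from lit(0) and
    # add each element x to the accumulator acc.
    df = df.withColumn('Total', F.aggregate('scores', F.lit(0), lambda acc, x: acc + x))
    df.show()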