Tags: python, pyspark, apache-spark-sql

PySpark dataframe aggregations


I'm using Spark 3.4 and created the DataFrame below:

df.show()

Schema: ID (String), output (Boolean)

+---+------+
| ID|output|
+---+------+
| AA|  true|
| AA| false|
| BB|  true|
| BB|  true|
| CC|  true|
| CC| false|
| CC|  true|
+---+------+

I would like to group by the ID column and aggregate the values in the output column: if all the values in output are true for a given ID, the result should be true, otherwise false.

The expected output is:

ID  result

AA  false
BB  true
CC  false

What is the best way (window functions / UDFs) to get the desired output using PySpark? I appreciate your help!
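
For reference, the sample DataFrame above can be reproduced with something like this (a minimal sketch):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample rows matching the table above: ID (string), output (boolean)
    data = [
        ("AA", True), ("AA", False),
        ("BB", True), ("BB", True),
        ("CC", True), ("CC", False), ("CC", True),
    ]
    df = spark.createDataFrame(data, ["ID", "output"])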


Solution

  • A simple groupBy with an aggregate covers this scenario, so a UDF is not necessary: taking MIN of the Boolean output column yields false if any value is false, and true only when every value is true.

    from pyspark.sql import functions as F

    # min over a Boolean column is false if any row is false, true only if all rows are true
    result_df = df.groupBy("ID").agg(F.min("output").alias("result"))
    result_df.show()
    

    Output:

    +---+------+
    | ID|result|
    +---+------+
    | AA| false|
    | BB|  true|
    | CC| false|
    +---+------+
    

    The same can be done with Spark SQL as well:

    SELECT ID, MIN(output) AS result
    FROM my_table
    GROUP BY ID;
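
    To run that SQL from PySpark, the table name my_table just needs to be registered as a temporary view first (a minimal sketch using the name from the query above):

    # Expose the DataFrame to Spark SQL under the name used in the query.
    df.createOrReplaceTempView("my_table")

    result_df = spark.sql("""
        SELECT ID, MIN(output) AS result
        FROM my_table
        GROUP BY ID
    """)
    result_df.show()

    As an aside, Spark SQL also has a BOOL_AND aggregate (available since Spark 3.0) that states "all values are true" directly, e.g. BOOL_AND(output) in SQL or F.expr("bool_and(output)") in the DataFrame API; MIN works here simply because false sorts before true.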