I'm using Spark 3.4 and have created the DataFrame below:
df.show()
Schema: ID is a String, output is a Boolean.
ID output
AA true
AA false
BB true
BB true
CC true
CC false
CC true
I would like to group by the ID column and aggregate the output column: if every output value for an ID is true, the result should be true; otherwise it should be false.
Expected output:
ID result
AA false
BB true
CC false
What is the best way (window functions or UDFs) to get the desired output using PySpark? I appreciate your help!
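A plain groupBy with a built-in aggregate is enough here, so neither a UDF nor a window function is needed. Spark orders booleans as false < true, so the minimum of output within a group is true only when every value in that group is true.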
from pyspark.sql import functions as F
result_df = df.groupBy("ID").agg(F.min("output").alias("result"))
result_df.show()
Output:
+---+------+
| ID|result|
+---+------+
| AA| false|
| BB|  true|
| CC| false|
+---+------+
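To see why taking the minimum behaves like "all values are true", here is a minimal plain-Python sketch of the same grouping logic on the sample data from the question. It does not use Spark at all; the names rows and all_true_by_id are made up for illustration and are not part of any PySpark API.

```python
from collections import defaultdict

# Sample rows from the question as (ID, output) pairs.
rows = [
    ("AA", True), ("AA", False),
    ("BB", True), ("BB", True),
    ("CC", True), ("CC", False), ("CC", True),
]

def all_true_by_id(pairs):
    """Group the flags by ID and take min() of each group (hypothetical helper)."""
    groups = defaultdict(list)
    for key, flag in pairs:
        groups[key].append(flag)
    # False < True, so min() is True only if every flag in the group is True.
    return {key: min(flags) for key, flags in groups.items()}

result = all_true_by_id(rows)
print(result)
```

The per-ID values match the result column above: a single false in a group pulls the minimum down to false.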
The same can be done in Spark SQL, assuming the DataFrame is registered as a table or view named my_table:
SELECT ID, MIN(output) AS result
FROM my_table
GROUP BY ID;