I have employee data like below. I want to group the below data by EMP_ID and if 'Status' of this grouped EMP_ID has the value 'Not Done' then entire 'overall_status' for the EMP_ID should be marked as 'Not Done'. How can I achieve this in Dataframe or SparkSql?
Input:
EMP_ID | Status |
---|---|
1 | Done |
1 | Not Done |
1 | Done |
2 | Done |
2 | Done |
Expected Output:
EMP_ID | Status | overall_status |
---|---|---|
1 | Done | Not Done |
1 | Not Done | Not Done |
1 | Done | Not Done |
2 | Done | Done |
2 | Done | Done |
I have tried to solve it using a simple strategy.
I have done a groupby on EMP_ID
and collected distinct values of the Status
columns.
Then I created a overall_status
column based on the fact if the distinct values contain Not Done
then that column will have value Not Done
else Done
.
Then joined this created dataframe with the original on EMP_ID
import sys
from pyspark import SparkContext, SQLContext
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
sc = SparkContext('local')
sqlContext = SQLContext(sc)
data1 = [
["1", "Done"],
["1", "Not Done"],
["1", "Done"],
["2", "Done"],
["2", "Done"],
]
df1Columns = ["EMP_ID", "Status"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
df1.show(n=100, truncate=False)
df1_unique_values = df1.groupby("EMP_ID").agg(F.collect_set( F.col("Status")).alias("distinct_status")) \
.withColumn("overall_status", F.when( F.array_contains( F.col("distinct_status"), "Not Done"), "Not Done").otherwise("Done") ).drop("distinct_status")
df1_unique_values.show(n=100, truncate=False)
df1_final = df1.join(df1_unique_values, on=["EMP_ID"])
df1_final.show(n=100, truncate=False)
Output:
+------+--------+
|EMP_ID|Status |
+------+--------+
|1 |Done |
|1 |Not Done|
|1 |Done |
|2 |Done |
|2 |Done |
+------+--------+
+------+--------------+
|EMP_ID|overall_status|
+------+--------------+
|1 |Not Done |
|2 |Done |
+------+--------------+
+------+--------+--------------+
|EMP_ID|Status |overall_status|
+------+--------+--------------+
|1 |Done |Not Done |
|1 |Not Done|Not Done |
|1 |Done |Not Done |
|2 |Done |Done |
|2 |Done |Done |
+------+--------+--------------+