apache-sparkpysparkapache-spark-sqlaws-glueapache-spark-dataset

PySpark: Find specific value in a grouped data and mark entire group as different value


I have employee data like below. I want to group the below data by EMP_ID and if 'Status' of this grouped EMP_ID has the value 'Not Done' then entire 'overall_status' for the EMP_ID should be marked as 'Not Done'. How can I achieve this in Dataframe or SparkSql?

Input:

EMP_ID Status
1 Done
1 Not Done
1 Done
2 Done
2 Done

Expected Output:

EMP_ID Status overall_status
1 Done Not Done
1 Not Done Not Done
1 Done Not Done
2 Done Done
2 Done Done

Solution

  • I have tried to solve it using a simple strategy.

    I have done a groupby on EMP_ID and collected distinct values of the Status columns.

    Then I created a overall_status column based on the fact if the distinct values contain Not Done then that column will have value Not Done else Done.

    Then joined this created dataframe with the original on EMP_ID

    import sys
    from pyspark import SparkContext, SQLContext
    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf
    
    
    sc = SparkContext('local')
    sqlContext = SQLContext(sc)
    
    
    data1 = [
    
    ["1",   "Done"],
    ["1",   "Not Done"],
    ["1",   "Done"],
    ["2",   "Done"],
    ["2",   "Done"],
          ]
    
    df1Columns = ["EMP_ID", "Status"]
    df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
    
    
    df1.show(n=100, truncate=False)
    
    
    df1_unique_values = df1.groupby("EMP_ID").agg(F.collect_set( F.col("Status")).alias("distinct_status")) \
                            .withColumn("overall_status", F.when(  F.array_contains( F.col("distinct_status"), "Not Done"), "Not Done").otherwise("Done")  ).drop("distinct_status")
    
    df1_unique_values.show(n=100, truncate=False)
    
    
    df1_final =  df1.join(df1_unique_values, on=["EMP_ID"])
    
    
    df1_final.show(n=100, truncate=False)
    

    Output:

    +------+--------+
    |EMP_ID|Status  |
    +------+--------+
    |1     |Done    |
    |1     |Not Done|
    |1     |Done    |
    |2     |Done    |
    |2     |Done    |
    +------+--------+
    
    +------+--------------+
    |EMP_ID|overall_status|
    +------+--------------+
    |1     |Not Done      |
    |2     |Done          |
    +------+--------------+
    
    +------+--------+--------------+
    |EMP_ID|Status  |overall_status|
    +------+--------+--------------+
    |1     |Done    |Not Done      |
    |1     |Not Done|Not Done      |
    |1     |Done    |Not Done      |
    |2     |Done    |Done          |
    |2     |Done    |Done          |
    +------+--------+--------------+