python pyspark apache-spark-sql

How to count unique ID after groupBy in pyspark


I'm using the following code to aggregate students per year. The goal is to get the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The problem is that many Student_IDs are repeated, so fn.count counts every occurrence and the result is inflated and wrong.

I want to group the students by year and count the number of distinct students per year, without counting repeated IDs.
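
For example, on a toy DataFrame with a duplicated ID (the data below is made up for illustration), fn.count returns the number of rows, not the number of students:

from pyspark.sql.functions import col
import pyspark.sql.functions as fn

# hypothetical toy data: Student_ID "s1" appears twice in 2001
toy = spark.createDataFrame(
    [("2001", "s1"), ("2001", "s1"), ("2001", "s2")],
    ["Year", "Student_ID"],
)
toy.groupby(["Year"]).agg(
    fn.count(col("Student_ID")).alias("total_student_by_year")
).show()
# total_student_by_year is 3, but there are only 2 distinct students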


Solution

  • Use the countDistinct function

    from pyspark.sql.functions import countDistinct

    # sample data: (year, student id) pairs with several duplicate IDs
    x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
    y = spark.createDataFrame(x, ["year", "id"])

    # each id is counted only once per year
    gr = y.groupBy("year").agg(countDistinct("id"))
    gr.show()
    

    Output:

    +----+------------------+
    |year|count(DISTINCT id)|
    +----+------------------+
    |2002|                 2|
    |2001|                 2|
    +----+------------------+
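
  • If you need the column named as in the question, alias the aggregate. On very large DataFrames, approx_count_distinct trades a small error for a much cheaper computation. A minimal sketch reusing the y DataFrame above (the rsd value is just an illustrative tolerance):

    from pyspark.sql.functions import countDistinct, approx_count_distinct

    # exact distinct count, renamed to match the original code
    y.groupBy("year").agg(
        countDistinct("id").alias("total_student_by_year")
    ).show()

    # approximate distinct count (HyperLogLog-based), cheaper at scale;
    # rsd is the maximum allowed relative standard deviation
    y.groupBy("year").agg(
        approx_count_distinct("id", rsd=0.05).alias("total_student_by_year")
    ).show()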