Tags: python, pyspark, data-manipulation, repeat

Counting repetitions in PySpark


I'm currently working with a large DataFrame and have run into an issue.

I want to return the number of times (count) each value is repeated in a table.

For example: the number 10 is repeated twice, so I want to get the number 2, and so on...

My code is:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

right_table_23 = [
    ("ID1", 2),
    ("ID2", 3),
    ("ID3", 5),
    ("ID4", 6),
    ("ID6", 10),
    ("ID8", 15),
    ("ID9", 10),
    ("ID10", 5),
    ("ID2", 5),
    ("ID3", 8),
    ("ID4", 3),
    ("ID2", 2),
    ("ID3", 4),
    ("ID4", 3)
]

A schema for the table showed above:

schema = StructType([
    StructField("ID", StringType(), True),
    StructField("Count", IntegerType(), True)
    ])

Next I create my table with the following code:

df_right_table_23 = spark.createDataFrame(right_table_23, schema)

In order to count the number of repetitions I use the following code:

# Count the rows where the 'Count' column equals 2
# (bracket access avoids clashing with the DataFrame's count() method)
df_right_table_23.where(df_right_table_23["Count"] == 2).count()

But if the range of values includes numbers from 2 up to 100, it is hard and time-consuming to rewrite the above code for every value.

Is it possible to somehow automate the process of counting repetitions?


Solution

  • You don't need to stress: you can automate counting the repetitions of each value in your DataFrame with the good old groupBy and count functions in PySpark.

    You're almost there; here is a code snippet to help you:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    
    # Create a Spark session
    spark = SparkSession.builder.appName("CountRepetitions").getOrCreate()
    
    # Your schema
    schema = StructType([
        StructField("ID", StringType(), True),
        StructField("Value", IntegerType(), True)  # renaming 'Count' to 'Value' avoids clashing with the count() method
    ])
    
    # Build the DataFrame from the data list (not from an existing DataFrame)
    df_right_table_23 = spark.createDataFrame(right_table_23, schema)
    
    # Group by both 'ID' and 'Value' and count the occurrences of each pair
    result = df_right_table_23.groupBy("ID", "Value").count()
    
    # Rename the 'count' column to 'Occurrences' for clarity
    result = result.withColumnRenamed("count", "Occurrences")
    
    # Display the result
    result.show()
    
    

    Output:

    +----+-----+-----------+
    |  ID|Value|Occurrences|
    +----+-----+-----------+
    | ID1|    2|          1|
    | ID2|    3|          1|
    | ID3|    5|          1|
    | ID6|   10|          1|
    | ID4|    6|          1|
    | ID9|   10|          1|
    | ID8|   15|          1|
    |ID10|    5|          1|
    | ID3|    8|          1|
    | ID2|    5|          1|
    | ID4|    3|          2|
    | ID2|    2|          1|
    | ID3|    4|          1|
    +----+-----+-----------+
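    If you only need how many times each value repeats across the whole table (the literal ask: 10 appears twice, so you want 2), group on the value column alone instead of the (ID, Value) pair. A minimal, self-contained sketch using the question's data; the app name and the `Repetitions` column name are illustrative choices:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("ValueRepetitions").getOrCreate()

    # Same schema and data as in the question
    schema = StructType([
        StructField("ID", StringType(), True),
        StructField("Count", IntegerType(), True),
    ])
    right_table_23 = [
        ("ID1", 2), ("ID2", 3), ("ID3", 5), ("ID4", 6), ("ID6", 10),
        ("ID8", 15), ("ID9", 10), ("ID10", 5), ("ID2", 5), ("ID3", 8),
        ("ID4", 3), ("ID2", 2), ("ID3", 4), ("ID4", 3),
    ]
    df = spark.createDataFrame(right_table_23, schema)

    # One aggregation replaces the per-value .where(...).count() calls:
    # group on the value column alone and count the rows in each group.
    repetitions = (
        df.groupBy("Count")
          .count()
          .withColumnRenamed("count", "Repetitions")
    )
    repetitions.show()  # e.g. the row for value 10 has Repetitions = 2
    ```

    The result covers every distinct value in one pass, so it scales to any range of values without rewriting the filter for each number.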