pyspark explode yelp

I am trying to determine during which hours of the day most Yelp checkins occur.


First row of the data:

Row(business_id='--1UhMGODdWsrMastO9DZw', date='2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:45:18, 2016-11-18 01:54:50, 2017-04-20 18:39:06, 2017-05-03 17:58:02')

My task is to create a variable hours_by_checkin_count. This should be a PySpark DataFrame, ordered by count and containing 24 rows, with these columns (in this order): hour (the hour of the day as an integer, the hour after midnight being 0) and count (the number of checkins that occurred in that hour).
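The shape I expect is roughly this (the values are just placeholders to show the layout):

+----+-----+
|hour|count|
+----+-----+
|   1|  ...|
| ...|  ...|
+----+-----+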

from pyspark.sql.functions import * 
checkin.select('business_id',datesplit('date').alias('dates')).withColumn('checkin_date',explode('dates'))
hours_by_checkin_count = checkin.withColumn('hour', hour('date')) \
    .groupBy('hour') \
    .count() \
    .orderBy('count', ascending=False)

hours_by_checkin_count = hours_by_checkin_count.limit(24)

Both my output and the number of rows are incorrect. I am expecting the first row to have hour 1.


Solution

  • There are a few small issues in your code that produce the wrong output:

    1. PySpark DataFrame operations are not executed in place (unlike Pandas with inplace=True), so in your second line of code the transformation runs, but the new DataFrame it returns is never assigned to a variable and is lost (see the short sketch after this list).

    2. I don't know exactly what your function datesplit is doing; from the provided context it is not needed, because PySpark's built-in split function can break the date string apart.

    3. Structure your code so that you can debug it. That way you can output intermediate DataFrames and check whether the expected result is produced at each step.
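
    To illustrate points 1 and 3, here is a minimal sketch (the column x is a hypothetical placeholder): assign each transformation back to a variable, and call show() on intermediate DataFrames to verify each step.

    from pyspark.sql.functions import lit

    checkin.withColumn('x', lit(1))            # returns a new DataFrame that is thrown away
    checkin = checkin.withColumn('x', lit(1))  # assigning keeps the new DataFrame

    checkin.show(5)  # inspect the intermediate result before moving on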

    Here is the code showing how you can create your desired DataFrame, group it, count it, and then order it.

    # Split the comma-separated 'date' string into an array of timestamps
    # and explode it, so each checkin gets its own 'checkin_date' row
    from pyspark.sql.functions import explode, hour, split

    checkin = checkin.select('business_id', explode(split('date', ', ')).alias('checkin_date'))

    # Add a new 'hour' column by extracting the hour from 'checkin_date'
    checkin = checkin.withColumn('hour', hour('checkin_date'))

    # Group by 'hour', count the checkins in each hour,
    # and order by count (descending), as the task requires
    hours_by_checkin_count = checkin.groupBy('hour').count().orderBy('count', ascending=False)
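
    You can then verify the result; show(24) prints all 24 rows with the busiest hour first:

    hours_by_checkin_count.show(24)
    print(hours_by_checkin_count.count())  # should print 24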