scala, apache-spark, apache-spark-sql, buckets

Spark Bucketizer - show all buckets even if there are no elements


I have a list of events, something like the one shown below, generated from a Spark DataFrame. I am using Spark 2.2.0 with Scala.

import org.apache.spark.sql.functions._

val events = df.select($"event", hour($"time") as "hour", to_date($"time", "yyyy-MM-dd") as "day")

+------+----+----------+
| event|hour|       day|
+------+----+----------+
|event1|  18|2015-02-05|
|event1|  17|2015-02-19|
|event5|  18|2015-02-02|
|event5|  19|2015-02-02|
|event1|   1|2015-03-17|
|event1|   0|2015-02-03|
|event1|  20|2015-02-02|
|event1|  22|2015-02-02|
|event1|  23|2015-02-02|
|event1|  18|2015-02-09|
|event1|  19|2015-02-09|
|event1|  21|2015-02-09|
|event1|  21|2015-04-06|
|event1|  23|2015-02-09|
|event1|  20|2015-02-16|
|event2|  19|2015-02-12|
|event3|  18|2015-02-18|
|event1|  22|2015-02-16|
|event2|  17|2015-02-04|
|event1|  23|2015-02-16|
+------+----+----------+
only showing top 20 rows

I need to create hourly buckets and count how many events happen in every hour. So my approach was to create 24 buckets and count the events that fall into each hour slot, as shown below.

import org.apache.spark.ml.feature.Bucketizer

// 25 split points (0.0, 1.0, ..., 24.0) define 24 buckets: [0, 1), [1, 2), ..., [23, 24]
val splits = (0 to 24).map(_ * 1.0).toArray
val bucketizer = new Bucketizer()
    .setInputCol("hour")
    .setOutputCol("bucket")
    .setSplits(splits)

val bucket = bucketizer.transform(events)

// count events per (day, bucket); buckets with no events produce no rows at all
val result = bucket.groupBy($"day", $"bucket").agg(count($"event").as("count")).orderBy(asc("bucket"))

result.filter($"day" === "2015-05-21").orderBy(asc("bucket")).show()

The result of the above code is:

+----------+------+-----+
|       day|bucket|count|
+----------+------+-----+
|2015-05-21|   0.0|    1|
|2015-05-21|   2.0|    1|
|2015-05-21|  11.0|    1|
|2015-05-21|  17.0|    1|
|2015-05-21|  18.0|    4|
|2015-05-21|  19.0|    4|
|2015-05-21|  21.0|    1|
|2015-05-21|  22.0|    3|
|2015-05-21|  23.0|    1|
+----------+------+-----+

This is correct. However, what I was expecting as output is something like this:

+----------+------+-----+
|       day|bucket|count|
+----------+------+-----+
|2015-05-21|   0.0|    1|
|2015-05-21|   1.0|    0|
|2015-05-21|   2.0|    1|
|2015-05-21|   3.0|    0|
|2015-05-21|   4.0|    0|
|2015-05-21|   5.0|    0|
:          :      :     :
|2015-05-21|  11.0|    1|
|2015-05-21|  12.0|    0|
|2015-05-21|  13.0|    0|
:          :      :     :
|2015-05-21|  17.0|    1|
|2015-05-21|  18.0|    4|
|2015-05-21|  19.0|    4|
|2015-05-21|  20.0|    0|
|2015-05-21|  21.0|    1|
|2015-05-21|  22.0|    3|
|2015-05-21|  23.0|    1|
+----------+------+-----+

Basically, the bins (buckets) that have no events should be populated with 0. Any idea how to achieve that?

Thank you!


Solution

  • This is my current solution, without using Bucketizer (I admit it is not very pretty):

    import org.apache.spark.sql.functions._

    val events = df.select($"event", hour($"time") as "hour", to_date($"time", "yyyy-MM-dd") as "day")

    // left side: all 25 hour slots (0 to 23 would suffice; the hour-24 row is always empty)
    val left = (0 to 24).toDF("hour")
    // right side: observed counts per hour for the day of interest
    val right = events.filter($"day" === "2015-05-21").groupBy($"hour").agg(count("event").as("count")).orderBy(asc("hour"))

    // keep every hour slot, zero-filling hours with no events; show(25) prints all 25 rows
    left.join(right, Seq("hour"), "left_outer").na.fill(0, Seq("count")).show(25)
    

    And this code returns the output shown below:

    +----+-----+
    |hour|count|
    +----+-----+
    |   0|    1|
    |   1|    0|
    |   2|    1|
    |   3|    0|
    |   4|    0|
    |   5|    0|
    |   6|    0|
    |   7|    0|
    |   8|    0|
    |   9|    0|
    |  10|    0|
    |  11|    1|
    |  12|    0|
    |  13|    0|
    |  14|    0|
    |  15|    0|
    |  16|    0|
    |  17|    1|
    |  18|    4|
    |  19|    4|
    |  20|    0|
    |  21|    1|
    |  22|    3|
    |  23|    1|
    |  24|    0|
    +----+-----+

    This was my expected result. If anyone can come up with a nicer solution, I will accept that answer.
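    As a sketch of a slightly more general variant of the same left-join idea (the names hours, days, grid, and counts below are illustrative, not from my code above), the distinct days can be cross joined with the 24 hour slots so that every day gets a full set of buckets at once:

    // Sketch: zero-fill every (day, hour) combination rather than one filtered day.
    // Assumes `events` as defined above and `spark.implicits._` in scope.
    val hours = (0 to 23).toDF("hour")
    val days = events.select($"day").distinct()

    val grid = days.crossJoin(hours)  // full (day, hour) grid
    val counts = events.groupBy($"day", $"hour").agg(count("event").as("count"))

    grid.join(counts, Seq("day", "hour"), "left_outer")
        .na.fill(0, Seq("count"))
        .orderBy($"day", $"hour")
        .show()

    crossJoin is available from Spark 2.1, so this works on 2.2.0 as well.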

    Thank you!