sqlapache-sparkapache-spark-sqlcuberollup

What is the difference between cube, rollup and groupBy operators?


I can't find any detailed documentation regarding the differences.

I do notice a difference, because when interchanging cube and groupBy function calls, I get different results. I noticed that for the result using cube, I got a lot of null values on the expressions where I used to use groupBy.


Solution

  • These are not intended to work in the same way. groupBy is simply an equivalent of the GROUP BY clause in standard SQL. In other words

    table.groupBy($"foo", $"bar")
    

    is equivalent to:

    SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar
    

    cube is equivalent to CUBE extension to GROUP BY. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Lets say you have data like this:

    val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")
    
    df.show
    
    // +---+---+
    // |  x|  y|
    // +---+---+
    // |foo|  1|
    // |foo|  2|
    // |bar|  2|
    // |bar|  2|
    // +---+---+
    

    and you compute cube(x, y) with count as an aggregation:

    df.cube($"x", $"y").count.show
    
    // +----+----+-----+     
    // |   x|   y|count|
    // +----+----+-----+
    // |null|   1|    1|   <- count of records where y = 1
    // |null|   2|    3|   <- count of records where y = 2
    // | foo|null|    2|   <- count of records where x = foo
    // | bar|   2|    2|   <- count of records where x = bar AND y = 2
    // | foo|   1|    1|   <- count of records where x = foo AND y = 1
    // | foo|   2|    1|   <- count of records where x = foo AND y = 2
    // |null|null|    4|   <- total count of records
    // | bar|null|    2|   <- count of records where x = bar
    // +----+----+-----+
    

    A similar function to cube is rollup which computes hierarchical subtotals from left to right:

    df.rollup($"x", $"y").count.show
    // +----+----+-----+
    // |   x|   y|count|
    // +----+----+-----+
    // | foo|null|    2|   <- count where x is fixed to foo
    // | bar|   2|    2|   <- count where x is fixed to bar and y is fixed to  2
    // | foo|   1|    1|   ...
    // | foo|   2|    1|   ...
    // |null|null|    4|   <- count where no column is fixed
    // | bar|null|    2|   <- count where x is fixed to bar
    // +----+----+-----+
    

    Just for comparison lets see the result of plain groupBy:

    df.groupBy($"x", $"y").count.show
    
    // +---+---+-----+
    // |  x|  y|count|
    // +---+---+-----+
    // |foo|  1|    1|   <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
    // |foo|  2|    1|   <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
    // |bar|  2|    2|   <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
    // +---+---+-----+
    

    To summarize:

    ROLLUP and CUBE come from data warehousing extensions so if you want to get a better understanding how this works you can also check documentation of your favorite RDMBS. For example PostgreSQL introduced both in 9.5 and these are relatively well documented.