rggplot2dplyrwaterfall

Convert Dataframe to make Waterfall Chart in ggplot2


I want to transform my dataframe into a format that would be suitable for a waterfall chart.

My dataframe is as follows:

employee <- c('A','B','C','D','E','F', 
              'A','B','C','D','E','F',
              'A','B','C','D','E','F',
              'A','B','C','D','E','F')
revenue <- c(10, 20, 30, 40, 10, 40, 
              8, 10, 20, 50, 20, 10,
              2,  5, 70, 30, 10, 50,
             40,  8, 30, 40, 10, 40)
date <- as.Date(c('2017-03-01','2017-03-01','2017-03-01',
                  '2017-03-01','2017-03-01','2017-03-01',
                  '2017-03-02','2017-03-02','2017-03-02',
                  '2017-03-02','2017-03-02','2017-03-02',
                  '2017-03-03','2017-03-03','2017-03-03',
                  '2017-03-03','2017-03-03','2017-03-03',
                  '2017-03-04','2017-03-04','2017-03-04',
                  '2017-03-04','2017-03-04','2017-03-04'))
df <- data.frame(date,employee,revenue)

         date employee revenue
1  2017-03-01        A      10
2  2017-03-01        B      20
3  2017-03-01        C      30
4  2017-03-01        D      40
5  2017-03-01        E      10
6  2017-03-01        F      40
7  2017-03-02        A       8
8  2017-03-02        B      10
9  2017-03-02        C      20
10 2017-03-02        D      50
11 2017-03-02        E      20
12 2017-03-02        F      10
13 2017-03-03        A       2
14 2017-03-03        B       5
15 2017-03-03        C      70
16 2017-03-03        D      30
17 2017-03-03        E      10
18 2017-03-03        F      50
19 2017-03-04        A      40
20 2017-03-04        B       8
21 2017-03-04        C      30
22 2017-03-04        D      40
23 2017-03-04        E      10
24 2017-03-04        F      40

How do I transform this dataframe so that I can get it into a form for a waterfall chart in ggplot2?

The amount column is the difference from the total day by employee.

The end column is the start column minus the amount column.

The start column is the Total end values from previous day.

Final dataframe should look like this:

         date employee     start    end    amount    total_for_day
1  2017-03-01        A         0     10        10               10
2  2017-03-01        B         0     20        20               20
3  2017-03-01        C         0     30        30               30
4  2017-03-01        D         0     40        40               40
5  2017-03-01        E         0     10        10               10
6  2017-03-01        F         0     40        40               40
7  2017-03-01    Total         0    150       150              150
8  2017-03-02        A       150    148        -2                8
9  2017-03-02        B       150    140       -10               10
10 2017-03-02        C       150    140       -10               20
11 2017-03-02        D       150    160        10               50 
12 2017-03-02        E       150    160        10               20
13 2017-03-02        F       150    120       -30               10  
14 2017-03-02    Total       150    118       -32               98
15 2017-03-03        A       118    112        -6                2                      
16 2017-03-03        B       118    113        -5                5                  
17 2017-03-03        C       118    168        50               70
18 2017-03-03        D       118     98       -20               30  
19 2017-03-03        E       118    108       -10               10  
20 2017-03-03        F       118    158        40               50
21 2017-03-03    Total       118    167        49              170  
22 2017-03-04        A       167    205        38               40
23 2017-03-04        B       167    170         3                8
24 2017-03-04        C       167    127       -40               30
25 2017-03-04        D       167    177        10               40
26 2017-03-04        E       167    167         0               10
27 2017-03-04        F       167    157       -10               40 
28 2017-03-04    Total       167    168         1              168

Solution

  • There are a few steps to get you to this, and I think that the dplyr package will help (used heavily below).

    My understanding is that revenue gives the cumulative total revenue, rather than the daily change. If that is wrong, you would need to reverse some of these calculations.

    The first step is to create a new data.frame that calculates the daily totals, then bind that back to the data.frame. Then, you can group_by the employees (including "Total") and add columns that will be created separately for each employee (value on the previous day, the change, and then whether it was an increase or a decrease).

    toPlot <-
      bind_rows(
        df
        , df %>%
          group_by(date) %>%
          summarise(revenue = sum(revenue)) %>%
          mutate(employee = "Total") 
      ) %>%
      group_by(employee) %>%
      mutate(
        previousDay = lag(revenue, default = 0) 
        , change = revenue - previousDay
        , direction = ifelse(change > 0
                             , "Positive"
                             , "Negative"))
    

    returns:

             date employee revenue previousDay change direction
           <date>    <chr>   <dbl>       <dbl>  <dbl>     <chr>
    1  2017-03-01        A      10           0     10  Positive
    2  2017-03-01        B      20           0     20  Positive
    3  2017-03-01        C      30           0     30  Positive
    4  2017-03-01        D      40           0     40  Positive
    5  2017-03-01        E      10           0     10  Positive
    6  2017-03-01        F      40           0     40  Positive
    7  2017-03-02        A       8          10     -2  Negative
    8  2017-03-02        B      10          20    -10  Negative
    9  2017-03-02        C      20          30    -10  Negative
    10 2017-03-02        D      50          40     10  Positive
    # ... with 18 more rows
    

    Then, we can plot that using:

    toPlot %>%
      ggplot(aes(xmin = date - 0.5
                 , xmax = date + 0.5
                 , ymin = previousDay
                 , ymax = revenue
                 , fill = direction)) +
      geom_rect(col = "black"
                , show.legend = FALSE) +
      facet_wrap(~employee
                 , scale = "free_y") +
      scale_fill_brewer(palette = "Set1")
    

    to give

    enter image description here

    Note that including "Total" throws off the scale (requiring the free scales), so I would prefer to omit it:

    toPlot %>%
      filter(employee != "Total") %>%
      ggplot(aes(xmin = date - 0.5
                 , xmax = date + 0.5
                 , ymin = previousDay
                 , ymax = revenue
                 , fill = direction)) +
      geom_rect(col = "black"
                , show.legend = FALSE) +
      facet_wrap(~employee) +
      scale_fill_brewer(palette = "Set1")
    

    For this to allow direct comparsion between employees

    enter image description here

    and this for the overall total

    toPlot %>%
      filter(employee == "Total") %>%
      ggplot(aes(xmin = date - 0.5
                 , xmax = date + 0.5
                 , ymin = previousDay
                 , ymax = revenue
                 , fill = direction)) +
      geom_rect(col = "black"
                , show.legend = FALSE) +
      scale_fill_brewer(palette = "Set1")
    

    enter image description here

    though I still find line graphs to be easier to interpret (especially comparing employees):

    toPlot %>%
      filter(employee != "Total") %>%
      ggplot(aes(x = date
                 , y = revenue
                 , col = employee)) +
      geom_line() +
      scale_fill_brewer(palette = "Dark2")
    

    enter image description here

    If you want to plot the changes themselves by day, you can do:

    toPlot %>%
      filter(employee != "Total") %>%
      ggplot(aes(x = date
                 , y = change
                 , fill = employee)) +
      geom_col(position = "dodge") +
      scale_fill_brewer(palette = "Dark2")
    

    to get:

    enter image description here

    but now you are getting rather far from the "waterfall" plot outputs. If you really, really want to make a waterfall comparable across plots you can, but it is going to be rather ugly (I'd strongly recommend the line plot above instead).

    Here, you need to manually move the boxes around, and this will require some tinkering if you change the output aspect ratio (or size) or the number of employees. You also need to include colors for both the employee and the direction of the change, which starts to look rough. This falls into the category of "can, but probably shouldn't" -- there is likely a better way to display these data.

    toPlot %>%
      filter(employee != "Total") %>%
      ungroup() %>%
      mutate(empNumber = as.numeric(as.factor(employee))) %>%
      ggplot(aes(xmin = (empNumber) - 0.4
                 , xmax = (empNumber) + 0.4
                 , ymin = previousDay
                 , ymax = revenue
                 , col = direction
                 , fill = employee)) +
      geom_rect(size = 1.5) +
      facet_grid(~date) +
      scale_fill_brewer(palette = "Dark2") +
      theme(axis.text.x = element_blank()
            , axis.ticks.x = element_blank())
    

    gives

    enter image description here