rdplyrtidyverse

R Tidyverse Efficient Way to group_by, summarize, mutate, nest Multiple Levels


I am trying to get a large dataset prepped to create an ECharts4R sunburst visualization, which requires a hierarchical data structure. What is an efficient way to group by, summarize, and nest data multiple times?

Below is an example of how the data should look in its final prepped form (code from echarts4r sunburst documentation). They are building it from raw values:

# tibble hierarchical data representation
library(dplyr)
df <- tibble(
  name = c("earth", "mars", "venus"),
  value = c(30, 40, 30),
  # 1st level
  itemStyle = tibble(color = c(NA, "red", "blue")),
  # embedded styles, optional
  children = list(
    tibble(
      name = c("land", "ocean"),
      value = c(10, 20),
      # 2nd level
      children = list(
        tibble(name = c("forest", "river"), value = c(3, 7)),
        # 3rd level
        tibble(
          name = c("fish", "kelp"),
          value = c(10, 5),
          children = list(
            tibble(name = c("shark", "tuna"), value = c(2, 6)),
            # 4th level
            NULL # kelp
          )
        )
      )
    ),
    tibble(name = c("crater", "valley"), value = c(20, 20)),
    NULL # venus
  )
)

I am starting with a dataset:

Using mtcars, this is me trying to get to the same result using the columns am and gear as the root and level 1, respectively

library(tidyverse)

d1 <- 
  rownames_to_column(mtcars)

preppedData <- 
  d1 |> 
  # Root is am
  group_by(
    am
  ) |> 
  summarise(
    mean_mpg = mean(mpg)
  ) |> 
  mutate(
    # Level 1 is gear
    children = list(
      # AM = 0
      bind_rows(
        # Gear 3
        d1 |> 
          filter(
            am == 0,
            gear == 3
            ) |> 
          summarise(mean_mpg = mean(mpg)),
        ## Gear 4
        d1 |> 
          filter(
            am == 0,
            gear == 4
            ) |> 
          summarise(mean_mpg = mean(mpg)),
      ),
      # AM = 1
      bind_rows(
        ## Gear 4
        d1 |> 
          filter(
            am == 1,
            gear == 4
          ) |> 
          summarise(mean_mpg = mean(mpg)),
        ## Gear 5
        d1 |> 
          filter(
            am == 1,
            gear == 5
          ) |> 
          summarise(mean_mpg = mean(mpg))
      )
    )
  )

This kind of coding is going to get unwieldy very fast. It is so repetitive, too, that I think there must be a nest + group_by + apply way to make this more efficient without having to use filter each time, but I rarely manipulate data this way to think of how I would do this.


Solution

  • Tim G, in the comments, points out that there is an easier data structure now that can be used for the sunburst plot.

    https://github.com/JohnCoene/echarts4r/issues/207#issuecomment-718524703

    df <- data.frame(parents = c("","earth", "earth", "mars", "mars", "land", "land", "ocean", "ocean", "fish", "fish", "Everything", "Everything", "Everything"),
                     labels = c("Everything", "land", "ocean", "valley", "crater", "forest", "river", "kelp", "fish", "shark", "tuna", "venus","earth", "mars"),
                     value = c(0, 30, 40, 10, 10, 20, 10, 20, 20, 8, 12, 10, 70, 20))
    
    # create a tree object
    universe <- data.tree::FromDataFrameNetwork(df)
    
    # use it in echarts4r
    universe %>% 
      e_charts() %>% 
      e_sunburst()