
ggplot2 - where are the scales being built?


I wanted to see where factor values are turned into numeric ones. I tried to achieve this by simply adding print statements everywhere...

geom_tile2 <- function(mapping = NULL, data = NULL,
                      stat = "identity2", position = "identity",
                      ...,
                      na.rm = FALSE,
                      show.legend = NA,
                      inherit.aes = TRUE) {
  layer(
    data = data,
    mapping = mapping,
    stat = stat,
    geom = GeomTile2,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      na.rm = na.rm,
      ...
    )
  )
}

GeomTile2 <- ggproto("GeomTile2", GeomRect,
  extra_params = c("na.rm", "width", "height"),

  setup_data = function(data, params) {
    print(data)

    data$width <- data$width %||% params$width %||% resolution(data$x, FALSE)
    data$height <- data$height %||% params$height %||% resolution(data$y, FALSE)

    transform(data,
              xmin = x - width / 2,  xmax = x + width / 2,  width = NULL,
              ymin = y - height / 2, ymax = y + height / 2, height = NULL
    )
  },

  default_aes = aes(fill = "grey20", colour = NA, size = 0.1, linetype = 1,
                    alpha = NA),

  required_aes = c("x", "y"),

  draw_key = draw_key_polygon
)

and

stat_identity2 <- function(mapping = NULL, data = NULL,
                          geom = "point", position = "identity",
                          ...,
                          show.legend = NA,
                          inherit.aes = TRUE) {
  layer(
    data = data,
    mapping = mapping,
    stat = StatIdentity2,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      na.rm = FALSE,
      ...
    )
  )
}

StatIdentity2 <- ggproto("StatIdentity2", Stat,

  setup_data = function(data, params) {
    print(data)
    data
  },
  compute_layer = function(self, data, params, layout) {
    print(data)
    print("stat end")
    data
  }
)

but when I run e.g.

ggplot(data.frame(x = rep(c("y", "n"), 6), y = rep(c("y", "n"), each = 6)), 
       aes(x = x, y = y)) + 
  geom_tile2()

x and y are already numeric by the time the stat's setup_data function runs, and they stay numeric from there on. Looking through the package's GitHub repo, I can't find where this conversion from factor values to numeric positions actually happens.


Solution

  • TL;DR

    The conversion of discrete (factor) x / y values to numeric positions is done by the ggplot2:::Layout$map_position() function. The current code is in layout.r.

    Long explanation

    I usually think of the steps involved in creating a plot using ggplot2 package in two stages:

    1. Plot construction. This is when a new ggplot object (initialized via ggplot()) and all the geom_* / stat_* / facet_* / scale_* / coord_* components added to it are combined into a single ggplot object. If we write something like p <- ggplot(mpg, aes(class)) + geom_bar(), we stop here. The GitHub code for this stage is in plot-construction.r.
    2. Plot rendering. This is when the combined ggplot object is converted into an object that can be rendered (via ggplot_build()) and then into a gtable of grobs (via ggplot_gtable()). This is usually triggered by the ggplot object's print / plot methods, but we can also call ggplotGrob(), which returns the converted gtable object directly, without the printing step. The GitHub code for ggplot_build / ggplot_gtable is in plot-build.r.
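
    The two stages above can be run by hand. This is a minimal sketch using only exported ggplot2 functions and the mpg dataset that ships with the package; the object names (p, built, gt, gt2) are just for illustration, and the $data access assumes the list structure shown in the ggplot_build() source further down.

```r
library(ggplot2)

# Stage 1: plot construction. Nothing is computed or drawn yet.
p <- ggplot(mpg, aes(class)) + geom_bar()

# Stage 2a: build the plot. Stats are computed, scales are trained and mapped.
built <- ggplot_build(p)

# One data frame per layer; the discrete `class` values are now numeric positions.
str(built$data[[1]]$x)

# Stage 2b: convert the built plot into a gtable of grobs.
gt <- ggplot_gtable(built)

# ggplotGrob() runs both steps and returns the gtable without printing it.
gt2 <- ggplotGrob(p)
```

    Nothing is drawn until the gtable is printed or passed to grid::grid.draw().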

    In my experience, most of the steps we might want to tweak are in the plot rendering stage, and running debug on ggplot2:::ggplot_build.ggplot / ggplot2:::ggplot_gtable.ggplot_built is a good first step for figuring out where things happen.

    In this case, after running

    debugonce(ggplot2:::ggplot_build.ggplot)
    
    ggplot(data.frame(x = rep(c("y", "n"), 6), 
                      y = rep(c("y", "n"), each = 6)), 
           aes(x = x, y = y)) + 
      geom_tile() # no need to use the self-defined geom_tile2 here
    

    We begin to step through the function:

    > ggplot2:::ggplot_build.ggplot
    function (plot) 
    {
        plot <- plot_clone(plot)
        if (length(plot$layers) == 0) {
            plot <- plot + geom_blank()
        }
        layers <- plot$layers
        layer_data <- lapply(layers, function(y) y$layer_data(plot$data))
        scales <- plot$scales
        by_layer <- function(f) {
            out <- vector("list", length(data))
            for (i in seq_along(data)) {
                out[[i]] <- f(l = layers[[i]], d = data[[i]])
            }
            out
        }
        data <- layer_data
        data <- by_layer(function(l, d) l$setup_layer(d, plot))
        layout <- create_layout(plot$facet, plot$coordinates)
        data <- layout$setup(data, plot$data, plot$plot_env)
        data <- by_layer(function(l, d) l$compute_aesthetics(d, plot))
        data <- lapply(data, scales_transform_df, scales = scales)
        scale_x <- function() scales$get_scales("x")
        scale_y <- function() scales$get_scales("y")
        layout$train_position(data, scale_x(), scale_y())
        data <- layout$map_position(data)
        data <- by_layer(function(l, d) l$compute_statistic(d, layout))
        data <- by_layer(function(l, d) l$map_statistic(d, plot))
        scales_add_missing(plot, c("x", "y"), plot$plot_env)
        data <- by_layer(function(l, d) l$compute_geom_1(d))
        data <- by_layer(function(l, d) l$compute_position(d, layout))
        layout$reset_scales()
        layout$train_position(data, scale_x(), scale_y())
        layout$setup_panel_params()
        data <- layout$map_position(data)
        npscales <- scales$non_position_scales()
        if (npscales$n() > 0) {
            lapply(data, scales_train_df, scales = npscales)
            data <- lapply(data, scales_map_df, scales = npscales)
        }
        data <- by_layer(function(l, d) l$compute_geom_2(d))
        data <- by_layer(function(l, d) l$finish_statistics(d))
        data <- layout$finish_data(data)
        structure(list(data = data, layout = layout, plot = plot), 
            class = "ggplot_built")
    }
    

    In debug mode, we can check str(data[[i]]) after every step, to examine the data associated with layer i of the ggplot object (i = 1 in this case, since there's only 1 geom layer).

    Browse[2]> 
    debug: data <- lapply(data, scales_transform_df, scales = scales)
    Browse[2]> 
    debug: scale_x <- function() scales$get_scales("x")
    Browse[2]> str(data[[1]]) # still factor after scales_transform_df step
    'data.frame':   12 obs. of  4 variables:
     $ x    : Factor w/ 2 levels "n","y": 2 1 2 1 2 1 2 1 2 1 ...
     $ y    : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 1 1 1 1 ...
     $ PANEL: Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
     $ group: int  4 2 4 2 4 2 3 1 3 1 ...
      ..- attr(*, "n")= int 4
    
    # ... omitted
    
    debug: data <- layout$map_position(data)
    Browse[2]> 
    debug: data <- by_layer(function(l, d) l$compute_statistic(d, layout))
    Browse[2]> str(data[[1]]) # numerical after map_position step
    'data.frame':   12 obs. of  4 variables:
     $ x    : int  2 1 2 1 2 1 2 1 2 1 ...
     $ y    : int  2 2 2 2 2 2 1 1 1 1 ...
     $ PANEL: Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
     $ group: int  4 2 4 2 4 2 3 1 3 1 ...
      ..- attr(*, "n")= int 4
    

    Stat*'s setup_data is triggered by data <- by_layer(function(l, d) l$compute_statistic(d, layout)) (see ggplot2:::Layer$compute_statistic), which runs after the first map_position step. This is why, when you insert a print statement in StatIdentity2$setup_data, the data is already in numerical form.

    (And Geom*'s setup_data is triggered by data <- by_layer(function(l, d) l$compute_geom_1(d)), which happens even later.)
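
    The ordering described above can also be checked without the debugger. The sketch below is only illustrative: StatLog, GeomLog, log_env and note() are hypothetical names made up for this example, built on the exported StatIdentity and GeomPoint objects. Each setup_data method records whether x is already numeric when it is called.

```r
library(ggplot2)

# Hypothetical helper: collects one log entry per setup_data call.
log_env <- new.env()
log_env$entries <- character()

note <- function(where, data) {
  log_env$entries <- c(log_env$entries,
                       sprintf("%s: x numeric = %s", where, is.numeric(data$x)))
}

# Identity stat that only logs; the data passes through unchanged.
StatLog <- ggproto("StatLog", StatIdentity,
  setup_data = function(data, params) {
    note("stat setup_data", data)
    data
  }
)

# Point geom that only logs; the data passes through unchanged.
GeomLog <- ggproto("GeomLog", GeomPoint,
  setup_data = function(data, params) {
    note("geom setup_data", data)
    data
  }
)

df <- data.frame(x = rep(c("y", "n"), 6), y = rep(c("y", "n"), each = 6))

p <- ggplot(df, aes(x = x, y = y)) +
  layer(stat = StatLog, geom = GeomLog, position = "identity")

# Building is enough to trigger both methods; nothing needs to be drawn.
invisible(ggplot_build(p))
log_env$entries
```

    The stat entry should appear before the geom entry, and both should report x as numeric, since both run after the first map_position call.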

    After identifying map_position as the step where things happen, we can run debug mode again & step into this function to see exactly what's going on. At this point, I'm afraid I don't really know what your use case is, so I'll have to leave you to it.
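
    As a rough mental model of what the mapping does to discrete positions, here is a simplified base R sketch. It is not ggplot2's actual implementation, and the names limits and positions are my own. It assumes the scale's limits are the factor levels, and replaces each value by the index of its level:

```r
# Same x values as in the question's example data.
x <- factor(rep(c("y", "n"), 6))

# Assumption: with no explicit limits, the discrete scale uses the factor levels.
limits <- levels(x)  # "n" "y"

# Each value becomes the integer index of its level within the limits.
positions <- match(as.character(x), limits)
positions
```

    This gives the same 2 1 2 1 ... pattern that str(data[[1]]) shows for x after the map_position step.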