Tags: r, ggplot2, facet-wrap, geom-point

R ggplot2 geom_point() legend labels: is there a labeller-like solution for "translating" factor values?


Context:

  1. I'm using R in RStudio to build a basic ggplot() from a .csv file (meu_primeiro_csv);
  2. this .csv file was imported with read_csv(), and I set the type of each column manually through col_types = list(), so the types are correct;
  3. one of these columns, TP_DEPENDENCIA, is a categorical variable that is central to the plot: it is mapped to both color and shape in geom_point(), and it is also the facet_wrap() variable;
  4. TP_DEPENDENCIA is of type col_factor() and has four possible values: 1, 2, 3, or 4. Each code stands for a type of school: 1 = Federal, 2 = Estadual, 3 = Municipal, 4 = Privada. There are no other types of school and no NAs, so the column is tidy;
  5. I managed to translate the factor codes into their text meanings in facet_wrap() by using the labeller parameter, as suggested in tamtam's answer (Oct 29, 2020) to "Changing facet labels in face_wrap() ggplot2";
  6. however, geom_point() has no labeller parameter, so the legend to the right of the graph shows "4, 2, 3, 1" instead of the names these codes represent.

Questions:

  1. How can I "translate" the number-coded values of the <fct> column TP_DEPENDENCIA into their text meanings in the graph's legend?
  2. Can I do this with a presentation-only change, as I did in facet_wrap(), without changing the values stored in the .csv file?

The problematic graph:

[Image: My ggplot]

The code that generated the above graph:

library(ggplot2)
library(ggthemes)  # provides scale_color_colorblind()

ggplot(
  data    = meu_primeiro_csv,
  mapping = aes(y = QT_SALAS_UTILIZADAS, x = QT_MAT_BAS)) +
   geom_point(mapping = aes(color = TP_DEPENDENCIA, shape = TP_DEPENDENCIA)) +
   facet_wrap(~TP_DEPENDENCIA, labeller = labeller(TP_DEPENDENCIA = c(`1` = "Federal", `2` = "Estadual", `3` = "Municipal", `4` = "Privada"))) +
   labs(
     title    = "Educação básica: total de alunos × total de salas",
     subtitle = "Totais por tipo de escola: municipal, estadual, federal, ou privada",
     y        = "Salas utilizadas pela escola",
     x        = "Matrículas na educação básica",
     color    = "Tipo de escola",
     shape    = "Tipo de escola"
   ) +
   geom_smooth(method = "lm") +
   scale_color_colorblind()

CSV for reproducible example:


Solution

  • Use case_when to create a new variable in your dataset with the labels for TP_DEPENDENCIA. Use the new variable instead of TP_DEPENDENCIA, and you'll get the labels in the legend.

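    library(dplyr)     # mutate(), case_when(), %>%
    library(ggplot2)
    library(ggthemes)  # scale_color_colorblind()

    # Add a text column with the school type that each code stands for
    meu_primeiro_csv %>%
      mutate(tipo_de_escola = case_when(TP_DEPENDENCIA == 1 ~ "Federal",
                                        TP_DEPENDENCIA == 2 ~ "Estadual",
                                        TP_DEPENDENCIA == 3 ~ "Municipal",
                                        TP_DEPENDENCIA == 4 ~ "Privada"
                                        )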
      ) %>% 
      ggplot(
        mapping = aes(y = QT_SALAS_UTILIZADAS, x = QT_MAT_BAS)) +
      geom_point(mapping = aes(color = tipo_de_escola, shape = tipo_de_escola)) +
      facet_wrap(~tipo_de_escola) +
      labs(
        title    = "Educação básica: total de alunos × total de salas",
        subtitle = "Totais por tipo de escola: municipal, estadual, federal, ou privada",
        y        = "Salas utilizadas pela escola",
        x        = "Matrículas na educação básica",
        color    = "Tipo de escola",
        shape    = "Tipo de escola"
      ) +
      geom_smooth(method = "lm") +
      scale_color_colorblind()
    

    However, you should consider showing the variable TP_DEPENDENCIA with only one aesthetic instead of three. Try making the plot where you are only using TP_DEPENDENCIA to either facet_wrap, or colour, or shape. You'll have the same amount of information, and your graph will be simpler. Choose the one you think works best.
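
On the second question (a presentation-only change), a possible alternative is to leave the data untouched and pass a named vector to the `labels` argument of the colour and shape scales, much like the lookup given to labeller. The sketch below is only an illustration: `dados` is made-up toy data standing in for meu_primeiro_csv, and `rotulos` is a hypothetical helper vector. It uses ggplot2's default discrete scales. scale_color_colorblind() from ggthemes should accept the same `labels` argument, since it forwards its arguments to ggplot2's discrete scale, but that is worth checking against your installed version.

```r
library(ggplot2)

# Toy data standing in for meu_primeiro_csv (values are made up for illustration)
dados <- data.frame(
  QT_MAT_BAS          = c(120, 300, 80, 450, 200, 60, 510, 150),
  QT_SALAS_UTILIZADAS = c(6, 12, 4, 18, 9, 3, 20, 7),
  TP_DEPENDENCIA      = factor(c(1, 2, 3, 4, 1, 2, 3, 4))
)

# Named vector: names are the stored codes, values are the text to display
rotulos <- c(`1` = "Federal", `2` = "Estadual", `3` = "Municipal", `4` = "Privada")

p <- ggplot(dados, aes(x = QT_MAT_BAS, y = QT_SALAS_UTILIZADAS)) +
  geom_point(aes(color = TP_DEPENDENCIA, shape = TP_DEPENDENCIA)) +
  # Facet strips use the same lookup through labeller()
  facet_wrap(~TP_DEPENDENCIA, labeller = labeller(TP_DEPENDENCIA = rotulos)) +
  # Identical labels and titles on both scales keep the two legends merged
  scale_color_discrete(labels = rotulos) +
  scale_shape_discrete(labels = rotulos) +
  labs(color = "Tipo de escola", shape = "Tipo de escola")
```

Calling print(p) draws the plot. The factor column keeps its original codes and level order; only the text shown in the legend and the facet strips changes.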