Tags: r, ggplot2, line, point, violin-plot

integrate two dataframes in ggplot2


I would like to add some points to a geom_violin in ggplot2. Below is an overview of how the figure is structured: [image]

Essentially, each of the eight populations should get one extra value, plotted as a single point that happens to sit well above the violin shapes; in theory, it represents a cap count for that population. I would also like to connect these points with a line, if possible.

Here is the code I've been working with:

library(dplyr)
library(readxl)
library(tibble)
library(ggplot2)
library(hrbrthemes)
library(introdataviz)

### IMPORT THE DATASET
variants_dist <- read_excel("path/to/.xlsm", 11)
cap <- read_excel("path/to.xlsm", 12)


### FEATURES WRANGLING TO GET THE RIGHT FORMAT
variants_dist <- variants_dist %>%
  mutate(population_ID=factor(population_ID, levels=c("AFR", "EUR", "MENA", "SAS", "CEA", "SIB", "OCE", "AME")))
pop_sort <- variants_dist %>% arrange(population_ID)

pop_sort <- pop_sort %>%
  mutate(variant_type=factor(variant_type, levels=c("SNPs", "INDELs")))
variant_sort <- pop_sort %>% arrange(variant_type)

df_var <- variant_sort %>% group_by(population_ID) %>% summarise(num=n())


### PLOT THE DATA
violin_variants <- variant_sort %>%
  left_join(df_var) %>%
  mutate(pop_count = paste0(population_ID, "\n", "n=", num/2)) %>%
  ggplot(aes(x=forcats::fct_inorder(pop_count), y=count, fill=population_ID)) +
  geom_violin(position="dodge", trim=FALSE) +
  geom_boxplot(width=0.07, color="black", alpha=0.6) +
  scale_fill_manual(values=c(EUR="dodgerblue2", MENA="mediumvioletred", SIB="darkkhaki", 
                             CEA="firebrick2", AFR="olivedrab2", OCE="powderblue", 
                             SAS="darksalmon", AME="plum2")) +
  theme_bw() +
  theme(legend.position="none") +
  xlab("") + ylab("")

violin_variants +
  facet_wrap(~variant_type, scales="free_y", ncol=1, switch="y")

Although I've done this in more trivial cases, in this example the operations applied to the first data frame (variants_dist) have made it harder than I thought, and a few attempts led to wonky outputs... To reproduce, here are the dput() outputs of variants_dist and cap.

variants_dist

structure(list(samples = c("abh100 - number of indels:", "abh100 - number of SNPs:", 
"abh107 - number of indels:", "abh107 - number of SNPs:", "ALB212 - number of indels:", 
"ALB212 - number of SNPs:", "Ale14 - number of indels:", "Ale14 - number of SNPs:", 
"Ale20 - number of indels:", "Ale20 - number of SNPs:", "Ale22 - number of indels:", 
"Ale22 - number of SNPs:", "Ale32 - number of indels:", "Ale32 - number of SNPs:", 
"altai363p - number of indels:", "altai363p - number of SNPs:", 
"armenia293 - number of indels:", "armenia293 - number of SNPs:", 
"Armenian222 - number of indels:", "Armenian222 - number of SNPs:", 
"Ayodo_430C - number of indels:", "Ayodo_430C - number of SNPs:", 
"Ayodo_502C - number of indels:", "Ayodo_502C - number of SNPs:", 
"Ayodo_81S - number of indels:", "Ayodo_81S - number of SNPs:", 
"B11 - number of indels:", "B11 - number of SNPs:", "B17 - number of indels:", 
"B17 - number of SNPs:", "Bishkek28439 - number of indels:", 
"Bishkek28439 - number of SNPs:", "Bishkek28440 - number of indels:", 
"Bishkek28440 - number of SNPs:", "Bu16 - number of indels:", 
"Bu16 - number of SNPs:", "Bu5 - number of indels:", "Bu5 - number of SNPs:", 
"BulgarianB4 - number of indels:", "BulgarianB4 - number of SNPs:", 
"BulgarianC1 - number of indels:", "BulgarianC1 - number of SNPs:", 
"ch113 - number of indels:", "ch113 - number of SNPs:", "CHI-007 - number of indels:", 
"CHI-007 - number of SNPs:", "CHI-034 - number of indels:", "CHI-034 - number of SNPs:", 
"DNK05 - number of indels:", "DNK05 - number of SNPs:", "DNK07 - number of indels:", 
"DNK07 - number of SNPs:", "DNK11 - number of indels:", "DNK11 - number of SNPs:", 
"Dus16 - number of indels:", "Dus16 - number of SNPs:", "Dus22 - number of indels:", 
"Dus22 - number of SNPs:", "Esk29 - number of indels:", "Esk29 - number of SNPs:", 
"Est375 - number of indels:", "Est375 - number of SNPs:", "Est400 - number of indels:", 
"Est400 - number of SNPs:", "HG00126 - number of indels:", "HG00126 - number of SNPs:", 
"HG00128 - number of indels:", "HG00128 - number of SNPs:", "HG00174 - number of indels:", 
"HG00174 - number of SNPs:", "HG00190 - number of indels:", "HG00190 - number of SNPs:", 
"HG00360 - number of indels:", "HG00360 - number of SNPs:", "HG01503 - number of indels:", 
"HG01503 - number of SNPs:", "HG01504 - number of indels:", "HG01504 - number of SNPs:", 
"HG01600 - number of indels:", "HG01600 - number of SNPs:", "HG01846 - number of indels:", 
"HG01846 - number of SNPs:", "HG02464 - number of indels:", "HG02464 - number of SNPs:", 
"HG02494 - number of indels:", "HG02494 - number of SNPs:", "HG02574 - number of indels:", 
"HG02574 - number of SNPs:", "HG02724 - number of indels:", "HG02724 - number of SNPs:", 
"HG02783 - number of indels:", "HG02783 - number of SNPs:", "HG02790 - number of indels:", 
"HG02790 - number of SNPs:", "HG02943 - number of indels:", "HG02943 - number of SNPs:", 
"HG03006 - number of indels:", "HG03006 - number of SNPs:", "HG03007 - number of indels:", 
"HG03007 - number of SNPs:", "HG03078 - number of indels:", "HG03078 - number of SNPs:", 
"HG03085 - number of indels:", "HG03085 - number of SNPs:", "HG03100 - number of indels:", 
"HG03100 - number of SNPs:", "HGDP00019 - number of indels:", 
"HGDP00019 - number of SNPs:", "HGDP00027 - number of indels:", 
"HGDP00027 - number of SNPs:", "HGDP00058 - number of indels:", 
"HGDP00058 - number of SNPs:", "HGDP00090 - number of indels:", 
"HGDP00090 - number of SNPs:", "HGDP00124 - number of indels:", 
"HGDP00124 - number of SNPs:", "HGDP00125 - number of indels:", 
"HGDP00125 - number of SNPs:", "HGDP00157 - number of indels:", 
"HGDP00157 - number of SNPs:", "HGDP00160 - number of indels:", 
"HGDP00160 - number of SNPs:", "HGDP00195 - number of indels:", 
"HGDP00195 - number of SNPs:", "HGDP00208 - number of indels:", 
"HGDP00208 - number of SNPs:", "HGDP00216 - number of indels:", 
"HGDP00216 - number of SNPs:", "HGDP00232 - number of indels:", 
"HGDP00232 - number of SNPs:", "HGDP00286 - number of indels:", 
"HGDP00286 - number of SNPs:", "HGDP00328 - number of indels:", 
"HGDP00328 - number of SNPs:", "HGDP00338 - number of indels:", 
"HGDP00338 - number of SNPs:", "HGDP00428 - number of indels:", 
"HGDP00428 - number of SNPs:", "HGDP00449 - number of indels:", 
"HGDP00449 - number of SNPs:", "HGDP00457 - number of indels:", 
"HGDP00457 - number of SNPs:", "HGDP00461 - number of indels:", 
"HGDP00461 - number of SNPs:", "HGDP00474 - number of indels:", 
"HGDP00474 - number of SNPs:", "HGDP00476 - number of indels:", 
"HGDP00476 - number of SNPs:", "HGDP00526 - number of indels:", 
"HGDP00526 - number of SNPs:", "HGDP00530 - number of indels:", 
"HGDP00530 - number of SNPs:", "HGDP00533 - number of indels:", 
"HGDP00533 - number of SNPs:", "HGDP00540 - number of indels:", 
"HGDP00540 - number of SNPs:", "HGDP00541 - number of indels:", 
"HGDP00541 - number of SNPs:", "HGDP00543 - number of indels:", 
"HGDP00543 - number of SNPs:", "HGDP00545 - number of indels:", 
"HGDP00545 - number of SNPs:", "HGDP00546 - number of indels:", 
"HGDP00546 - number of SNPs:", "HGDP00547 - number of indels:", 
"HGDP00547 - number of SNPs:", "HGDP00548 - number of indels:", 
"HGDP00548 - number of SNPs:", "HGDP00549 - number of indels:", 
"HGDP00549 - number of SNPs:", "HGDP00550 - number of indels:", 
"HGDP00550 - number of SNPs:", "HGDP00551 - number of indels:", 
"HGDP00551 - number of SNPs:", "HGDP00552 - number of indels:", 
"HGDP00552 - number of SNPs:", "HGDP00553 - number of indels:", 
"HGDP00553 - number of SNPs:", "HGDP00554 - number of indels:", 
"HGDP00554 - number of SNPs:", "HGDP00555 - number of indels:", 
"HGDP00555 - number of SNPs:", "HGDP00556 - number of indels:", 
"HGDP00556 - number of SNPs:", "HGDP00569 - number of indels:", 
"HGDP00569 - number of SNPs:", "HGDP00597 - number of indels:", 
"HGDP00597 - number of SNPs:", "HGDP00616 - number of indels:", 
"HGDP00616 - number of SNPs:", "HGDP00650 - number of indels:", 
"HGDP00650 - number of SNPs:", "HGDP00656 - number of indels:", 
"HGDP00656 - number of SNPs:", "HGDP00660 - number of indels:", 
"HGDP00660 - number of SNPs:", "HGDP00702 - number of indels:", 
"HGDP00702 - number of SNPs:", "HGDP00706 - number of indels:", 
"HGDP00706 - number of SNPs:"), population_ID = c("MENA", "MENA", 
"MENA", "MENA", "EUR", "EUR", "SIB", "SIB", "SIB", "SIB", "SIB", 
"SIB", "SIB", "SIB", "SIB", "SIB", "EUR", "EUR", "EUR", "EUR", 
"AFR", "AFR", "AFR", "AFR", "AFR", "AFR", "SAS", "SAS", "SAS", 
"SAS", "SIB", "SIB", "SIB", "SIB", "CEA", "CEA", "CEA", "CEA", 
"EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "CEA", "CEA", "CEA", 
"CEA", "AFR", "AFR", "AFR", "AFR", "AFR", "AFR", "OCE", "OCE", 
"OCE", "OCE", "SIB", "SIB", "EUR", "EUR", "EUR", "EUR", "EUR", 
"EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", 
"EUR", "EUR", "EUR", "EUR", "CEA", "CEA", "CEA", "CEA", "AFR", 
"AFR", "SAS", "SAS", "AFR", "AFR", "SAS", "SAS", "SAS", "SAS", 
"SAS", "SAS", "AFR", "AFR", "SAS", "SAS", "SAS", "SAS", "AFR", 
"AFR", "AFR", "AFR", "AFR", "AFR", "SAS", "SAS", "SAS", "SAS", 
"SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", 
"SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", 
"SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", 
"SAS", "AFR", "AFR", "AFR", "AFR", "AFR", "AFR", "AFR", "AFR", 
"AFR", "AFR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "OCE", 
"OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", 
"OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", 
"OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", 
"OCE", "OCE", "MENA", "MENA", "MENA", "MENA", "MENA", "MENA", 
"MENA", "MENA", "OCE", "OCE", "OCE", "OCE", "AME", "AME", "AME", 
"AME"), variant_type = c("INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", 
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", 
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs"), 
    count = c(1264381, 4061425, 1199061, 4037274, 1185344, 4099792, 
    1194583, 3922119, 1100046, 4044278, 1204506, 4085199, 1209837, 
    4004536, 1204032, 3919510, 1194726, 4074306, 1260282, 4020396, 
    1472537, 4799240, 1355341, 4777319, 1336151, 4434615, 1286271, 
    3950042, 1148041, 4031256, 1286094, 4043887, 1303953, 3943315, 
    1280312, 3521944, 1309101, 3895526, 1238634, 4031151, 1169710, 
    4034995, 1284225, 4103933, 1258623, 3893207, 1268099, 3713143, 
    1408853, 4651508, 1688403, 4569640, 1358222, 4671708, 1253681, 
    3935969, 1232876, 3879205, 1236709, 3805017, 1213350, 4011139, 
    1148700, 3994066, 1219451, 3944474, 1258853, 3944496, 1144637, 
    3872653, 1094672, 3958888, 1200715, 3842474, 1200850, 3973864, 
    1250989, 4030366, 1205467, 4031375, 1196896, 4014080, 1423998, 
    4728246, 1074264, 3702772, 1433155, 4732714, 1104688, 4020231, 
    1237600, 4103650, 1099670, 4064444, 1380191, 4741223, 1288759, 
    4117500, 1107368, 4053538, 1468658, 4831580, 1402564, 4836735, 
    1300433, 4705708, 1123344, 3992568, 1148199, 4064246, 1262818, 
    3871160, 1270389, 3850498, 1190022, 3381020, 1171221, 3305255, 
    1219880, 3974099, 1259214, 4076346, 1266344, 4016714, 1229641, 
    4066308, 1147674, 3943595, 1174081, 4079386, 1139624, 4025447, 
    1188257, 3965179, 1170802, 4070033, 1176539, 4007922, 1355883, 
    4724450, 1495459, 5022358, 1511687, 5027885, 1517409, 5042050, 
    1327472, 4713283, 1186108, 3895894, 1222016, 3930073, 1711515, 
    2462540, 1185972, 3874880, 1040539, 3729550, 1245745, 3652802, 
    1237516, 3844116, 1754929, 2596873, 1217544, 3919361, 1250070, 
    3910599, 1194096, 3916337, 1201917, 3807126, 1231705, 3995963, 
    1214043, 3921974, 1172066, 3932330, 1120326, 3939117, 1227334, 
    3663188, 1154826, 3955085, 1272236, 3983159, 1225349, 4002121, 
    1255019, 4041283, 1252579, 4082266, 1201205, 4002263, 1212335, 
    3935764, 1178888, 3767359, 1189394, 3779978)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -200L))

cap

structure(list(population_ID = c("AFR", "MENA", "EUR", "SAS", 
"CEA", "SIB", "OCE", "AME", "AFR", "MENA", "EUR", "SAS", "CEA", 
"SIB", "OCE", "AME"), variant_type = c("SNPs", "SNPs", "SNPs", 
"SNPs", "SNPs", "SNPs", "SNPs", "SNPs", "INDELs", "INDELs", "INDELs", 
"INDELs", "INDELs", "INDELs", "INDELs", "INDELs"), count = c(9745226, 
4437055, 5089286, 5186901, 4838976, 4097853, 3968320, 3013334, 
3e+06, 3e+06, 3e+06, 3e+06, 3e+06, 3e+06, 3e+06, 3e+06)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -16L))

P.S. In cap, I set all INDELs values to 3M because I'm still generating the data; this should be good enough for testing anyway.


Solution

  • We should be able to join the cap data (renaming its count column to max_count so it stays separate from count) and plot it as another layer that uses one row per plotted group, provided we keep each of the variables mentioned in your global aes():

    variant_sort %>%
      left_join(df_var) %>%
      left_join(cap %>% rename(max_count = count)) %>%
      ...
      geom_line(aes(y = max_count, group = 1),
                data = . %>% distinct(pop_count, variant_type, max_count, population_ID)) +
      ...
    

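    By left-joining before we create pop_count and make it into an ordered factor, the data in this layer's subset will have the same x values and ordering as the other layers. The data = . %>% distinct(...) argument passes a function, which ggplot2 applies to the plot's data to build the layer's own data. A geom_point() layer with the same y mapping and data draws the individual points as well.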

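As a rough end-to-end sketch of the approach above, the following uses small made-up stand-ins for variant_sort and cap (three populations, random counts) rather than the real data. Two details here are assumptions on my part and not part of the snippet above: cap's key columns are converted to factors with matching levels before the join, so the joined columns keep their factor ordering, and a geom_point() layer is added next to the geom_line().

```r
library(dplyr)
library(ggplot2)

# Toy stand-ins for the question's data (all values are made up).
set.seed(1)
pops <- c("AFR", "EUR", "SAS")
variant_sort <- tibble(
  population_ID = factor(rep(pops, each = 20), levels = pops),
  variant_type  = factor(rep(c("SNPs", "INDELs"), times = 30),
                         levels = c("SNPs", "INDELs")),
  count         = rnorm(60, mean = 4e6, sd = 2e5)
) %>%
  arrange(variant_type, population_ID)

cap <- tibble(
  population_ID = rep(pops, times = 2),
  variant_type  = rep(c("SNPs", "INDELs"), each = 3),
  count         = c(5.2e6, 5.0e6, 5.1e6, 4.8e6, 4.8e6, 4.8e6)
)

# Rows per population (two rows per sample, hence num / 2 in the label).
df_var <- variant_sort %>% count(population_ID, name = "num")

# Give cap's keys the same factor levels as variant_sort before joining.
cap_keyed <- cap %>%
  rename(max_count = count) %>%
  mutate(
    population_ID = factor(population_ID, levels = levels(variant_sort$population_ID)),
    variant_type  = factor(variant_type, levels = levels(variant_sort$variant_type))
  )

plot_data <- variant_sort %>%
  left_join(df_var, by = "population_ID") %>%
  left_join(cap_keyed, by = c("population_ID", "variant_type")) %>%
  mutate(
    pop_count = paste0(population_ID, "\n", "n=", num / 2),
    # Same effect as forcats::fct_inorder(): levels in order of appearance.
    pop_count = factor(pop_count, levels = unique(pop_count))
  )

# One row per x position and facet for the cap layers.
cap_layer <- plot_data %>%
  distinct(pop_count, variant_type, max_count, population_ID)

p <- ggplot(plot_data, aes(x = pop_count, y = count, fill = population_ID)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.07, color = "black", alpha = 0.6) +
  geom_line(aes(y = max_count, group = 1), data = cap_layer) +
  geom_point(aes(y = max_count), data = cap_layer) +
  facet_wrap(~variant_type, scales = "free_y", ncol = 1) +
  theme_bw() +
  theme(legend.position = "none") +
  xlab("") + ylab("")

print(p)
```

Building cap_layer up front is equivalent to passing data = . %>% distinct(...) inside the layer; it just makes the one-row-per-group subset easier to inspect before plotting.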