I would like to add some points to a geom_violin in ggplot2. Below is an overview of how the figure is structured:
Essentially, each of the eight populations should get one extra value, plotted as a single point that happens to sit well above the violin shapes; it represents a cap count for that population. I would also like to connect these points with a line, if possible.
Here is the code I've been working with:
library(dplyr)
library(readxl)
library(tibble)
library(ggplot2)
library(hrbrthemes)
library(introdataviz)
### IMPORT THE DATASET
variants_dist <- read_excel("path/to/.xlsm", 11)
cap <- read_excel("path/to.xlsm", 12)
### FEATURES WRANGLING TO GET THE RIGHT FORMAT
variants_dist <- variants_dist %>%
  mutate(population_ID=factor(population_ID, levels=c("AFR", "EUR", "MENA", "SAS", "CEA", "SIB", "OCE", "AME")))
variants_dist %>% arrange(population_ID) -> pop_sort
pop_sort <- pop_sort %>%
  mutate(variant_type=factor(variant_type, levels=c("SNPs", "INDELs")))
pop_sort %>% arrange(variant_type) -> variant_sort
df_var = variant_sort %>% group_by(population_ID) %>% summarise(num=n())
### PLOT THE DATA
violin_variants <- variant_sort %>%
  left_join(df_var) %>%
  mutate(pop_count = paste0(population_ID, "\n", "n=", num/2)) %>%
  ggplot(aes(x=forcats::fct_inorder(pop_count), y=count, fill=population_ID)) +
  geom_violin(position="dodge", trim=FALSE) +
  geom_boxplot(width=0.07, color="black", alpha=0.6) +
  scale_fill_manual(values=c(EUR="dodgerblue2", MENA="mediumvioletred", SIB="darkkhaki",
                             CEA="firebrick2", AFR="olivedrab2", OCE="powderblue",
                             SAS="darksalmon", AME="plum2")) +
  theme_bw() +
  theme(legend.position="none") +
  xlab("") + ylab("")
violin_variants +
  facet_wrap(~variant_type, scales="free_y", ncol=1, switch="y")
Although I've done this in more trivial cases, here the operations applied to the first data frame (variants_dist) have made it harder than I thought, and my few attempts led to wonky outputs. To reproduce, here are the dput() outputs of variants_dist and cap.
variants_dist
structure(list(samples = c("abh100 - number of indels:", "abh100 - number of SNPs:",
"abh107 - number of indels:", "abh107 - number of SNPs:", "ALB212 - number of indels:",
"ALB212 - number of SNPs:", "Ale14 - number of indels:", "Ale14 - number of SNPs:",
"Ale20 - number of indels:", "Ale20 - number of SNPs:", "Ale22 - number of indels:",
"Ale22 - number of SNPs:", "Ale32 - number of indels:", "Ale32 - number of SNPs:",
"altai363p - number of indels:", "altai363p - number of SNPs:",
"armenia293 - number of indels:", "armenia293 - number of SNPs:",
"Armenian222 - number of indels:", "Armenian222 - number of SNPs:",
"Ayodo_430C - number of indels:", "Ayodo_430C - number of SNPs:",
"Ayodo_502C - number of indels:", "Ayodo_502C - number of SNPs:",
"Ayodo_81S - number of indels:", "Ayodo_81S - number of SNPs:",
"B11 - number of indels:", "B11 - number of SNPs:", "B17 - number of indels:",
"B17 - number of SNPs:", "Bishkek28439 - number of indels:",
"Bishkek28439 - number of SNPs:", "Bishkek28440 - number of indels:",
"Bishkek28440 - number of SNPs:", "Bu16 - number of indels:",
"Bu16 - number of SNPs:", "Bu5 - number of indels:", "Bu5 - number of SNPs:",
"BulgarianB4 - number of indels:", "BulgarianB4 - number of SNPs:",
"BulgarianC1 - number of indels:", "BulgarianC1 - number of SNPs:",
"ch113 - number of indels:", "ch113 - number of SNPs:", "CHI-007 - number of indels:",
"CHI-007 - number of SNPs:", "CHI-034 - number of indels:", "CHI-034 - number of SNPs:",
"DNK05 - number of indels:", "DNK05 - number of SNPs:", "DNK07 - number of indels:",
"DNK07 - number of SNPs:", "DNK11 - number of indels:", "DNK11 - number of SNPs:",
"Dus16 - number of indels:", "Dus16 - number of SNPs:", "Dus22 - number of indels:",
"Dus22 - number of SNPs:", "Esk29 - number of indels:", "Esk29 - number of SNPs:",
"Est375 - number of indels:", "Est375 - number of SNPs:", "Est400 - number of indels:",
"Est400 - number of SNPs:", "HG00126 - number of indels:", "HG00126 - number of SNPs:",
"HG00128 - number of indels:", "HG00128 - number of SNPs:", "HG00174 - number of indels:",
"HG00174 - number of SNPs:", "HG00190 - number of indels:", "HG00190 - number of SNPs:",
"HG00360 - number of indels:", "HG00360 - number of SNPs:", "HG01503 - number of indels:",
"HG01503 - number of SNPs:", "HG01504 - number of indels:", "HG01504 - number of SNPs:",
"HG01600 - number of indels:", "HG01600 - number of SNPs:", "HG01846 - number of indels:",
"HG01846 - number of SNPs:", "HG02464 - number of indels:", "HG02464 - number of SNPs:",
"HG02494 - number of indels:", "HG02494 - number of SNPs:", "HG02574 - number of indels:",
"HG02574 - number of SNPs:", "HG02724 - number of indels:", "HG02724 - number of SNPs:",
"HG02783 - number of indels:", "HG02783 - number of SNPs:", "HG02790 - number of indels:",
"HG02790 - number of SNPs:", "HG02943 - number of indels:", "HG02943 - number of SNPs:",
"HG03006 - number of indels:", "HG03006 - number of SNPs:", "HG03007 - number of indels:",
"HG03007 - number of SNPs:", "HG03078 - number of indels:", "HG03078 - number of SNPs:",
"HG03085 - number of indels:", "HG03085 - number of SNPs:", "HG03100 - number of indels:",
"HG03100 - number of SNPs:", "HGDP00019 - number of indels:",
"HGDP00019 - number of SNPs:", "HGDP00027 - number of indels:",
"HGDP00027 - number of SNPs:", "HGDP00058 - number of indels:",
"HGDP00058 - number of SNPs:", "HGDP00090 - number of indels:",
"HGDP00090 - number of SNPs:", "HGDP00124 - number of indels:",
"HGDP00124 - number of SNPs:", "HGDP00125 - number of indels:",
"HGDP00125 - number of SNPs:", "HGDP00157 - number of indels:",
"HGDP00157 - number of SNPs:", "HGDP00160 - number of indels:",
"HGDP00160 - number of SNPs:", "HGDP00195 - number of indels:",
"HGDP00195 - number of SNPs:", "HGDP00208 - number of indels:",
"HGDP00208 - number of SNPs:", "HGDP00216 - number of indels:",
"HGDP00216 - number of SNPs:", "HGDP00232 - number of indels:",
"HGDP00232 - number of SNPs:", "HGDP00286 - number of indels:",
"HGDP00286 - number of SNPs:", "HGDP00328 - number of indels:",
"HGDP00328 - number of SNPs:", "HGDP00338 - number of indels:",
"HGDP00338 - number of SNPs:", "HGDP00428 - number of indels:",
"HGDP00428 - number of SNPs:", "HGDP00449 - number of indels:",
"HGDP00449 - number of SNPs:", "HGDP00457 - number of indels:",
"HGDP00457 - number of SNPs:", "HGDP00461 - number of indels:",
"HGDP00461 - number of SNPs:", "HGDP00474 - number of indels:",
"HGDP00474 - number of SNPs:", "HGDP00476 - number of indels:",
"HGDP00476 - number of SNPs:", "HGDP00526 - number of indels:",
"HGDP00526 - number of SNPs:", "HGDP00530 - number of indels:",
"HGDP00530 - number of SNPs:", "HGDP00533 - number of indels:",
"HGDP00533 - number of SNPs:", "HGDP00540 - number of indels:",
"HGDP00540 - number of SNPs:", "HGDP00541 - number of indels:",
"HGDP00541 - number of SNPs:", "HGDP00543 - number of indels:",
"HGDP00543 - number of SNPs:", "HGDP00545 - number of indels:",
"HGDP00545 - number of SNPs:", "HGDP00546 - number of indels:",
"HGDP00546 - number of SNPs:", "HGDP00547 - number of indels:",
"HGDP00547 - number of SNPs:", "HGDP00548 - number of indels:",
"HGDP00548 - number of SNPs:", "HGDP00549 - number of indels:",
"HGDP00549 - number of SNPs:", "HGDP00550 - number of indels:",
"HGDP00550 - number of SNPs:", "HGDP00551 - number of indels:",
"HGDP00551 - number of SNPs:", "HGDP00552 - number of indels:",
"HGDP00552 - number of SNPs:", "HGDP00553 - number of indels:",
"HGDP00553 - number of SNPs:", "HGDP00554 - number of indels:",
"HGDP00554 - number of SNPs:", "HGDP00555 - number of indels:",
"HGDP00555 - number of SNPs:", "HGDP00556 - number of indels:",
"HGDP00556 - number of SNPs:", "HGDP00569 - number of indels:",
"HGDP00569 - number of SNPs:", "HGDP00597 - number of indels:",
"HGDP00597 - number of SNPs:", "HGDP00616 - number of indels:",
"HGDP00616 - number of SNPs:", "HGDP00650 - number of indels:",
"HGDP00650 - number of SNPs:", "HGDP00656 - number of indels:",
"HGDP00656 - number of SNPs:", "HGDP00660 - number of indels:",
"HGDP00660 - number of SNPs:", "HGDP00702 - number of indels:",
"HGDP00702 - number of SNPs:", "HGDP00706 - number of indels:",
"HGDP00706 - number of SNPs:"), population_ID = c("MENA", "MENA",
"MENA", "MENA", "EUR", "EUR", "SIB", "SIB", "SIB", "SIB", "SIB",
"SIB", "SIB", "SIB", "SIB", "SIB", "EUR", "EUR", "EUR", "EUR",
"AFR", "AFR", "AFR", "AFR", "AFR", "AFR", "SAS", "SAS", "SAS",
"SAS", "SIB", "SIB", "SIB", "SIB", "CEA", "CEA", "CEA", "CEA",
"EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "CEA", "CEA", "CEA",
"CEA", "AFR", "AFR", "AFR", "AFR", "AFR", "AFR", "OCE", "OCE",
"OCE", "OCE", "SIB", "SIB", "EUR", "EUR", "EUR", "EUR", "EUR",
"EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR",
"EUR", "EUR", "EUR", "EUR", "CEA", "CEA", "CEA", "CEA", "AFR",
"AFR", "SAS", "SAS", "AFR", "AFR", "SAS", "SAS", "SAS", "SAS",
"SAS", "SAS", "AFR", "AFR", "SAS", "SAS", "SAS", "SAS", "AFR",
"AFR", "AFR", "AFR", "AFR", "AFR", "SAS", "SAS", "SAS", "SAS",
"SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS",
"SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS",
"SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS",
"SAS", "AFR", "AFR", "AFR", "AFR", "AFR", "AFR", "AFR", "AFR",
"AFR", "AFR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "OCE",
"OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE",
"OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE",
"OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE", "OCE",
"OCE", "OCE", "MENA", "MENA", "MENA", "MENA", "MENA", "MENA",
"MENA", "MENA", "OCE", "OCE", "OCE", "OCE", "AME", "AME", "AME",
"AME"), variant_type = c("INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs",
"INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs",
"SNPs", "INDELs", "SNPs", "INDELs", "SNPs", "INDELs", "SNPs"),
count = c(1264381, 4061425, 1199061, 4037274, 1185344, 4099792,
1194583, 3922119, 1100046, 4044278, 1204506, 4085199, 1209837,
4004536, 1204032, 3919510, 1194726, 4074306, 1260282, 4020396,
1472537, 4799240, 1355341, 4777319, 1336151, 4434615, 1286271,
3950042, 1148041, 4031256, 1286094, 4043887, 1303953, 3943315,
1280312, 3521944, 1309101, 3895526, 1238634, 4031151, 1169710,
4034995, 1284225, 4103933, 1258623, 3893207, 1268099, 3713143,
1408853, 4651508, 1688403, 4569640, 1358222, 4671708, 1253681,
3935969, 1232876, 3879205, 1236709, 3805017, 1213350, 4011139,
1148700, 3994066, 1219451, 3944474, 1258853, 3944496, 1144637,
3872653, 1094672, 3958888, 1200715, 3842474, 1200850, 3973864,
1250989, 4030366, 1205467, 4031375, 1196896, 4014080, 1423998,
4728246, 1074264, 3702772, 1433155, 4732714, 1104688, 4020231,
1237600, 4103650, 1099670, 4064444, 1380191, 4741223, 1288759,
4117500, 1107368, 4053538, 1468658, 4831580, 1402564, 4836735,
1300433, 4705708, 1123344, 3992568, 1148199, 4064246, 1262818,
3871160, 1270389, 3850498, 1190022, 3381020, 1171221, 3305255,
1219880, 3974099, 1259214, 4076346, 1266344, 4016714, 1229641,
4066308, 1147674, 3943595, 1174081, 4079386, 1139624, 4025447,
1188257, 3965179, 1170802, 4070033, 1176539, 4007922, 1355883,
4724450, 1495459, 5022358, 1511687, 5027885, 1517409, 5042050,
1327472, 4713283, 1186108, 3895894, 1222016, 3930073, 1711515,
2462540, 1185972, 3874880, 1040539, 3729550, 1245745, 3652802,
1237516, 3844116, 1754929, 2596873, 1217544, 3919361, 1250070,
3910599, 1194096, 3916337, 1201917, 3807126, 1231705, 3995963,
1214043, 3921974, 1172066, 3932330, 1120326, 3939117, 1227334,
3663188, 1154826, 3955085, 1272236, 3983159, 1225349, 4002121,
1255019, 4041283, 1252579, 4082266, 1201205, 4002263, 1212335,
3935764, 1178888, 3767359, 1189394, 3779978)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -200L))
cap
structure(list(population_ID = c("AFR", "MENA", "EUR", "SAS",
"CEA", "SIB", "OCE", "AME", "AFR", "MENA", "EUR", "SAS", "CEA",
"SIB", "OCE", "AME"), variant_type = c("SNPs", "SNPs", "SNPs",
"SNPs", "SNPs", "SNPs", "SNPs", "SNPs", "INDELs", "INDELs", "INDELs",
"INDELs", "INDELs", "INDELs", "INDELs", "INDELs"), count = c(9745226,
4437055, 5089286, 5186901, 4838976, 4097853, 3968320, 3013334,
3e+06, 3e+06, 3e+06, 3e+06, 3e+06, 3e+06, 3e+06, 3e+06)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -16L))
P.S. In cap I set all INDELs values to 3M because I'm still generating the data; this should be good enough for testing anyway.
We should be able to join the cap data (renaming its count to max_count so it stays separate from count) and plot it as another layer that uses one row per plotted group, provided we keep each of the variables mentioned in your global aes():
variant_sort %>%
  left_join(df_var) %>%
  left_join(cap %>% rename(max_count = count)) %>%
  ...
  geom_line(aes(y = max_count, group = 1),
            data = . %>% distinct(pop_count, variant_type, max_count, population_ID)) +
  ...
By left-joining before we create pop_count and turn it into a factor whose levels follow row order (via fct_inorder()), the subset of data used by this layer has the same x values and ordering as the other layers.
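For reference, here is a minimal end-to-end sketch of that idea which also draws the points. It runs on a tiny made-up data set (two populations, invented counts) instead of the real dput() output, so the numbers are purely illustrative. The names pop_levels, var_levels, cap_fct, plot_data and cap_layer are my own, not from the question. Two deliberate deviations: I convert cap's key columns to factors with the same levels before joining, because as far as I know recent dplyr versions coerce mismatched factor/character join keys to character, which would change the facet order; and I use strip.position = "left" in place of switch = "y".

```r
library(dplyr)
library(ggplot2)

# Toy stand-ins for the question's objects (made-up values, illustration only).
pop_levels <- c("AFR", "EUR")
var_levels <- c("SNPs", "INDELs")

variant_sort <- tibble(
  population_ID = factor(rep(pop_levels, each = 3, times = 2), levels = pop_levels),
  variant_type  = factor(rep(var_levels, each = 6), levels = var_levels),
  count         = c(47, 48, 46, 40, 41, 39, 14, 15, 13, 12, 11, 12) * 1e5
)

cap <- tibble(
  population_ID = rep(pop_levels, times = 2),
  variant_type  = rep(var_levels, each = 2),
  count         = c(97, 51, 30, 30) * 1e5
)

# Rows per population (two rows per sample, hence num / 2 in the label).
df_var <- variant_sort %>% count(population_ID, name = "num")

# Match cap's key columns to the factor levels used in variant_sort,
# and rename count so it does not clash with the per-sample count.
cap_fct <- cap %>%
  mutate(
    population_ID = factor(population_ID, levels = pop_levels),
    variant_type  = factor(variant_type, levels = var_levels)
  ) %>%
  rename(max_count = count)

# Join first, then build the x-axis label as a factor in row order.
plot_data <- variant_sort %>%
  left_join(df_var, by = "population_ID") %>%
  left_join(cap_fct, by = c("population_ID", "variant_type")) %>%
  mutate(pop_count = forcats::fct_inorder(paste0(population_ID, "\nn=", num / 2)))

# One row per population and variant type for the cap layers.
cap_layer <- plot_data %>%
  distinct(pop_count, variant_type, population_ID, max_count)

p <- ggplot(plot_data, aes(x = pop_count, y = count, fill = population_ID)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.07, colour = "black", alpha = 0.6) +
  geom_line(aes(y = max_count, group = 1), data = cap_layer) +
  geom_point(aes(y = max_count), data = cap_layer, shape = 21, size = 3) +
  facet_wrap(~variant_type, scales = "free_y", ncol = 1, strip.position = "left") +
  theme_bw() +
  theme(legend.position = "none")

p
```

With the real data, the same steps should apply after swapping in your own variant_sort, df_var and cap and adding back your scale_fill_manual() colours; group = 1 is what makes geom_line() connect the caps across populations within each facet.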