I have a data frame which includes min and max columns:
df <- data.frame(min=c(2, 4, 3, 3, 2, 6),
max=c(2.9, 5.9, 3.9, 4.9, 7.9, 7.9))
I am interested in creating and assigning groups based on overlap/non-overlap between the two columns, where rows are given a new letter if they overlap with no previous rows, and then that letter is given to all rows which overlap with that row.
For example, for the first two rows, there is no overlap, so they get different letters ("a" and "b" respectively). But row 1 overlaps with row 5, so row 5 gets the letter a too. If a row is completely inside another it would receive the same letter code.
For the data frame above, the desired output would be the group column:
min max group
1 2 2.9 a
2 4 5.9 b
3 3 3.9 c
4 3 4.9 bc
5 2 7.9 abcd
6 6 7.9 d
Essentially this is similar to using letter codes to denote statistical significance (in the spirit of "compact letter display" cld::multcomp) but these are max and min values. I have a large data frame and would like to automate this process with the ultimate goal of placing the letter code above a geom_errorbars in ggplot2.
Trying suggestion by @Mark works but seems to give a different set of answers when the max/min columns have more than 1 decimal place. Any idea what's happening?
structure(list(site = structure(1:9, levels = c("TR", "SF",
"AR", "MR", "PD", "PL", "GC", "WM", "EM"), class = c("ordered",
"factor")), mean = c(18.160173892, 17.4521769872, 12.078365989,
20.4040729028, 25.272373546, 29.216489076, 7.1171799882, 7.3105614156,
9.781365245), min = c(14.1, 12.6, 9.2, 15.1, 19.8, 22.3, 3.9,
4.8, 6.2), max = c(22.3, 22.9, 15.2, 31.2, 31.3, 36.5, 11.1,
9.8, 14.7), min2 = c(14.12, 12.62, 9.21, 15.06, 19.76, 22.26,
3.87, 4.84, 6.24), max2 = c(22.35, 22.95, 15.22, 31.16, 31.32,
36.49, 11.11, 9.84, 14.7), overlap = c("a", "ab", "ac", "ab",
"ab", "b", "c", "c", "ac"), overlap2 = c("a", "a", "ab", "a",
"a", "a", "b", "b", "ab")), row.names = c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L), class = "data.frame")
It is much easier to see with the plots.
ggplot(data=testdf2, aes(x = site, y = mean)) +
geom_errorbar(aes(ymin=min, ymax=max), size=1, width=0.3) +
geom_col(alpha=0.5) +
ylab("One-decimal min max") +
geom_text(aes(label = overlap, y = max), vjust = -.5) +
theme_bw()
ggplot(data=testdf2, aes(x = site, y = mean)) +
geom_errorbar(aes(ymin=min2, ymax=max2), size=1, width=0.3) +
geom_col(alpha=0.5) +
ylab("Two-decimal min max") +
geom_text(aes(label = overlap2, y = max2), vjust = -.5) +
theme_bw()
# make an empty column for the overlaps
df$overlap <- ""
# create letter index
letter <- 1
while (any(df$overlap == "")) {
# get the first row that has an NA overlap
row <- which(df$overlap == "")[1]
# if the ranges overlap, then add the letter to the overlap column, otherwise add nothing
new_vals <- if_else(df$min < df$max[row] & df$max > df$min[row], letters[letter], "")
# increment letter index
letter = letter + 1
# add the new values to the overlap column
df$overlap <- paste0(df$overlap, new_vals)
}