rdataframegroupingvariable-assignmentclustering-key

Create and assign groups based on overlapping/nonoverlapping values between two columns


I have a data frame which includes min and max columns:

df <- data.frame(min=c(2, 4, 3, 3, 2, 6),
                 max=c(2.9, 5.9, 3.9, 4.9, 7.9, 7.9))

I am interested in creating and assigning groups based on overlap/non-overlap between the two columns, where rows are given a new letter if they overlap with no previous rows, and then that letter is given to all rows which overlap with that row.

For example, for the first two rows, there is no overlap, so they get different letters ("a" and "b" respectively). But row 1 overlaps with row 5, so row 5 gets the letter a too. If a row is completely inside another it would receive the same letter code.

For the data frame above, the desired output would be the group column:

  min max group
1   2 2.9     a
2   4 5.9     b
3   3 3.9     c
4   3 4.9    bc
5   2 7.9  abcd
6   6 7.9     d

Essentially this is similar to using letter codes to denote statistical significance (in the spirit of "compact letter display" cld::multcomp) but these are max and min values. I have a large data frame and would like to automate this process with the ultimate goal of placing the letter code above a geom_errorbars in ggplot2.

Trying suggestion by @Mark works but seems to give a different set of answers when the max/min columns have more than 1 decimal place. Any idea what's happening?

structure(list(site = structure(1:9, levels = c("TR", "SF", 
"AR", "MR", "PD", "PL", "GC", "WM", "EM"), class = c("ordered", 
"factor")), mean = c(18.160173892, 17.4521769872, 12.078365989, 
20.4040729028, 25.272373546, 29.216489076, 7.1171799882, 7.3105614156, 
9.781365245), min = c(14.1, 12.6, 9.2, 15.1, 19.8, 22.3, 3.9, 
4.8, 6.2), max = c(22.3, 22.9, 15.2, 31.2, 31.3, 36.5, 11.1, 
9.8, 14.7), min2 = c(14.12, 12.62, 9.21, 15.06, 19.76, 22.26, 
3.87, 4.84, 6.24), max2 = c(22.35, 22.95, 15.22, 31.16, 31.32, 
36.49, 11.11, 9.84, 14.7), overlap = c("a", "ab", "ac", "ab", 
"ab", "b", "c", "c", "ac"), overlap2 = c("a", "a", "ab", "a", 
"a", "a", "b", "b", "ab")), row.names = c(1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L), class = "data.frame")

It is much easier to see with the plots.

ggplot(data=testdf2, aes(x = site, y = mean)) +
  geom_errorbar(aes(ymin=min, ymax=max), size=1, width=0.3) +
  geom_col(alpha=0.5) +
  ylab("One-decimal min max") +
  geom_text(aes(label = overlap, y = max), vjust = -.5) +
  theme_bw()

ggplot(data=testdf2, aes(x = site, y = mean)) +
  geom_errorbar(aes(ymin=min2, ymax=max2), size=1, width=0.3) +
  geom_col(alpha=0.5) +
  ylab("Two-decimal min max") +
  geom_text(aes(label = overlap2, y = max2), vjust = -.5) +
  theme_bw()

Solution

  • # make an empty column for the overlaps
    df$overlap <- ""
    
    # create letter index
    letter <- 1
    
    while (any(df$overlap == "")) {
        # get the first row that has an NA overlap
        row <- which(df$overlap == "")[1]
    
        # if the ranges overlap, then add the letter to the overlap column, otherwise add nothing
        new_vals <- if_else(df$min < df$max[row] & df$max > df$min[row], letters[letter], "")
        
        # increment letter index
        letter = letter + 1
    
        # add the new values to the overlap column
        df$overlap <- paste0(df$overlap, new_vals)
    }