r  for-loop  dplyr

Annotating grouped blocks of identical values within an R data frame, without a for loop


I have some data to plot which, for purposes of organization and annotation, is broken into several blocks. The blocks are defined by a combination of variables, including the one I'm calling group in my code. I want to plot the first batch of group "A" differently from the second batch of "A", and the same for "B", "C", etc. To do that, I want to mark each batch with a unique identifier, which I'm calling plot_group here: march along the length of group and increment plot_group by 1 every time the value changes from one group to another.

I've figured out how to do this with a for loop (below), but it looks ugly and I'd rather use a vectorized function. I can't for the life of me figure out how, though, even with things like seq_along and lag. The problem seems to be that a function can't refer to its own output on the fly.

There must be a dumb and obvious thing I'm missing, since this is hardly a sophisticated problem. Does anyone have a recommendation?

# vector of groups - repeated twice
group <- c(rep(c(rep('A', 2), rep('B', 4), rep('C', 3)),2))

# run through the "group" variable, incrementing plot_group by one every time a new group is encountered
for (i in seq_along(group)) {
  # if we are at the beginning, initiate the first group as "1"
  if(i==1) plot_group <- 1
  # otherwise, check if we are at a new group - if so, increment plot_group by 1
  else {
    if (group[i] != group[i-1]) plot_group <- c(plot_group, plot_group[i-1]+1)
    # if not, then just return the current plot_group variable
    else plot_group <- c(plot_group, plot_group[i-1])
  }
}

library(tibble)  # tibble() is also re-exported by dplyr
tibble(group=group, plot_group=plot_group)

# returns what I want:
# # A tibble: 18 × 2
#    group plot_group
#    <chr>      <dbl>
#  1 A              1
#  2 A              1
#  3 B              2
#  4 B              2
#  5 B              2
#  6 B              2
#  7 C              3
#  8 C              3
#  9 C              3
# 10 A              4
# 11 A              4
# 12 B              5
# 13 B              5
# 14 B              5
# 15 B              5
# 16 C              6
# 17 C              6
# 18 C              6

rm(plot_group)

# do the same as the above, but with sapply
plot_group <- sapply(seq_along(group), function(i) {
  if(i==1) return(1)
  else {
    if (group[i] != group[i-1]) return(plot_group[i-1] + 1)
    else return(plot_group[i-1])
  }
})
# returns "Error in FUN(X[[i]], ...) : object 'plot_group' not found"

Solution

  • Using base::rle:

    group <- c(rep(c(rep('A', 2), rep('B', 4), rep('C', 3)),2))
    
    data.frame(group, 
               plot_group = rle(group)$lengths |> 
                 (\(.x) rep(seq_along(.x), .x))())
    
    #>    group plot_group
    #> 1      A          1
    #> 2      A          1
    #> 3      B          2
    #> 4      B          2
    #> 5      B          2
    #> 6      B          2
    #> 7      C          3
    #> 8      C          3
    #> 9      C          3
    #> 10     A          4
    #> 11     A          4
    #> 12     B          5
    #> 13     B          5
    #> 14     B          5
    #> 15     B          5
    #> 16     C          6
    #> 17     C          6
    #> 18     C          6
    

    or in dplyr:

    library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0

    tibble(group) %>% 
      mutate(plot_group = consecutive_id(group))
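
    Since the question says the blocks are defined by a combination of variables, it may help that consecutive_id() accepts several vectors and starts a new id whenever any of them changes. A minimal sketch, where `panel` is a hypothetical second variable made up for illustration (it is not in the original data):

    ```r
    library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0

    group <- c(rep(c(rep('A', 2), rep('B', 4), rep('C', 3)), 2))
    # `panel` is a hypothetical second blocking variable, invented for this sketch
    panel <- rep(c('x', 'y'), times = c(5, 13))

    # a new plot_group starts whenever group OR panel differs from the previous row
    blocks <- tibble(group, panel) %>%
      mutate(plot_group = consecutive_id(group, panel))

    blocks
    ```

    With this made-up `panel`, the first run of "B" is split where `panel` changes, so there are seven blocks instead of six.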
    

    Created on 2025-03-18 with reprex v2.1.1
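
Another base R option, offered as a sketch rather than the one right way: compare each element with the one before it and take a running sum of the change points. This sidesteps the "function can't refer to its own output" problem from the question, because the comparison only looks at the lagged input (group), never at plot_group itself. It assumes group is non-empty and contains no NA values.

```r
group <- c(rep(c(rep('A', 2), rep('B', 4), rep('C', 3)), 2))

# TRUE wherever a value differs from the one before it;
# the first element always starts a new block
is_new_block <- c(TRUE, group[-1] != group[-length(group)])

# the running count of block starts is the block id
plot_group <- cumsum(is_new_block)

data.frame(group, plot_group)
```

For this group vector the ids come out the same as in the for loop version (1 1 2 2 2 2 3 3 3 4 4 5 5 5 5 6 6 6), just as integers instead of doubles.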