rselectdplyrsequencezoo

Select rows per group with consecutive 1s


I have a dataframe that I want to subset by selecting only consecutive values of "1".

Specifically, I have a dataframe that looks like this:

library(tidyverse)
library(zoo)

df <- data.frame(matrix(ncol = 3, nrow = 17))
colnames(df) <- c("row_id","id", "k_yes")
df$row_id <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17)
df$id <- c("1_1","1_1","1_1","1_1","1_1","1_1","1_2","1_2","1_2","1_2","1_2","1_2","1_3","1_3","1_3","1_3","1_3")
df$k_yes <- c(1,1,1,0,1,1,0,0,0,1,1,0,1,0,1,0,1)
df
   row_id    id  k_yes
1       1   1_1      1
2       2   1_1      1
3       3   1_1      1
4       4   1_1      0
5       5   1_1      1
6       6   1_1      1
7       7   1_2      0
8       8   1_2      0
9       9   1_2      0
10     10   1_2      1
11     11   1_2      1
12     12   1_2      0
13     13   1_3      1
14     14   1_3      0 
15     15   1_3      1
16     16   1_3      0
17     17   1_3      1

And I want to create two datasets:

1) One that, per group (id), only has "1"'s, but always two or more consecutive. So, if there is a "0" between two "1", at least the last "1" should be discarded. So it would accept sequences like e.g. 1-1 , 1-1-1, and so on, but not 0-1-1 or 1-1-1-0

Then, I also would like a column to be added to know which new group/sequence has been formed, as it will not always be the same as id (if there are two subgroups/sequences within id). Basically, this column should have a unique code per sequence for the whole dataframe (now, I just binded the id with a letter, but it can be a consecutive number/letter for example)

This dataframe would look like this:

  row_id    id k_yes  new_group
1      1   1_1     1      1_1_A
2      2   1_1     1      1_1_A
3      3   1_1     1      1_1_A
4      5   1_1     1      1_1_B
5      6   1_1     1      1_1_B
6     10   1_2     1      1_2_A
7     11   1_2     1      1_2_A

2) Another one that, per group (id), accepts one "0" in between "1"s, but not if there are no other "1"s after the "0". So it would accept sequences like e.g. 1-0-1, 1-1-0-1, 1-1-0-1-1, 1-1-0-1-0-1, and so on, but not sequences like 0-1-1 or 1-1-0 or 1-1-0-0-1 (in the later, it would only keep the first 1's). And the same as before for the "new_group" column.

The desired output from df would be:

   row_id    id k_yes  new_group
1       1   1_1     1      1_1_A
2       2   1_1     1      1_1_A
3       3   1_1     1      1_1_A
4       4   1_1     0      1_1_A
5       5   1_1     1      1_1_A
6       6   1_1     1      1_1_A
7      10   1_2     1      1_2_A
8      11   1_2     1      1_2_A
9      12   1_3     1      1_3_A
10     14   1_3     0      1_3_A
11     15   1_3     1      1_3_A
12     16   1_3     0      1_3_A
13     17   1_3     1      1_3_A

In this case, id "1_1" gets only one value for "new group" (1_1_A) because it's all the same sequence (with the 0 included)

I tried following this answer but it didn't work out, as I tried:

> df |>
     group_by(id) |> 
     mutate(b = c(first(k_yes) , zoo::rollsum(k_yes, 1))) |>
     summarise(groups_to_keep = id[which(b >= 2)]) -> gk

Error in `mutate()`:
ℹ In argument: `b = c(first(k_yes), zoo::rollsum(k_yes, 1))`.
ℹ In group 1: `id = "1_1"`.
Caused by error:
! `b` must be size 5 or 1, not 6.

I think the issue is the rollsum() function, but after checking the help page it was still not clear to me how this function should be applied.

Any help is appreciated!

Update:

I added a new version of the dataset, which would create two values for "new group" in the first dataset (1).


Solution

  • For dataset #1:

    df |>
      filter(k_yes & (lag(k_yes) | lead(k_yes)), .by = id) |>
      mutate(new_group = row_number() == 1 | row_id - 1 != lag(row_id), .by = id) |>
      mutate(new_group = paste0(id, "_", LETTERS[cumsum(new_group)]), .by = id)
    

    Dataset 2:

    df |>
      filter(k_yes | (!k_yes & lag(k_yes) & lead(k_yes)), .by = id) |> 
      mutate(new_group = row_number() == 1 | row_id - 1 != lag(row_id), .by = id) |>
      mutate(new_group = paste0(id, "_", LETTERS[cumsum(new_group)]), .by = id)