rdplyrfilteringtidyr

filter for layered logic across columns - values higher or lower by type


In R, trying to implement the following filtering logic by group on a large dataset:

Within each group:

If more than one L, keep the row with the lowest value of L.

If more than one N, keep the row with the highest value of N.

If both L and N, remove any row where N is higher than L.

If both L and N, keep the row with the highest value of N below the lowest value of L (in addition to lowest value of L).

Keep all values of B.

sample data:

dat <- data.frame(group=c("AB","AB","AB","AB","BC","BC","B","B","AD","AD","AD","G"),
type=c("B","L","N","N","N","L","N","N","B","L","L","L"),
value=c(2,4,3,2,5,3,8,9,4,3,9,7))

enter image description here

desired output:

desired_output <- data.frame(group=c("AB","AB","AB","BC","B","AD","AD","G"),
type=c("B","L","N","L","N","B","L","L"),
value=c(2,4,3,3,9,4,3,7))

enter image description here

Looking for a dplyr/tidyr solution. I've tried filtering logic after pivot_wider or case_when within filter, but I haven't gotten very close. I was expecting this to be simple but applying the logic across columns has me stumped.

This is along the lines of what I was thinking, but it does not produce the desired output (e.g., for L takes min across all types within group instead of only within L):

df <- dat %>%
group_by(group) %>% 
filter(type=="B"|type=="L" & value==min(value)|type=="N" & value==max(value))

Solution

  • You may try :

    ### Packages
    library(dplyr)
    library(tidyr)
    
    ### Data
    dat <- data.frame(group=c("AB","AB","AB","AB","BC","BC","B","B","AD","AD","AD","G"),
                      type=c("B","L","N","N","N","L","N","N","B","L","L","L"),
                      value=c(2,4,3,2,5,3,8,9,4,3,9,7))
    
    ### We add the number of L and N for each group
    dat2=dat %>%
      group_by(group) %>%
      mutate(nb_L = sum (type == "L"),
             nb_N = sum (type == "N")) %>%
      ungroup()
    
    ### We create 3 dataframes that respect your conditions
    a=dat2 %>% group_by(group) %>% filter(nb_L>1&type=="L") %>% slice_min(value,n = 1) %>% ungroup()
    b=dat2 %>% group_by(group) %>% filter(nb_N>1&type=="N") %>% slice_max(value,n=1) %>% ungroup()
    c=dat2 %>% group_by(group) %>% filter(type=="B"|(nb_L<=1&type=="L")|(nb_N<=1&type=="N")) %>% ungroup()
    
    ### We stack the dataframes
    dat2=bind_rows(a,b,c) %>% ungroup()
    
    ### We add the value of L and N for each group
    ### We remove the rows regarding the rest of your criterias
    dat2=dat2 %>%
      group_by(group) %>%
      mutate(val_L = ifelse(type == "L", value, NA_real_),
             val_N = ifelse(type == "N", value, NA_real_)) %>%
      fill(c(val_L,val_N), .direction = "downup") %>%
      mutate(across(c(val_L,val_N),~replace_na(.x,0)),
             keep=case_when(nb_L>0&type=="N"&val_N>val_L~"remove",.default = "keep")) %>% 
      filter(keep=="keep") %>%
      select(group,type,value) %>% 
      arrange(group,type) %>% 
      ungroup()
    

    Output :

    # A tibble: 8 × 3
      group type  value
      <chr> <chr> <dbl>
    1 AB    B         2
    2 AB    L         4
    3 AB    N         3
    4 AD    B         4
    5 AD    L         3
    6 B     N         9
    7 BC    L         3
    8 G     L         7