In R, trying to implement the following filtering logic by group on a large dataset:
Within each group:
If more than one L, keep the row with the lowest value of L.
If more than one N, keep the row with the highest value of N.
If both L and N, remove any row where N is higher than L.
If both L and N, keep the row with the highest value of N below the lowest value of L (in addition to lowest value of L).
Keep all values of B.
sample data:
dat <- data.frame(group=c("AB","AB","AB","AB","BC","BC","B","B","AD","AD","AD","G"),
type=c("B","L","N","N","N","L","N","N","B","L","L","L"),
value=c(2,4,3,2,5,3,8,9,4,3,9,7))
desired output:
desired_output <- data.frame(group=c("AB","AB","AB","BC","B","AD","AD","G"),
type=c("B","L","N","L","N","B","L","L"),
value=c(2,4,3,3,9,4,3,7))
Looking for a dplyr/tidyr solution. I've tried filtering logic after pivot_wider or case_when within filter, but I haven't gotten very close. I was expecting this to be simple but applying the logic across columns has me stumped.
This is along the lines of what I was thinking, but it does not produce the desired output (e.g., for L takes min across all types within group instead of only within L):
df <- dat %>%
group_by(group) %>%
filter(type=="B"|type=="L" & value==min(value)|type=="N" & value==max(value))
You may try :
### Packages
library(dplyr)
library(tidyr)
### Data
dat <- data.frame(group=c("AB","AB","AB","AB","BC","BC","B","B","AD","AD","AD","G"),
type=c("B","L","N","N","N","L","N","N","B","L","L","L"),
value=c(2,4,3,2,5,3,8,9,4,3,9,7))
### We add the number of L and N for each group
dat2=dat %>%
group_by(group) %>%
mutate(nb_L = sum (type == "L"),
nb_N = sum (type == "N")) %>%
ungroup()
### We create 3 dataframes that respect your conditions
a=dat2 %>% group_by(group) %>% filter(nb_L>1&type=="L") %>% slice_min(value,n = 1) %>% ungroup()
b=dat2 %>% group_by(group) %>% filter(nb_N>1&type=="N") %>% slice_max(value,n=1) %>% ungroup()
c=dat2 %>% group_by(group) %>% filter(type=="B"|(nb_L<=1&type=="L")|(nb_N<=1&type=="N")) %>% ungroup()
### We stack the dataframes
dat2=bind_rows(a,b,c) %>% ungroup()
### We add the value of L and N for each group
### We remove the rows regarding the rest of your criterias
dat2=dat2 %>%
group_by(group) %>%
mutate(val_L = ifelse(type == "L", value, NA_real_),
val_N = ifelse(type == "N", value, NA_real_)) %>%
fill(c(val_L,val_N), .direction = "downup") %>%
mutate(across(c(val_L,val_N),~replace_na(.x,0)),
keep=case_when(nb_L>0&type=="N"&val_N>val_L~"remove",.default = "keep")) %>%
filter(keep=="keep") %>%
select(group,type,value) %>%
arrange(group,type) %>%
ungroup()
Output :
# A tibble: 8 × 3
group type value
<chr> <chr> <dbl>
1 AB B 2
2 AB L 4
3 AB N 3
4 AD B 4
5 AD L 3
6 B N 9
7 BC L 3
8 G L 7