rsortingdplyrlaglarge-data

strange behavior of lag() in R


I am using a code to filter out a smaller dataset from a larger dataset. I am selecting children under the age of 24 months and another variable (b9) which says if child is living with mother or not. However, there are situations where there could be more than one child under the age of 24 months living with mohter, in this case I want to select only the youngest child arraging by caseid and birth history (bidx). I am using the following code, but it gives me an empty dataframe. I am not getting any error , the code runs but it returns an empty data frame which it should not. It is a logical error.

###SAMPLE Dataframe

structure(list(v001 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), v002 = c(107, 
107, 107, 107, 113, 113, 113, 119, 125, 131, 137, 137, 137, 143, 
143, 143, 149, 149, 15, 15, 15, 15, 15, 15, 156, 156, 21, 21, 
3, 3, 33, 39, 46, 52, 52, 52, 58, 64, 64, 64, 64, 76, 76, 76, 
76, 82, 82, 82, 82, 82), caseid = c("       1 107  2", "       1 107  2", 
"       1 107 10", "       1 107 10", "       1 113  2", "       1 113  2", 
"       1 113  2", "       1 119  3", "       1 125  2", "       1 131  2", 
"       1 137  2", "       1 137  2", "       1 137  2", "       1 143  6", 
"       1 143  6", "       1 143 10", "       1 149  4", "       1 149  4", 
"       1  15  7", "       1  15 18", "       1  15 21", "       1  15 25", 
"       1  15 27", "       1  15 27", "       1 156  2", "       1 156  8", 
"       1  21  2", "       1  21  7", "       1   3  4", "       1   3  4", 
"       1  33  3", "       1  39 11", "       1  46  3", "       1  52  2", 
"       1  52  5", "       1  52  5", "       1  58  4", "       1  64  2", 
"       1  64  3", "       1  64  8", "       1  64  8", "       1  76 22", 
"       1  76 25", "       1  76 29", "       1  76 29", "       1  82  7", 
"       1  82  7", "       1  82 15", "       1  82 21", "       1  82 21"
), v021 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), v022 = c(2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2), v023 = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), bidx = c(1, 2, 
1, 2, 1, 2, 3, 1, 1, 1, 1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 
2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 
2, 1, 2, 1, 1, 2), b9 = c(0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0), age = c(30, 
40, 2, 41, 3, 14, 41, 21, 19, 5, 1, 27, 57, 10, 59, 16, 7, 53, 
29, 50, 33, 5, 2, 35, 0, 10, 31, 22, 0, 55, 14, 17, 1, 34, 23, 
54, 6, 55, 0, 11, 45, 43, 29, 43, 57, 5, 44, 27, 5, 39)), row.names = c(NA, 
50L), class = "data.frame")


#create subset of KRfile to select for children under age of 24 months living with their mother

KRiycf <- KRdata %>%
  subset(age < 24 & b9==0 & !is.na(b9)) %>% # children under 24 months living at home
  arrange(caseid, bidx) %>% # make sure the data is sorted
  subset(is.na(lag(caseid)) | caseid!=lag(caseid)) # select just the youngest

#Above code gives me an empty dataframe

I tried couple of things

KRiycf <- KRdata %>%
  subset(age < 24 & b9 == 0) %>%
  arrange(caseid, bidx) %>%
  subset(is.na(lag(caseid, default = NA)) | caseid != lag(caseid, default = NA))
##If lag() not returning NA properly, it could be an issue so I tried this but did not work.

KRiycf <- KRdata %>%
  subset(age < 24 & b9==0 & !is.na(b9)) %>% 
  arrange(caseid, bidx) %>% 
  subset(is.na(lag(caseid)) | caseid!=lag(caseid))
#I thought if there was a NA value in b9 there could be issue with sorting, but that di also not work. 

Solution

  • If your goal is to get only the youngest child by caseid that meets your criteria, I dont think you need lag() or arrange() at all. Instead you can use dplyr::slice_min() by caseid:

    library(dplyr) 
    
    KRdata %>%
      subset(age < 24 & b9 %in% 0) %>%
      slice_min(age, by = caseid)
    

    Output:

    # for readability to check if this matched your 
    # desired output, adding a line to arrange by your
    # original variables, though solely cosmetic
    
    KRdata %>%
      subset(age < 24 & b9 %in% 0) %>%
      slice_min(age, by = caseid) %>%
      arrange(caseid, bidx)
    
       v001 v002          caseid v021 v022 v023 bidx b9 age
    1     1    3        1   3  4    1    2    2    1  0   0
    2     1   15        1  15 25    1    2    2    1  0   5
    3     1   15        1  15 27    1    2    2    1  0   2
    4     1   21        1  21  7    1    2    2    1  0  22
    5     1   33        1  33  3    1    2    2    1  0  14
    6     1   39        1  39 11    1    2    2    1  0  17
    7     1   46        1  46  3    1    2    2    1  0   1
    8     1   52        1  52  5    1    2    2    1  0  23
    9     1   58        1  58  4    1    2    2    1  0   6
    10    1   64        1  64  3    1    2    2    1  0   0
    11    1   64        1  64  8    1    2    2    1  0  11
    12    1   82        1  82  7    1    2    2    1  0   5
    13    1   82        1  82 21    1    2    2    1  0   5
    14    1  107        1 107 10    1    2    2    1  0   2
    15    1  113        1 113  2    1    2    2    1  0   3
    16    1  119        1 119  3    1    2    2    1  0  21
    17    1  125        1 125  2    1    2    2    1  0  19
    18    1  131        1 131  2    1    2    2    1  0   5
    19    1  137        1 137  2    1    2    2    1  0   1
    20    1  143        1 143  6    1    2    2    1  0  10
    21    1  143        1 143 10    1    2    2    1  0  16
    22    1  149        1 149  4    1    2    2    1  0   7
    23    1  156        1 156  2    1    2    2    1  0   0
    24    1  156        1 156  8    1    2    2    1  0  10
    

    You didnt provide your desired output with these data, so if this isnt exactly correct please let me know and I'll update!