rfor-loopdplyr

iteratively assign values with case_when and for loops


I have a dataset that has a bunch of columns for different base variables, and each of those base variables has a year in the name that those values are relevant for. I need to create one column for each base variable that is assigned based on the year of the row. I could just use case_when and list each year and name each base variable+year individually, but that's not efficient and there are multiple base variables with over 20 years listed for each. I've tried a) using a function to write the guts of the case_when function, and b)using case_when in a loop, c) using if in a for loop. The first 2 both result in missings and the last in unevaluated string (the name of the correct variable). I know that I must be missing something pretty fundamental (maybe a different something for each strategy). Reprex below.

dataf <- data.frame(alldate = sample(seq(as.Date('2000/01/01'), as.Date('2005/01/01'), by="day"), 20),
                    var2000 = rnorm(20), 
                    var2001 = rnorm(20),
                    var2002 = rnorm(20),
                    var2003 = rnorm(20),
                    var2004 = rnorm(20),
                    var2005 = rnorm(20))


#a) using a function to write the guts of the case_when function (returns NAs)
assign_var_year <- function(yearstart = 2000, yearend = 2005, base_var = "var" ){
  for(i in seq_along(yearstart:(1+yearend))-1){
    year(alldate) == i+yearstart ~ str_c(base_var, i+yearstart)
  }
}

dataf2 <- dataf %>% 
  mutate(var_test = case_when(
    assign_var_year(),
    T~NA_real_
  ))

dataf2$var_test %>% head


#b) using case_when in a loop (returns NAs)
for(i in seq_along(min(year(dataf$alldate)):(1+max(year(dataf$alldate))))-1){
  dataf$var_test <- case_when(
    year(dataf$alldate) == min(year(dataf$alldate))+i~str_c("var", i+yearstart)
  )
}

dataf$var_test %>% head

#c) using if in a for loop (returns name of correct variable)(see also below equivalent to ##this using case_when)
dataf$var_test <- NA_real_

for(i in seq_along(min(year(dataf$alldate)):(1+max(year(dataf$alldate))))-1){
  dataf$var_test <- ifelse(
    year(dataf$alldate) == min(year(dataf$alldate))+i,eval(str_c("var", i+yearstart)),dataf$var_test
  )
}
dataf %>% select(alldate, var_test) %>% head

#same as above with case_when
for(i in 0:5){
  dataf <- dataf %>%
    mutate(var_test = case_when(
      year(alldate) == 2000+i ~ str_c("var", 2000+i),
      T~var_test,
    ))
    }
dataf$var_test %>% head


Solution

  • I would approach this problem quite differently. Your data is not tidy, and therefore hard to work with. I would first re-organize the data, after which the problem can solved more easily.

    library(tidyverse)
    

    I first put the data in long format, making one column for each variable but spreading the years across rows. The code below generalizes if you have other variables than var, which will be added as additional columns:

    df1 <- dataf |> 
      pivot_longer(
        -alldate, 
        names_to = c(".value", "year"), 
        names_pattern = "([a-zA-Z]+)(\\d{4})", 
        names_transform = list(year = as.numeric)
      )
    
    # A tibble: 120 × 3
       alldate     year     var
       <date>     <dbl>   <dbl>
     1 2004-06-11  2000  1.77  
     2 2004-06-11  2001  0.460 
     3 2004-06-11  2002 -0.692 
     4 2004-06-11  2003 -2.15  
     5 2004-06-11  2004  0.328 
     6 2004-06-11  2005  1.49  
     7 2002-09-19  2000 -0.0568
     8 2002-09-19  2001 -0.735 
     9 2002-09-19  2002 -1.65  
    10 2002-09-19  2003  0.745 
    # ℹ 110 more rows
    # ℹ Use `print(n = ...)` to see more rows
    

    We can then simply filter on matching years:

    filter(df1, year(alldate) == year)
    
    # A tibble: 20 × 3
       alldate     year    var
       <date>     <dbl>  <dbl>
     1 2004-06-11  2004  0.328
     2 2002-09-19  2002 -1.65 
     3 2003-06-30  2003  0.624
     4 2001-10-09  2001 -0.502
     5 2000-07-24  2000  0.123
     6 2004-10-05  2004  1.34 
     7 2000-04-29  2000  0.102
     8 2003-08-22  2003  1.32 
     9 2001-02-19  2001 -2.15 
    10 2003-12-02  2003  1.17 
    11 2002-09-17  2002  2.09 
    12 2000-11-16  2000 -1.11 
    13 2001-10-13  2001  0.909
    14 2002-12-22  2002 -0.275
    15 2004-06-04  2004 -0.722
    16 2004-06-19  2004 -0.626
    17 2004-01-28  2004 -0.300
    18 2004-02-21  2004 -0.313
    19 2003-04-14  2003  0.709
    20 2004-03-17  2004  1.41