rdplyrrsample

Unnesting deep lists after applying the rolling_origin function from the rsample package


I have some data which looks like:

head:

  dfID  date       group groupValues
  <chr> <date>     <chr>       <dbl>
1 df1   2020-03-01 grp1       0.175 
2 df1   2020-03-01 grp2       0.150 
3 df1   2020-03-01 grp3       0.0509

tail:

  dfID  date       group groupValues
  <chr> <date>     <chr>       <dbl>
1 df3   2020-06-29 grp7        0.705
2 df3   2020-06-29 grp8        0.473
3 df3   2020-06-29 grp9        0.900

Which is a time series data with 3 unique dfID's and 9 unique group's. Filtering the date column to a single day I have (3 df's and 9 groups):

df %>% 
  filter(date == "2020-03-01")

  dfID  date       group groupValues
  <chr> <date>     <chr>       <dbl>
1 df1   2020-03-01 grp1       0.175 
2 df1   2020-03-01 grp2       0.150 
3 df1   2020-03-01 grp3       0.0509
4 df2   2020-03-01 grp4       0.133 
5 df2   2020-03-01 grp5       0.779 
6 df2   2020-03-01 grp6       0.506 
7 df3   2020-03-01 grp7       0.868 
8 df3   2020-03-01 grp8       0.552 
9 df3   2020-03-01 grp9       0.274

I next want to split the data up into a pairwise combination of the dfID's.

combinedSplit <- combn(levels(as.factor(df$dfID)), m = 2, FUN = function(x)
  df %>% 
    filter(dfID %in% x), simplify = FALSE)

names(combinedSplit) <- combn(levels(as.factor(df$dfID)), m = 2, str_c, collapse="_")

So now I have a list of 3 dfs - one for each pairwise combination of the dfID's:

df1_df2
df1_df3
df2_df3

Now, I want to apply the rolling_origin function from the rsample package to split the time series into training and testing data. I can naively apply the function by mapping over the lists.

rolledData <-  combinedSplit %>% 
  map(., ~ rolling_origin(
    data = .,
    initial = 60,
    assess = 1,
    cumulative = FALSE,
    skip = 0
  )
)

I can access the tail of the first split for the combination df1_df2.

map(rolledData$df1_df2$splits, ~analysis(.x))[[1]] %>% tail()

  dfID  date       group groupValues
  <chr> <date>     <chr>       <dbl>
1 df1   2020-03-19 grp1        0.528
2 df1   2020-03-19 grp2        0.394
3 df1   2020-03-19 grp3        0.532
4 df1   2020-03-20 grp1        0.586
5 df1   2020-03-20 grp2        0.369
6 df1   2020-03-20 grp3        0.153

Which is incorrect. In the rolling_origin function I stated the training/assessment period to be 60 perids (days) but this data ends on the 20th March. This is because it is taking the first 60 observations of the data not of the time series (3 group's * 20 days).

So I want to apply the rolling_origin function to each of the grp's - with each grp having 60 day rolling windows.

Here, I thought it would be best to first nest() the group's and then apply the rolling_origin function, since then each of the group's would be separate and 60 days will correspond to grp1, grp2 and grp3. Then, unnest() the group's to put grp1, grp2 and grp3 back into the same data frame.

nestedRolledData <- map(combinedSplit, ~group_by(.x, group) %>% 
      nest() %>% 
      mutate(
        rolledData = map(data, ~.x %>% 
                           rolling_origin(
                             data = .,
                             initial = 60,
                             assess = 0,
                             cumulative = FALSE,
                             skip = 0
                           )
                         )
        )
)

I am having difficulty with this unnesting.

Inspecting the structure of the lists:

I can access the first splits using map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[1]] which gives me a data frame of 60 observations starting on 2020-03-01 and ending 2020-04-29. For the second split in this list map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[2]] I have another data frame containing 60 observations starting this time on the 2020-03-02 and ending on the 2020-04-30 (so this data has been shifted by a single day). I can do this up until [[63]] where it starts on 2020-05-02 and ends on 2020-06-30 (which is the last day in my data).

This is what I want - i.e. the data made the correct time series splits for each grp. Now I want to unnest these and put them back into the correct data frames. Going back to layer 3 of the list which contains [[1]], ... , [[6]] 6 lists. These correspond to each of the 3 groups for the 2 dfIDs. So I essentially want to merge these lists together.

If I were to contruct these manually it might look like:

# grp 1:6 for rolling_origin split 1
map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[2]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[3]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[4]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[5]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[6]]$splits, ~analysis(.x))[[1]]

# grp 1:6 for rolling_origin split 2
map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[2]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[3]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[4]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[5]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[6]]$splits, ~analysis(.x))[[2]]

# ...

# grp 1:6 for rolling_origin split 63
map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[2]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[3]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[4]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[5]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[6]]$splits, ~analysis(.x))[[63]]

Data:

library(rsample)
df1 = data.frame(
  date = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-06-30"), by = "days"),
  grp1 = runif(122),
  grp2 = runif(122),
  grp3 = runif(122)
) %>% 
  pivot_longer(cols = where(is.numeric), names_to = "group", values_to = "groupValues") %>% 
  add_column(dfID = "df1")

df2 = data.frame(
  date = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-06-30"), by = "days"),
  grp4 = runif(122),
  grp5 = runif(122),
  grp6 = runif(122)
) %>% 
  pivot_longer(cols = where(is.numeric), names_to = "group", values_to = "groupValues") %>% 
  add_column(dfID = "df2")

df3 = data.frame(
  date = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-06-30"), by = "days"),
  grp7 = runif(122),
  grp8 = runif(122),
  grp9 = runif(122)
) %>% 
  pivot_longer(cols = where(is.numeric), names_to = "group", values_to = "groupValues") %>% 
  add_column(dfID = "df3")

df <- bind_rows(df1, df2, df3) %>% 
  relocate(dfID, .before = date)



map(rolledData$splits, ~analysis(.x))[[1]] %>% tail()

combinedSplit <- combn(levels(as.factor(df$dfID)), m = 2, FUN = function(x)
  df %>% 
    filter(dfID %in% x), simplify = FALSE)

names(combinedSplit) <- combn(levels(as.factor(df$dfID)), m = 2, str_c, collapse="_")

rolledData <-  combinedSplit %>% 
  map(., ~ rolling_origin(
    data = .,
    initial = 60,
    assess = 1,
    cumulative = FALSE,
    skip = 0
  )
)

map(rolledData$df1_df2$splits, ~analysis(.x))[[1]] %>% tail()


nestedRolledData <- map(combinedSplit, ~group_by(.x, group) %>% 
      nest() %>% 
      mutate(
        rolledData = map(data, ~.x %>% 
                           rolling_origin(
                             data = .,
                             initial = 60,
                             assess = 0,
                             cumulative = FALSE,
                             skip = 0
                           )
                         )
        )
)

Solution

  • May be we need a 3 nested map

    library(purrr)
    library(rsample)
    out <- map(nestedRolledData, ~ map(.x$rolledData, ~ map(.x$splits, analysis)))
    
    str(out, max.level = 3)
    #List of 3
    # $ df1_df2:List of 6
    #  ..$ :List of 63
    #  .. ..$ : tibble [60 × 3] (S3: tbl_df/tbl/data.frame)
    #  .. ..$ : tibble [60 × 3] (S3: tbl_df/tbl/data.frame)
    #  .. ..$ : tibble [60 × 3] (S3: tbl_df/tbl/data.frame)
    #  .. ..$ : tibble [60 × 3] (S3: tbl_df/tbl/data.frame)
    #  .. ..$ : tibble [60 × 3] (S3: tbl_df/tbl/data.frame)
    # ..