rbootstrappingstatistics-bootstraptidymodelsrsample

Block Bootstrapping using Tidymodels


I have a monthly (Jan - Dec) data set for weather and crop yield. This data is collected for multiple years (2002 - 2019). My aim is to obtain bootstrapped slope coefficient of the affect of temperature in each month on yield gap. In bootstrapping, I want to block the year information in a way that the function should randomly sample data from a specific year in each bootstrap rather than choosing rows from mixed years.

I read some blogs and tried different methods but I am not confident about those. I tried to disect the bootstrapped splits to ensure if I am doing it correctly but I was not.

Here is the starting code:

# Load libraries
library(readxl)
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(reprex)

# data 
ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr   (3): ID, Location, Month
#> dbl  (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date  (1): Date
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.

ww_wt %>% 
  select(Year, Month, gap, temp) %>%
  head()
#> # A tibble: 6 x 4
#>    Year Month       gap  temp
#>   <dbl> <chr>     <dbl> <dbl>
#> 1  2002 September 0.282 13.6 
#> 2  2002 October   0.282 13.3 
#> 3  2002 November  0.282  7.07
#> 4  2002 December  0.282  3.44
#> 5  2002 January   0.282  5.61
#> 6  2002 February  0.282  6.93

# Bootstrapping
set.seed(123)

boots <- ww_wt %>% 
  ungroup() %>% 
  select(Year, Month, gap, temp) %>%
  nest(data = -c(Month)) %>% 
  mutate(boots = map(data, ~bootstraps(.x, times = 100, apparent = FALSE))) %>%
  unnest(boots) %>% 
  mutate(model = map(splits, ~lm(gap ~ temp, data = analysis(.))),
         coefs = map(model, tidy))

Created on 2022-01-04 by the reprex package (v2.0.1)

I am nesting Months because I want to get slopes for each month separately. Also, the data for each Year has different sample size n because of different number of locations each year.


Solution

  • Thanks for providing these hints Julia. I think I have figured this out by just adding some extra lines of code.

    library(tidyverse)
    library(tidymodels)
    #> Registered S3 method overwritten by 'tune':
    #>   method                   from   
    #>   required_pkgs.model_spec parsnip
    library(viridis)
    #> Loading required package: viridisLite
    #> 
    #> Attaching package: 'viridis'
    #> The following object is masked from 'package:scales':
    #> 
    #>     viridis_pal
    
    ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
    #> New names:
    #> * `` -> ...1
    #> Rows: 1924 Columns: 20
    #> -- Column specification --------------------------------------------------------
    #> Delimiter: ","
    #> chr   (3): ID, Location, Month
    #> dbl  (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
    #> date  (1): Date
    #> 
    #> i Use `spec()` to retrieve the full column specification for this data.
    #> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
    
    set.seed(123)
    
    # Block bootstrapping
    boots <- ww_wt %>% 
      ungroup() %>% 
      select(Year, Month, gap, temp) %>%
      pivot_wider(names_from = Month, values_from = temp, values_fn = mean) %>% 
      bootstraps(times = 10, apparent = FALSE) %>% 
      mutate(splits = map(splits, analysis)) %>%
      unnest(splits) %>% 
      group_by(id) %>% 
      mutate(row = row_number()) %>% 
      pivot_longer(names_to = "Month", values_to = "temp", cols = September:August)
    
    # Bootstraps
    boots %>%
      group_by(id) %>% 
      ggplot(aes(x = id, y = row, fill = Year)) +
      geom_tile() +
      scale_fill_viridis(option = "B", direction = 1) +
      labs(x = NULL) +
      facet_wrap(~Month) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))
    

    The above plots clearly indicates that, within each bootstrap, data from each Year is being randomly sampled with replacement.

    Created on 2022-01-08 by the reprex package (v2.0.1)