Tags: r, dplyr, resampling, purrr, modelr

Using modelr's bootstrap in R for medians


I have found that the following works:

iris %>% 
    select(Sepal.Length) %>% 
    modelr::bootstrap(100) %>% 
    mutate(mean = map(strap, mean))

but the following does not:

iris %>% 
    select(Sepal.Length) %>% 
    modelr::bootstrap(100) %>% 
    mutate(median = map(strap, median)) 

The only difference is that the second snippet uses median instead of mean.

The error I get is:

Error in mutate_impl(.data, dots) : Evaluation error: unimplemented type 'list' in 'greater' .

Solution

  • The first snippet looks like it's working, but if you unnest the result, you're actually just getting a lot of NAs, because you're taking the mean of a resample object: a classed list holding a reference to the original data and the row indices for that particular resample. Taking the mean of such a list is not meaningful, so mean returns NA with a warning. median, by contrast, has to compare elements in order to sort them, which is presumably where the 'greater' error on a list comes from. To get either function to work, coerce the resample to a data frame inside map's anonymous function, where you can operate on it as usual.
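
    To see this for yourself, you can pull out a single resample and inspect it. This is only a small sketch: the names boot, rs, m and df are made up for illustration, and the number of resamples (3) is arbitrary.

    ```r
    library(modelr)

    set.seed(47)

    # a tiny bootstrap, just to have one resample object to look at
    boot <- bootstrap(iris["Sepal.Length"], 3)
    rs <- boot$strap[[1]]

    class(rs)            # "resample"
    names(unclass(rs))   # the underlying list holds the data and the row indices

    # mean() on the list itself has nothing numeric to average,
    # so it returns NA (with a warning, suppressed here)
    m <- suppressWarnings(mean(rs))

    # coercing first materialises the resampled rows as a data frame
    df <- as.data.frame(rs)
    mean(df$Sepal.Length)
    ```

    Because the resample only stores indices into the original data, the rows are not copied until you coerce it, which is why the coercion step has to happen before any summary function.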

    For a direct route, extract the data and take the mean, simplifying the list to a numeric vector with map_dbl:

    library(tidyverse)
    set.seed(47)
    
    iris %>% 
        select(Sepal.Length) %>% 
        modelr::bootstrap(100) %>% 
        mutate(sepal_mean = map_dbl(strap, ~mean(as_data_frame(.x)$Sepal.Length))) 
    #> # A tibble: 100 x 3
    #>             strap   .id sepal_mean
    #>            <list> <chr>      <dbl>
    #>  1 <S3: resample>   001   5.844000
    #>  2 <S3: resample>   002   6.016000
    #>  3 <S3: resample>   003   5.851333
    #>  4 <S3: resample>   004   5.869333
    #>  5 <S3: resample>   005   5.840667
    #>  6 <S3: resample>   006   5.825333
    #>  7 <S3: resample>   007   5.824000
    #>  8 <S3: resample>   008   5.790000
    #>  9 <S3: resample>   009   5.858000
    #> 10 <S3: resample>   010   5.810000
    #> # ... with 90 more rows
    

    Translating this approach to median works fine:

    iris %>% 
        select(Sepal.Length) %>% 
        modelr::bootstrap(100) %>% 
        mutate(sepal_median = map_dbl(strap, ~median(as_data_frame(.x)$Sepal.Length)))
    #> # A tibble: 100 x 3
    #>             strap   .id sepal_median
    #>            <list> <chr>        <dbl>
    #>  1 <S3: resample>   001          5.9
    #>  2 <S3: resample>   002          5.8
    #>  3 <S3: resample>   003          5.8
    #>  4 <S3: resample>   004          5.7
    #>  5 <S3: resample>   005          5.7
    #>  6 <S3: resample>   006          5.8
    #>  7 <S3: resample>   007          5.8
    #>  8 <S3: resample>   008          5.7
    #>  9 <S3: resample>   009          5.8
    #> 10 <S3: resample>   010          5.7
    #> # ... with 90 more rows
    

    If you'd like both median and mean, you could repeatedly coerce the resample to a data frame, or store it in another column, but neither approach is very efficient. It's better to return a list of data frames with map that can be unnested:

    iris %>% 
        select(Sepal.Length) %>% 
        modelr::bootstrap(100) %>% 
        mutate(stats = map(strap, ~summarise_all(as_data_frame(.x), funs(mean, median)))) %>% 
        unnest(stats)
    #> # A tibble: 100 x 4
    #>             strap   .id     mean median
    #>            <list> <chr>    <dbl>  <dbl>
    #>  1 <S3: resample>   001 5.744667   5.60
    #>  2 <S3: resample>   002 5.725333   5.70
    #>  3 <S3: resample>   003 5.808667   5.70
    #>  4 <S3: resample>   004 5.809333   5.70
    #>  5 <S3: resample>   005 5.964000   5.85
    #>  6 <S3: resample>   006 5.931333   5.95
    #>  7 <S3: resample>   007 5.838667   5.80
    #>  8 <S3: resample>   008 5.926000   5.95
    #>  9 <S3: resample>   009 5.855333   5.75
    #> 10 <S3: resample>   010 5.888667   5.70
    #> # ... with 90 more rows
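
    Note that as_data_frame() and funs() have both been deprecated in later releases of tibble and dplyr. On a current installation, one possible way to write the same thing is sketched below. It relies on the as.data.frame() method that modelr provides for resamples and builds the summary row by hand; boot_stats, s and d are just illustrative names, and this is one option rather than the only way to do it.

    ```r
    library(dplyr)
    library(purrr)
    library(tidyr)

    set.seed(47)

    boot_stats <- iris %>%
        select(Sepal.Length) %>%
        modelr::bootstrap(100) %>%
        mutate(stats = map(strap, function(s) {
            # coerce each resample once, then compute both summaries from it
            d <- as.data.frame(s)
            tibble(mean = mean(d$Sepal.Length),
                   median = median(d$Sepal.Length))
        })) %>%
        # spread the one-row tibbles into regular mean and median columns
        unnest(stats)

    boot_stats
    ```

    The result has the same shape as above: one row per resample, with strap, .id, mean and median columns.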