Tags: r, function, tidyverse, package-development

Returning data frame as main result but also informative list as side effect


I'm writing a function where I want the main output to be a data frame (that can be piped to other functions), but I also want to allow users access to an informative list or vector of samples that were omitted from the final result. Are there best practices for how to go about this, or examples of functions/packages that do this well?

Currently I'm exploring returning the information as an attribute and throwing a warning that tells users they can access the list with attr(resulting-df, "omitted").

Any advice would be greatly appreciated, thank you!

library(dplyr)

iris <- iris %>%
  mutate(index = 1:nrow(.))

return_filtered <- function(df) {

  res <- filter(df, Sepal.Length > 6)
  # Compare against the function's input rather than the global iris object
  omitted <- setdiff(df$index, res$index)

  attr(res, "omitted") <- omitted
  return(res)

}

iris2 <- return_filtered(iris)
attributes(iris2)
#> $names
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
#> [6] "index"       
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> [51] 51 52 53 54 55 56 57 58 59 60 61
#> 
#> $omitted
#>  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
#> [20]  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38
#> [39]  39  40  41  42  43  44  45  46  47  48  49  50  54  56  58  60  61  62  63
#> [58]  65  67  68  70  71  79  80  81  82  83  84  85  86  89  90  91  93  94  95
#> [77]  96  97  99 100 102 107 114 115 120 122 139 143 150

Created on 2022-04-02 by the reprex package (v2.0.1)


Solution

  • The question is probably a little opinion-based, but I don't think it's off-topic, since there are certainly neater and more formal ways to achieve what you want than your current method.

    It's reasonable to hold the extra information as an attribute. If you do, though, it is more idiomatic and extensible to create an S3 class. That way you can hide the default printing of attributes, ensure your attributes are protected, and define a getter function so that users don't have to sift through multiple attributes to find the right one.

    First, we will tweak your function so that it works with any data frame and accepts any predicate, which it passes through to dplyr::filter as expected. The function also prepends a new class to the returned object's class attribute, so that it returns a new S3 object which inherits from data.frame.

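    return_filtered <- function(df, predicate) {
      # Capture the predicate unevaluated so it can be passed on to filter()
      predicate    <- rlang::enquo(predicate)
      # Temporary row index; seq_len() also copes with a 0-row input
      df$`..id..`  <- seq_len(nrow(df))
      res          <- dplyr::filter(df, !!predicate)
      # Indices of the rows that did not survive the filter
      filtered     <- setdiff(seq_len(nrow(df)), res$`..id..`)
      res$`..id..` <- NULL
      
      attr(res, "filtered") <- filtered
      class(res)            <- c("filtered", class(df))
      
      return(res)
    }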
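
    As an aside, rlang (0.4.0 and later) and dplyr also support the embrace operator {{ }}, which combines the enquo() and !! steps into one. Below is a minimal sketch of the same function in that style. The name return_filtered2 is hypothetical (not part of the original answer), and the sketch assumes dplyr is installed:

    ```r
    library(dplyr)

    # Hypothetical variant: {{ predicate }} replaces enquo() followed by !!
    return_filtered2 <- function(df, predicate) {
      df$`..id..` <- seq_len(nrow(df))            # temporary row index
      res <- dplyr::filter(df, {{ predicate }})   # forward the predicate to filter()
      omitted <- setdiff(df$`..id..`, res$`..id..`)
      res$`..id..` <- NULL

      attr(res, "filtered") <- omitted
      class(res) <- c("filtered", class(df))
      res
    }

    out <- return_filtered2(iris, Sepal.Length > 7)
    class(out)
    #> [1] "filtered"   "data.frame"
    length(attr(out, "filtered"))
    #> [1] 138
    ```

    Either spelling should behave the same here; {{ }} is simply shorter when the argument is only forwarded to another tidy-evaluation function.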
    

    We will define a print method so that the attributes don't show when we print our object:

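    print.filtered <- function(x, ...) {
      # Print a copy without the "filtered" class so the parent method is used
      y <- x
      class(y) <- class(y)[class(y) != "filtered"]
      print(y, ...)
      # By convention, print methods return their argument invisibly
      invisible(x)
    }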
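
    Another way to write a print method like this is to delegate to the next class in the class vector with NextMethod(), rather than stripping the class by hand. This is a sketch only; the small object x is built by hand purely for illustration:

    ```r
    # Alternative sketch: let S3 dispatch continue to the parent class's method
    print.filtered <- function(x, ...) {
      NextMethod()   # calls print.data.frame() for a c("filtered", "data.frame") object
      invisible(x)   # print methods conventionally return their input invisibly
    }

    # Hand-built stand-in for an object returned by return_filtered()
    x <- head(iris, 3)
    attr(x, "filtered") <- 4:150
    class(x) <- c("filtered", "data.frame")

    print(x)   # prints the three rows; the "filtered" attribute is not shown
    ```

    With NextMethod(), the method should keep working if the object inherits from another data frame class such as a tibble, because dispatch simply moves on to whatever class comes next.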
    

    To get the indices of the filtered-out rows from the attributes, we can create a new generic function that will only work on our new class:

    get_filtered <- function(x) UseMethod("get_filtered")
    
    get_filtered.default <- function(x) {
      stop("'get_filtered' only works on filtered objects")
    }
    
    get_filtered.filtered <- function(x) {
      attr(x, "filtered")
    }
    

    So now, when we call return_filtered, it seems to work the same as dplyr::filter, returning what appears to be a normal data frame:

    df <- return_filtered(iris, Sepal.Length > 7)
    
    df
    #>    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    #> 1           7.1         3.0          5.9         2.1 virginica
    #> 2           7.6         3.0          6.6         2.1 virginica
    #> 3           7.3         2.9          6.3         1.8 virginica
    #> 4           7.2         3.6          6.1         2.5 virginica
    #> 5           7.7         3.8          6.7         2.2 virginica
    #> 6           7.7         2.6          6.9         2.3 virginica
    #> 7           7.7         2.8          6.7         2.0 virginica
    #> 8           7.2         3.2          6.0         1.8 virginica
    #> 9           7.2         3.0          5.8         1.6 virginica
    #> 10          7.4         2.8          6.1         1.9 virginica
    #> 11          7.9         3.8          6.4         2.0 virginica
    #> 12          7.7         3.0          6.1         2.3 virginica
    

    But we can get the indices of the filtered-out rows from it with our get_filtered function:

    get_filtered(df)
    #>   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
    #>  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
    #>  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
    #>  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
    #>  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
    #>  [91]  91  92  93  94  95  96  97  98  99 100 101 102 104 105 107 109 111 112
    #> [109] 113 114 115 116 117 120 121 122 124 125 127 128 129 133 134 135 137 138
    #> [127] 139 140 141 142 143 144 145 146 147 148 149 150
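
    Because the stored values are row indices into the original data frame, the omitted rows themselves can be recovered by subsetting the input. Here is a minimal base-R sketch. To keep it runnable on its own, it builds a stand-in "filtered" object by hand instead of calling return_filtered, and reads the attribute directly, which is all the getter does:

    ```r
    # Stand-in for the result of return_filtered(iris, Sepal.Length > 7)
    kept <- which(iris$Sepal.Length > 7)
    res  <- iris[kept, ]
    attr(res, "filtered") <- setdiff(seq_len(nrow(iris)), kept)
    class(res) <- c("filtered", "data.frame")

    # Same lookup that get_filtered.filtered() performs
    omitted_idx <- attr(res, "filtered")

    # Index back into the original data to recover the omitted rows
    omitted_rows <- iris[omitted_idx, ]

    nrow(omitted_rows)
    #> [1] 138
    ```

    This only works while the original data frame is still available and its row order is unchanged, since the attribute stores positions rather than the rows themselves.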
    

    And calling get_filtered on a non-filtered data frame throws an informative error:

    get_filtered(iris)
    #> Error in get_filtered.default(iris): 'get_filtered' only works on filtered objects
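
    The question also mentions alerting users that omitted rows exist. One possible sketch uses a hypothetical helper, notify_filtered (not part of the code above), that emits a message() rather than a warning(). Which condition type is appropriate is a judgment call, and the example object is again built by hand for illustration:

    ```r
    # Hypothetical helper: tell the user how many rows were omitted
    notify_filtered <- function(x) {
      n_omitted <- length(attr(x, "filtered"))
      if (n_omitted > 0) {
        message(n_omitted, " row(s) omitted; see get_filtered() for their indices")
      }
      invisible(x)   # return the input so the helper can sit inside a pipeline
    }

    # Hand-built stand-in for an object returned by return_filtered()
    res <- head(iris, 2)
    attr(res, "filtered") <- 3:150
    class(res) <- c("filtered", "data.frame")

    notify_filtered(res)
    #> 148 row(s) omitted; see get_filtered() for their indices
    ```

    A message can be silenced with suppressMessages() and does not read as a problem with the data, which may suit routine filtering better than a warning.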
    

    Created on 2022-04-02 by the reprex package (v2.0.1)