I'm writing a function where I want the main output to be a data frame (that can be piped to other functions), but I also want to allow users access to an informative list or vector of samples that were omitted from the final result. Are there best practices for how to go about this, or examples of functions/packages that do this well?
Currently I'm exploring returning the information as an attribute and throwing a warning that tells users they can access the list with attr(resulting-df, "omitted").
Any advice would be greatly appreciated, thank you!
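library(dplyr)

# Add a row index so that omitted rows can be identified after filtering
iris <- iris %>%
  mutate(index = row_number())

return_filtered <- function(df) {
  res <- filter(df, Sepal.Length > 6)
  # Compare against the input data frame rather than the global iris object
  omitted <- setdiff(df$index, res$index)
  attr(res, "omitted") <- omitted
  return(res)
}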
iris2 <- return_filtered(iris)
attributes(iris2)
#> $names
#> [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#> [6] "index"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
#> [51] 51 52 53 54 55 56 57 58 59 60 61
#>
#> $omitted
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
#> [20] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
#> [39] 39 40 41 42 43 44 45 46 47 48 49 50 54 56 58 60 61 62 63
#> [58] 65 67 68 70 71 79 80 81 82 83 84 85 86 89 90 91 93 94 95
#> [77] 96 97 99 100 102 107 114 115 120 122 139 143 150
Created on 2022-04-02 by the reprex package (v2.0.1)
The question is probably a little opinion-based, but I don't think it's off-topic, since there are certainly neater and more formal ways to achieve what you want than your current method.
It's reasonable to hold the extra information as an attribute, but if you are going to do this then it is more idiomatic and extensible to create an S3 class. That lets you hide default printing of attributes, ensure your attributes are protected, and define a getter function so that users don't have to sift through multiple attributes to find the correct one.
First, we will tweak your function to work with any data frame, and allow it to take any predicate so that it works as expected with dplyr::filter. We also get the function to add to the returned object's class attribute, so that it returns a new S3 object which inherits from data.frame:
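return_filtered <- function(df, predicate) {
  # Capture the unevaluated predicate so it can be passed on to dplyr::filter
  predicate <- rlang::enquo(predicate)
  # Temporary row id; seq_len() also handles zero-row data frames
  df$`..id..` <- seq_len(nrow(df))
  res <- dplyr::filter(df, !!predicate)
  # Row numbers of the input that did not survive the filter
  filtered <- setdiff(seq_len(nrow(df)), res$`..id..`)
  res$`..id..` <- NULL
  attr(res, "filtered") <- filtered
  # Prepend the new class so the result still inherits from data.frame
  class(res) <- c("filtered", class(df))
  return(res)
}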
We will define a print method so that the attributes don't show when we print our object:
print.filtered <- function(x, ...) {
  # Print a copy without the "filtered" class so the default method is used
  y <- x
  class(y) <- class(y)[class(y) != "filtered"]
  print(y, ...)
  invisible(x)
}
To get the filtered-out data from the attributes, we can create a new generic function that will only work on our new class:
get_filtered <- function(x) UseMethod("get_filtered")

get_filtered.default <- function(x) {
  stop("'get_filtered' only works on filtered objects")
}

get_filtered.filtered <- function(x) {
  attr(x, "filtered")
}
So now, when we call return_filtered, it seems to work the same as dplyr::filter, returning what appears to be a normal data frame:
df <- return_filtered(iris, Sepal.Length > 7)
df
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 7.1 3.0 5.9 2.1 virginica
#> 2 7.6 3.0 6.6 2.1 virginica
#> 3 7.3 2.9 6.3 1.8 virginica
#> 4 7.2 3.6 6.1 2.5 virginica
#> 5 7.7 3.8 6.7 2.2 virginica
#> 6 7.7 2.6 6.9 2.3 virginica
#> 7 7.7 2.8 6.7 2.0 virginica
#> 8 7.2 3.2 6.0 1.8 virginica
#> 9 7.2 3.0 5.8 1.6 virginica
#> 10 7.4 2.8 6.1 1.9 virginica
#> 11 7.9 3.8 6.4 2.0 virginica
#> 12 7.7 3.0 6.1 2.3 virginica
But we can get the filtered-out data from it with our get_filtered function:
get_filtered(df)
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
#> [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
#> [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
#> [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
#> [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
#> [91] 91 92 93 94 95 96 97 98 99 100 101 102 104 105 107 109 111 112
#> [109] 113 114 115 116 117 120 121 122 124 125 127 128 129 133 134 135 137 138
#> [127] 139 140 141 142 143 144 145 146 147 148 149 150
And calling get_filtered on a non-filtered data frame returns an informative error:
get_filtered(iris)
#> Error in get_filtered.default(iris): 'get_filtered' only works on filtered objects
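If you also want the warning mentioned in the question, one way to combine it with the S3 class is sketched below. This is only a rough illustration in base R, not the function defined above: filter_with_omitted is a hypothetical name, and it assumes the caller passes a logical vector (keep) rather than a dplyr-style predicate. The get_filtered generic is repeated so the sketch runs on its own.

```r
# Hypothetical base-R variant: `keep` is a logical vector, one value per row.
filter_with_omitted <- function(df, keep) {
  stopifnot(is.data.frame(df), is.logical(keep), length(keep) == nrow(df))
  # Treat NA as "drop", which mirrors how dplyr::filter handles NA
  keep <- !is.na(keep) & keep
  res <- df[keep, , drop = FALSE]
  omitted <- which(!keep)
  attr(res, "filtered") <- omitted
  class(res) <- c("filtered", class(df))
  # Only warn when something was actually omitted
  if (length(omitted) > 0) {
    warning(length(omitted),
            " row(s) omitted; retrieve them with get_filtered()",
            call. = FALSE)
  }
  res
}

# Getter repeated here so the sketch is self-contained
get_filtered <- function(x) UseMethod("get_filtered")
get_filtered.filtered <- function(x) attr(x, "filtered")

# The warning is suppressed here only to keep the example output quiet
out <- suppressWarnings(filter_with_omitted(iris, iris$Sepal.Length > 7))
inherits(out, "data.frame")
head(get_filtered(out))
```

Whether a warning is appropriate is a design choice: it is hard to miss, but it can get noisy inside a long pipeline, so a message() or an opt-in argument may suit some users better.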