Does there exist any R function or packages that records the operations applied to a tibble/data frame?
For example, if I did the following
data(iris)
my_table <- iris %>% filter(Sepal.Length>6) %>% filter(Species == 'virginica')
I would want the output to be something of the form
display_filter_function(my_table)
output:
Step filter
1 sepal.length > 6
2 Species == 'virginica'
I am thinking that this would be something similar to the functionality provided by the recipes package, but not needing to use the step_ family of function
I've written a little module for you. It is a standalone resource and has only one dependency beyond base
R: namely dplyr
itself. The module is long, so I have put it at the bottom of this post. You can find the code itself under the Module section, and its usage is demonstrated under the Usage section.
This model could theoretically be extended to all dplyr
functions, and to other generic functions as well. To keep things manageable, I myself have implemented it for dplyr::filter()
alone.
This module leverages the R concept of generic methods, like print()
and format()
and mean()
and summary()
. Suppose you wish to print()
a data.frame
object. The generic print()
function...
print
#> function (x, ...)
#> UseMethod("print")
#> <bytecode: 0x000002429186e2c8>
#> <environment: namespace:base>
...does not do the work itself! Rather, it dispatches to some print.*()
method, via the line:
UseMethod("print")
Now the native data.frame
class has its own special print()
method called print.data.frame()
.
print.data.frame
#> function (x, ..., digits = NULL, quote = FALSE, right = TRUE, row.names = TRUE, max = NULL)
#> {
#> n <- length(row.names(x))
#> ⋮
#> invisible(x)
#> }
#> <bytecode: 0x000002429186b7e0>
#> <environment: namespace:base>
So when UseMethod()
seeks a matching ("print"
) method, it finds print.data.frame()
ready and waiting! It is the print.data.frame()
function that actually handles the printing for the data.frame
.
More generally, a generic function like fn()
...
fn <- function(x, ...) {
UseMethod("fn")
}
can be implemented for a (S3) class like cls
, with a function of the form fn.cls()
:
fn.cls <- function(x, arg_1, arg_2, arg_3, ...) {
# ...
}
The fn.default()
method handles fn()
for unimplemented classes. So in the absence of a print.cls()
function, then UseMethod()
would dispatch a cls
object to print.default()
:
print.default
#> function (x, digits = NULL, quote = TRUE, na.print = NULL, print.gap = NULL, right = FALSE, max = NULL, width = NULL, useSource = TRUE, ...)
#> {
#> args <- pairlist(digits = digits, quote = quote, na.print = na.print, ...
#> ⋮
#> .Internal(print.default(x, args, missings))
#> }
#> <bytecode: 0x0000024291917b80>
#> <environment: namespace:base>
By defining a custom S3 class called hst_obj
— "historical object" — I override the "generic" behavior of dplyr::filter()
...
dplyr::filter
#> function (.data, ..., .preserve = FALSE)
#> {
#> UseMethod("filter")
#> }
#> <bytecode: 0x0000024292d10b40>
#> <environment: namespace:dplyr>
...which is designed to dispatch via UseMethod("filter")
. To that end, I implement the function filter.hst_obj()
:
filter.hst_obj
#> function (.data, ..., .preserve = FALSE)
#> {
#> .update_hst(x = `class<-`(dplyr::filter(.data = un_hst_obj(.data, ...
#> }
#> <bytecode: 0x000002428f842958>
When you call dplyr::filter()
on a hst_obj
object, then filter.hst_obj()
jumps into action! Whenever it filters the object, it also records the filtration criteria in the special attribute obj_hst
, which maintains the "object history".
This history is a tibble
...
# A tibble: m × 4
step order expr text
<int> <int> <list> <chr>
1 1 1 <language> sepal.length > 6
⋮ ⋮ ⋮ ⋮ ⋮
...which has four columns:
step
: The filter()
step in the workflow.iris %>% # step
filter(Sepal.Length > 6) %>% # } 1
filter(Species == 'virginica') %>% # } 2
... # ⋮
order
: The criterion within the filter()
step. filter(a < 10, b == 3 | c > 5, ...)
# |----| |------------|
# order: 1 2 ...
expr
: The actual code (language
) for the criterion (Sepal.Length > 6
), useful for programmatic manipulation of R.text
: A textual (character
) representation of that code ("Sepal.Length > 6"
), for visual clarity.You'll want to load dplyr
itself, and then source()
the module (mod.R
) from (say) your working directory.
# Load the `dplyr` package...
library(dplyr)
# ...along with the `hst_obj` functions from the module:
source("./mod.R")
The modular function filter.hst_obj()
must be loaded into the same workspace where you use dplyr::filter()
. Per the documentation
UseMethod
...search[es] for methods in two places: in the environment in which the generic function is called, and in the registration data base for the environment in which the generic is defined (typically a namespace). So methods for a generic function need to be available in the environment of the call to the generic, or they must be registered.
Here is a simple workflow on the iris
dataset.
iris %>%
filter(Sepal.Length > 7, Sepal.Width <= 3) %>%
filter(Petal.Width > 2)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 7.1 3.0 5.9 2.1 virginica
#> 2 7.6 3.0 6.6 2.1 virginica
#> 3 7.7 2.6 6.9 2.3 virginica
#> 4 7.7 3.0 6.1 2.3 virginica
Now we transform the dataset into a "historical object" called iris_hst
, via as_hst_obj()
.
iris_hst <- as_hst_obj(iris)
Per is_hst_obj()
, it is indeed a historical object.
iris_hst %>% is_hst_obj()
#> TRUE
However, its history via get_hst()
is still blank.
iris_hst %>% get_hst()
#> # A tibble: 0 × 4
#> # … with 4 variables: step <int>, order <int>, expr <list>, text <chr>
We now perform the same workflow on the historical dataset iris_hst
...
iris_hst <- iris_hst %>%
filter(Sepal.Length > 7, Sepal.Width <= 3) %>%
filter(Petal.Width > 2)
...which yields a consistent output.
iris_hst
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 7.1 3.0 5.9 2.1 virginica
#> 2 7.6 3.0 6.6 2.1 virginica
#> 3 7.7 2.6 6.9 2.3 virginica
#> 4 7.7 3.0 6.1 2.3 virginica
Crucially, we can now access the history via get_hst()
:
iris_hst %>% get_hst()
#> # A tibble: 3 × 4
#> step order expr text
#> <int> <int> <list> <chr>
#> 1 1 1 <language> Sepal.Length > 7
#> 2 1 2 <language> Sepal.Width <= 3
#> 3 2 1 <language> Petal.Width > 2
We can also "reset" the history via reset_hst()
, which clears the tibble
of historical data.
iris_hst <- iris_hst %>% reset_hst()
iris_hst %>% get_hst()
#> # A tibble: 0 × 4
#> # … with 4 variables: step <int>, order <int>, expr <list>, text <chr>
Finally, we can revert to an "unhistorical" object via un_hst_obj()
, which removes the hst_obj
classification and deletes the obj_hst
attribute:
iris_unhst <- iris_hst %>% un_hst_obj()
# It is no longer a "historical" object...
iris_unhst %>% is_hst_obj()
#> FALSE
# ...and the history is nonexistent (not merely blank) entirely.
iris_unhst %>% get_hst()
#>
Here is the module. I recommend saving it locally, as (say) mod.R
in (say) your working directory. I also recommend the box
package, which can load such modules painlessly via box::use
(./mod)
.
#########
## API ##
#########
# Test if an object is "historical object" whose filtrations are recorded.
is_hst_obj <- function(x) {
inherits(x, .HST_OBJ_CLASS)
}
# Treat an object as "historical".
as_hst_obj <- function(x) {
if (!is_hst_obj(x)) {
class(x) <- c(.HST_OBJ_CLASS, class(x))
}
x
}
# Erase the "historicity" of an object.
un_hst_obj <- function(x, hst = TRUE) {
if (is_hst_obj(x)) {
org_class <- class(x)
class(x) <- org_class[org_class != .HST_OBJ_CLASS]
if (isTRUE(hst)) {
x <- .set_hst(x, hst = NULL)
}
}
x
}
# Get the history from a historical object.
get_hst <- function(x) {
hst <- attr(x, .OBJ_HST_ATTR)
if (is.null(hst)) {
if (is_hst_obj(x)) {
.BLANK_OBJ_HST
# NULL
} else {
invisible(NULL)
}
} else {
hst
}
}
# Reset the history for a historical object.
reset_hst <- function(x) {
if (is_hst_obj(x)) {
x <- .set_hst(x, hst = NULL)
}
x
}
##############
## Dispatch ##
##############
# Dispatch filtration for historical objects.
filter.hst_obj <- evalq(envir = new.env(), {
# Define the filtration function: `dplyr::filter()`
fn_expr <- quote(dplyr::filter)
# ^^^^^^^^^^^^^
# UPDATE HERE
fn <- eval(fn_expr)
# Replicate in our result the signature of that original function.
arg_syms <- as.list(args(fn))
arg_syms <- utils::head(arg_syms, n = -1)
arg_syms <- sapply(names(arg_syms), as.symbol, USE.NAMES = TRUE)
# Prepare the elements for the function body...
obj_sym <- arg_syms[[1]] # The (1st) argument (.data) for the object...
cnd_exprs <- arg_syms$... # ...and dots (...) for filtration condition(s).
# ...including a similar call to the filter with an "ahistorical" object...
arg_syms[[as.character(obj_sym)]] <- substitute(un_hst_obj(
obj_sym,
hst = FALSE
))
fn_call <- as.call(c(list(fn_expr), arg_syms))
sub_list <- list(
obj = obj_sym,
cnd = cnd_exprs,
cll = fn_call
)
# ...and assemble those elements.
fn_body <- substitute(env = sub_list, quote({
.update_hst(
# Perform the unclassed call and then restore any "historicity"...
x = `class<-`(cll, class(obj)),
# ...and then update the history with the filtration criteria.
exprs = match.call(expand.dots = FALSE)$cnd
)
}))
# Pair this body with the header from `dplyr::filter()`...
fn_body <- eval(fn_body)
body(fn) <- fn_body
# ...and transfer the resulting function to the calling environment.
environment(fn) <- parent.frame(n = 2)
# Return the resulting function.
fn
})
#############
## Support ##
#############
# Labels for the object class...
.HST_OBJ_CLASS <- "hst_obj"
# ...and its history attribute.
.OBJ_HST_ATTR <- "obj_hst"
# The default history for an object.
.BLANK_OBJ_HST <- dplyr::tibble(
step = integer(),
order = integer(),
expr = list(),
text = character()
)
# Set the history for a historical object.
.set_hst <- function(x, hst) {
attr(x, .OBJ_HST_ATTR) <- hst
x
}
# Update the history with a list of filtration expressions.
.update_hst <- function(x, exprs) {
# Augment the history of a "historical" object.
if (is_hst_obj(x)) {
# Get the current history.
hst <- get_hst(x)
# # ...and default if the history is missing.
# if (is.null(hst)) {
# hst <- .BLANK_OBJ_HST
# }
# Augment the history: format the new additions...
next_cnd <- exprs
# next_cnd <- sapply(next_cnd, as.expression, simplify = FALSE)
next_txt <- sapply(next_cnd, deparse, simplify = TRUE)
next_ord <- seq_along(next_cnd)
if (length(exprs) == 0) {
next_stp <- integer()
} else if (nrow(hst) == 0) {
next_stp <- 1
} else {
next_stp <- max(hst$step) + 1
}
next_hst <- dplyr::tibble(
step = as.integer(next_stp),
order = as.integer(next_ord),
expr = as.list(next_cnd),
text = as.character(next_txt)
)
# ...and append them to the existing history.
hst <- dplyr::bind_rows(hst, next_hst)
# Update the history.
x <- .set_hst(x, hst = hst)
}
# Return the updated object.
x
}