I have a custom S3 class which is for all intents and purposes a data.frame
that has an attribute which has a value for each data frame row. For example,
my_df <- data.frame("a" = c(1, 2, 3), "b" = c(4, 5, 6))
class(my_df) <- c("foo", "data.frame")
attributes(my_df)$"row_notes" <- list("good", "bad", "good")
I have defined a subsetting function for my custom class as follows:
`[.foo` <- function(x, i, j, ...) {
result <- NextMethod()
if (!missing(i)) {
attributes(result)$"row_notes" <- attributes(x)$"row_notes"[i]
} else {
attributes(result)$"row_notes" <- attributes(x)$"row_notes"
}
return(result)
}
This seems to work just fine so far. If I do something like my_df[1:2, 2]
, the "row_notes"
attribute will get subsetted down to just c("good", "bad")
.
I would like to be able to make use of dplyr::select()
and functions like dplyr::everything()
and dplyr::all_of()
on my custom data frame, but I can't seem to use these functions without destroying the "row_notes"
attribute.
Intriguingly, the result of:
attributes(dplyr::select(my_df, dplyr::everything()))$"row_notes"
Is a two element list. I tried updating my subsetting function to debug what may be happening:
`[.foo` <- function(x, i, j, ...) {
if (!missing(i)) {
cat("i:", i, "\n")
}
if (!missing(j)) {
cat("j:", j, "\n")
}
cat("-------\n")
result <- NextMethod()
if (!missing(i)) {
attributes(result)$"row_notes" <- attributes(x)$"row_notes"[i]
} else {
attributes(result)$"row_notes" <- attributes(x)$"row_notes"
}
return(result)
}
And it turns out that for whatever reason, calling dplyr::select(df, dplyr::everything())
will at some point call [.foo
with the an i
value equal to the number of columns of the data frame, not the number of rows.
Is there a way to work around this that doesn't require telling users just not to try using dplyr::select
on this essentially data frame-like object?
To be clear, I'd simply like to have a subsetting function where whatever subsetting happens to the rows of the data frame is correspondingly applied to the "row_notes"
attribute. I will also add that this is a toy example, if it were not I would probably just keep "row_notes"
as another column in the data frame - I cannot do this in my actual work as "row_notes"
represents a variety of object types including a list of matrices.
Any help would be greatly appreciated, thank you!
The main issue is that you need the [.foo
data frame method to handle 1d subsetting (e.g. my_df[1]
and my_df["a"]
). You can see that your method currently fails in these cases and this is also why dplyr::select()
fails:
library(dplyr)
my_df <- data.frame("a" = c(1, 2, 3), "b" = c(4, 5, 6))
class(my_df) <- c("foo", "data.frame")
# changed below to make unique and to a vector for prettier printing
attributes(my_df)$"row_notes" <- c("good", "bad", "worst")
my_df[1] |> attr("row_notes")
# [1] "good"
my_df["a"] |> attr("row_notes")
# NULL
It will work if it's rewritten to check the number of arguments and that i
is not missing:
`[.foo` <- function(x, i, j, ...) {
result <- NextMethod()
if (nargs() == 2 && !missing(i)) {
attr(result, "row_notes") <- attr(x, "row_notes")
} else {
attr(result, "row_notes") <- attr(x, "row_notes")[i]
}
return(result)
}
Now it should work for base subsetting and with dplyr::select()
my_df[1] |> attr("row_notes")
# [1] "good" "bad" "worst"
my_df["a"] |> attr("row_notes")
# [1] "good" "bad" "worst"
my_df[1:2, ] |> attr("row_notes")
# [1] "good" "bad"
my_df |> select(a) |> attr("row_notes")
# [1] "good" "bad" "worst"
However, because many dplyr functions strip off custom classes and attributes, if you want to use other dplyr verbs you need to provide methods for them. See help("dplyr_extending")
. Providing a foo method for dplyr_row_slice
will get arrange()
, filter()
, slice()
(and the rest of the slice_*()
family), semi_join()
, and anti_join()
working. You also need to provide methods for group_by()
and ungroup()
if you want to use these.
restore_foo <- function(data) {
class(data) <- union("foo", class(data))
data
}
dplyr_row_slice.foo <- function(data, i, ...) {
result <- NextMethod()
attr(result, "row_notes") <- attr(data, "row_notes")[i]
restore_foo(result)
}
group_by.foo <- function(.data, ...) {
restore_foo(NextMethod())
}
ungroup.foo <- function(x, ...) {
result <- NextMethod()
attributes(result) <- attributes(x)
restore_foo(result)
}
Testing:
my_df |> slice(c(1, 3)) |> attr("row_notes")
# [1] "good" "worst"
my_df |> mutate(grp = a == 2) |> filter(row_number() == 1, .by = grp) |> attr("row_notes")
# [1] "good" "bad"
my_df |> group_by(a == 2) |> filter(row_number() == 1) |> ungroup() |> attr("row_notes")
# [1] "good" "bad"
my_df |> arrange(desc(a)) |> attr("row_notes")
# [1] "worst" "bad" "good"
my_df |> group_split(a == 2) |> lapply(attr, "row_notes")
# [[1]]
# [1] "good" "worst"
#
# [[2]]
# [1] "bad"
If you need more dplyr functions to work with your class you may need to provide further methods.