rdataframeoopdplyr

Subsetting attributes of custom S3 class during dplyr::select


I have a custom S3 class which is for all intents and purposes a data.frame that has an attribute which has a value for each data frame row. For example,

my_df <- data.frame("a" = c(1, 2, 3), "b" = c(4, 5, 6))

class(my_df) <- c("foo", "data.frame")

attributes(my_df)$"row_notes" <- list("good", "bad", "good")

I have defined a subsetting function for my custom class as follows:

`[.foo` <- function(x, i, j, ...) {
    result <- NextMethod()
    if (!missing(i)) {
        attributes(result)$"row_notes" <- attributes(x)$"row_notes"[i]
    } else {
        attributes(result)$"row_notes" <- attributes(x)$"row_notes"
    }
    return(result)
}

This seems to work just fine so far. If I do something like my_df[1:2, 2], the "row_notes" attribute will get subsetted down to just c("good", "bad").

I would like to be able to make use of dplyr::select() and functions like dplyr::everything() and dplyr::all_of() on my custom data frame, but I can't seem to use these functions without destroying the "row_notes" attribute.

Intriguingly, the result of:

attributes(dplyr::select(my_df, dplyr::everything()))$"row_notes"

Is a two element list. I tried updating my subsetting function to debug what may be happening:

`[.foo` <- function(x, i, j, ...) {
    if (!missing(i)) {
        cat("i:", i, "\n")
    }
    if (!missing(j)) {
        cat("j:", j, "\n")
    }
    cat("-------\n")
    result <- NextMethod()
    if (!missing(i)) {
        attributes(result)$"row_notes" <- attributes(x)$"row_notes"[i]
    } else {
        attributes(result)$"row_notes" <- attributes(x)$"row_notes"
    }
    return(result)
}

And it turns out that for whatever reason, calling dplyr::select(df, dplyr::everything()) will at some point call [.foo with the an i value equal to the number of columns of the data frame, not the number of rows.

Is there a way to work around this that doesn't require telling users just not to try using dplyr::select on this essentially data frame-like object?

To be clear, I'd simply like to have a subsetting function where whatever subsetting happens to the rows of the data frame is correspondingly applied to the "row_notes" attribute. I will also add that this is a toy example, if it were not I would probably just keep "row_notes" as another column in the data frame - I cannot do this in my actual work as "row_notes" represents a variety of object types including a list of matrices.

Any help would be greatly appreciated, thank you!


Solution

  • The main issue is that you need the [.foo data frame method to handle 1d subsetting (e.g. my_df[1] and my_df["a"]). You can see that your method currently fails in these cases and this is also why dplyr::select() fails:

    library(dplyr)
    
    my_df <- data.frame("a" = c(1, 2, 3), "b" = c(4, 5, 6))
    class(my_df) <- c("foo", "data.frame")
    # changed below to make unique and to a vector for prettier printing
    attributes(my_df)$"row_notes" <- c("good", "bad", "worst") 
    
    my_df[1] |> attr("row_notes")
    # [1] "good"
    my_df["a"] |> attr("row_notes")
    # NULL
    

    It will work if it's rewritten to check the number of arguments and that i is not missing:

    `[.foo` <- function(x, i, j, ...) {
      result <- NextMethod()
      if (nargs() == 2 && !missing(i)) {
        attr(result, "row_notes") <- attr(x, "row_notes")
      }  else {
        attr(result, "row_notes") <- attr(x, "row_notes")[i]
      } 
      return(result)
    }
    

    Now it should work for base subsetting and with dplyr::select()

    my_df[1] |> attr("row_notes")
    # [1] "good"  "bad"   "worst"
    my_df["a"] |> attr("row_notes")
    # [1] "good"  "bad"   "worst"
    my_df[1:2, ] |> attr("row_notes")
    # [1] "good" "bad" 
    my_df |> select(a) |> attr("row_notes")
    # [1] "good"  "bad"   "worst"
    

    However, because many dplyr functions strip off custom classes and attributes, if you want to use other dplyr verbs you need to provide methods for them. See help("dplyr_extending"). Providing a foo method for dplyr_row_slice will get arrange(), filter(), slice() (and the rest of the ⁠slice_*()⁠ family), semi_join(), and anti_join() working. You also need to provide methods for group_by() and ungroup() if you want to use these.

    restore_foo <- function(data) {
      class(data) <- union("foo", class(data))
      data
    }
    
    dplyr_row_slice.foo <- function(data, i, ...) {
      result <- NextMethod()
      attr(result, "row_notes") <- attr(data, "row_notes")[i]
      restore_foo(result)
    }
    
    group_by.foo <- function(.data, ...) {
      restore_foo(NextMethod())
    }
    
    ungroup.foo <- function(x, ...) {
      result <- NextMethod()
      attributes(result) <- attributes(x)
      restore_foo(result)
    }
    

    Testing:

    my_df |> slice(c(1, 3)) |> attr("row_notes")
    # [1] "good"  "worst"
    my_df |> mutate(grp = a == 2) |> filter(row_number() == 1, .by = grp) |> attr("row_notes")
    # [1] "good" "bad" 
    my_df |> group_by(a == 2) |> filter(row_number() == 1) |> ungroup() |> attr("row_notes")
    # [1] "good" "bad" 
    my_df |> arrange(desc(a)) |>  attr("row_notes")
    # [1] "worst" "bad"   "good" 
    my_df |> group_split(a == 2) |> lapply(attr, "row_notes")
    # [[1]]
    # [1] "good"  "worst"
    # 
    # [[2]]
    # [1] "bad"
    

    If you need more dplyr functions to work with your class you may need to provide further methods.