r list function recursion

Recursively Converting a Data Frame to a Nested List Where the Level of Nestedness of the List Equals the Number of Columns in the Data Frame


I have the following data frame.

Data_Frame <- structure(list(Factor_1 = c("AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA", "BB", "BB", "BB", "BB", "BB", "BB", "BB", "BB", "BB", "BB", "BB", "BB", "BB", "BB", "BB", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "CC", "DD", "DD", "DD", "DD", "DD", "DD", "DD", "DD", "DD", "DD", "DD", "DD", "DD", "DD", "DD"), Factor_2 = c("aa", "aa", "aa", "bb", "bb", "bb", "cc", "cc", "cc", "dd", "dd", "dd", "ee", "ee", "ee", "aa", "aa", "aa", "bb", "bb", "bb", "cc", "cc", "cc", "dd", "dd", "dd", "ee", "ee", "ee", "aa", "aa", "aa", "bb", "bb", "bb", "cc", "cc", "cc", "dd", "dd", "dd", "ee", "ee", "ee", "aa", "aa", "aa", "bb", "bb", "bb", "cc", "cc", "cc", "dd", "dd", "dd", "ee", "ee", "ee"), Factor_3 = c("xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz", "xxx", "yyy", "zzz")), class = "data.frame", row.names = c(NA, -60L))

I want to write a recursive function that will split this data frame into a nested list. The output should look like the following object.

Split_Data <- lapply(lapply(split(Data_Frame, Data_Frame[, 1]), function (x) {
  split(x, x[, 2])
}), function (x) {
  lapply(x, function (y) {
    split(y, y[, 3])
  })
})

In other words, the data frame should be split first by the values in the first column, then by the values in the second column, and so on until every column has been used, producing smaller and smaller data frames (the list gains one level of nesting with each split).
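To make the target structure concrete, here is a minimal sketch of the first two levels on a small, hypothetical data frame (`toy`, with made-up columns `g1` and `g2`; it is not the data above). It only illustrates the shape of the result, not a general solution.

```r
# Hypothetical 2-column toy data (not the Data_Frame above)
toy <- data.frame(
  g1 = c("A", "A", "B", "B"),
  g2 = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Level 1: split by the first column
level_1 <- split(toy, toy[[1]])

# Level 2: split each piece by the second column
level_2 <- lapply(level_1, function(piece) split(piece, piece[[2]]))

# Each leaf is reached with one name per column
level_2$A$y
#>   g1 g2
#> 2  A  y
```

Each additional column would add one more `lapply()` layer, which is why a recursive function is the natural fit.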

In this example, there are only three columns in the data frame, but in practice, there could be any number of columns, and so I'd like a recursive function to be able to handle any number of columns.

Base R solutions are preferred.

Thanks!


Solution

  • Here's a simple recursive function that splits a data frame sequentially by each column. Be warned that it will perform poorly as the number of columns and the number of distinct values within each column increase.

    recursive_split <- function(data, n = 1) {
      if (n > ncol(data)) return(data)
      lapply(split(data, data[[n]]), recursive_split, n + 1)
    }
    
    res <- recursive_split(Data_Frame)
    
    identical(res, Split_Data)
    #> [1] TRUE
    

    A more performant and flexible recursive split function is available in the collapse package:

    collapse::rsplit(Data_Frame, Data_Frame, keep.by = TRUE)
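If only some columns should drive the split, or if you want to confirm that the nesting depth matches the number of split columns, the base R approach can be adapted along these lines. This is only a sketch: `split_by_cols()`, `nesting_depth()`, and the `toy` data are hypothetical names introduced here for illustration and are not part of the answer above or of any package.

```r
# Hypothetical variant of recursive_split(): `cols` names the columns to
# split by, in order. Defaults to all columns.
split_by_cols <- function(data, cols = names(data)) {
  if (length(cols) == 0) return(data)
  lapply(split(data, data[[cols[1]]]), split_by_cols, cols = cols[-1])
}

# Hypothetical helper: count the list levels above the data-frame leaves.
nesting_depth <- function(x) {
  if (is.data.frame(x) || !is.list(x)) return(0L)
  if (length(x) == 0) return(1L)
  1L + max(vapply(x, nesting_depth, integer(1)))
}

# Made-up example data
toy <- data.frame(
  g1 = c("A", "A", "B", "B"),
  g2 = c("x", "y", "x", "y"),
  g3 = c("p", "p", "q", "q"),
  stringsAsFactors = FALSE
)

# Splitting by every column gives one level of nesting per column
nested <- split_by_cols(toy)
nesting_depth(nested) == ncol(toy)

# Splitting by a subset gives fewer levels; the leaves keep all columns
partial <- split_by_cols(toy, c("g1", "g2"))
nesting_depth(partial)
```

Passing column names rather than a position counter keeps the recursion independent of column order, at the cost of requiring the names to be unique.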