r, dplyr, data.table, tidyverse

How to select specific columns across multiple dataframes in R and then bind them into one data.frame?


I am trying to subset multiple data frames that have different numbers of columns. They all contain the same columns of interest, so I want to reduce each one to those columns and then append them into one data frame. I am trying to be as elegant and efficient as possible, but my code does not seem to be working. This is what I tried to do:

Suppose I have the following data frames:

df1 <- matrix(1:12, 3,4, dimnames = list(NULL, LETTERS[1:4]))
df2 <- matrix(12:26, 3, 5, dimnames = list(NULL, LETTERS[1:5]))

df1 <- as.data.frame(df1)
df2 <- as.data.frame(df2)

I tried to subset both data frames by creating a function and then using lapply. Suppose I only want to keep columns A, C, and D:

select_function <- function(x){
  dplyr::select(`A`,`C`,`D`)
}

list <- list(df1, df2)

df.list <- lapply(list, function(x) select_function)

I then tried to append the list into one data frame:

new.df <- do.call(rbind, df.list)

The code is not working. I think the line with lapply is incorrect, and I am not sure what is being generated in df.list. I hope I have communicated what I am trying to do. Please let me know of alternative ways to achieve this.


Solution

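  • You are not passing your data to your function. select_function never uses its argument x, and lapply(list, function(x) select_function) returns the function itself for each element instead of calling it, so df.list ends up as a list of two functions. The function should look like: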

    select_cols <- function(df) {
        df |>
            dplyr::select(A, C, D)
    }
    

    Then put the data frames in a list and apply the function to each element:

    l <- list(df1, df2)
    lapply(l, select_cols)
    # [[1]]
    #   A C  D
    # 1 1 7 10
    # 2 2 8 11
    # 3 3 9 12
    
    # [[2]]
    #    A  C  D
    # 1 12 18 21
    # 2 13 19 22
    # 3 14 20 23
    
    

    Or alternatively, in base R:

    cols <- c("A", "C", "D")
    lapply(l, \(df) df[cols])
    # [[1]]
    #   A C  D
    # 1 1 7 10
    # 2 2 8 11
    # 3 3 9 12
    
    # [[2]]
    #    A  C  D
    # 1 12 18 21
    # 2 13 19 22
    # 3 14 20 23
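
To finish the task from the question, the subsetted list still has to be bound into one data frame. Here is a minimal base R sketch, assuming the df1 and df2 from the question and that every data frame contains all of the columns in cols. The names df.list and new.df are taken from the question; dplyr::bind_rows(df.list) should give the same result here if you prefer the tidyverse.

```r
# Recreate the example data frames from the question
df1 <- as.data.frame(matrix(1:12, 3, 4, dimnames = list(NULL, LETTERS[1:4])))
df2 <- as.data.frame(matrix(12:26, 3, 5, dimnames = list(NULL, LETTERS[1:5])))

cols <- c("A", "C", "D")
l <- list(df1, df2)

# Keep only the columns of interest in each data frame
df.list <- lapply(l, function(df) df[cols])

# Stack the subsetted data frames row-wise into one data frame
new.df <- do.call(rbind, df.list)
new.df
#    A  C  D
# 1  1  7 10
# 2  2  8 11
# 3  3  9 12
# 4 12 18 21
# 5 13 19 22
# 6 14 20 23
```

rbind on data frames matches columns by name, so it only works once every element has the same set of columns, which is why the subsetting step comes first.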