rdataframefunctioncbind

How do I bind/combine selected columns from multiple dataframes into one


I have many data.frames (448) with the same columns names (9 in total), like this:

       V1              V2      V4    ... V9
ENSG00000000003.15   TSPAN6   7095
ENSG00000000005.6     TNMD    4355
       .                .       .
       .                .       .

i want to create another data.frame, maintaining the first 2 columns (V1 and V2), these are identical in every data.frame, and merge V4 column (these are different for every data.frame) from all data.frames, the rest of the columns should be excluded.

If it could be possible, I would like to rename the V4 columns to "sample1", "sample2" and so on, until 448.

So the final dataframe should be:

       V1              V2     V4_1    V4_2 ... V4_448
ENSG00000000003.15   TSPAN6   7095    3856       .
ENSG00000000005.6     TNMD    4355    2976       .
       .                .       .      .         . 
       .                .       .      .         .

I already did this:

reader <- function(f){
  read.table(f, sep='\t', skip=6, header=FALSE)
}

files <- list.files(path, 
                    recursive=TRUE, full.names=TRUE)

myfilelist <- lapply(files, reader)

But I don't know how to combine only these selected columns

This is the output of dput(lapply(myfilelist[1:2], head)):

myfilelist <- list(structure(list(V1 = c("ENSG00000000003.15", "ENSG00000000005.6", 
"ENSG00000000419.13", "ENSG00000000457.14", "ENSG00000000460.17", 
"ENSG00000000938.13"), V2 = c("TSPAN6", "TNMD", "DPM1", "SCYL3", 
"C1orf112", "FGR"), V3 = c("protein_coding", "protein_coding", 
"protein_coding", "protein_coding", "protein_coding", "protein_coding"
), V4 = c(7094L, 2L, 4355L, 1149L, 372L, 585L), V5 = c(3573L, 
1L, 2201L, 953L, 553L, 281L), V6 = c(3521L, 1L, 2154L, 883L, 
579L, 308L), V7 = c(59.9764, 0.052, 138.3704, 6.4018, 2.3896, 
6.6335), V8 = c(20.5827, 0.0178, 47.4859, 2.197, 0.8201, 2.2765
), V9 = c(22.2037, 0.0192, 51.2256, 2.37, 0.8847, 2.4558)), row.names = c(NA, 
6L), class = "data.frame"), structure(list(V1 = c("ENSG00000000003.15", 
"ENSG00000000005.6", "ENSG00000000419.13", "ENSG00000000457.14", 
"ENSG00000000460.17", "ENSG00000000938.13"), V2 = c("TSPAN6", 
"TNMD", "DPM1", "SCYL3", "C1orf112", "FGR"), V3 = c("protein_coding", 
"protein_coding", "protein_coding", "protein_coding", "protein_coding", 
"protein_coding"), V4 = c(2616L, 23L, 3746L, 1288L, 510L, 1578L
), V5 = c(1369L, 9L, 1876L, 1015L, 681L, 797L), V6 = c(1250L, 
14L, 1871L, 984L, 693L, 782L), V7 = c(16.8063, 0.4541, 90.4417, 
5.4531, 2.4895, 13.5969), V8 = c(4.8615, 0.1314, 26.1617, 1.5774, 
0.7201, 3.9331), V9 = c(6.0158, 0.1625, 32.3733, 1.9519, 0.8911, 
4.867)), row.names = c(NA, 6L), class = "data.frame"))

Solution

  • It's not clear to me how you want to join this list of data frames (are the first two columns identical?), but here is an option using left-join:

    library(dplyr)
    library(purrr)
    
    imap(myfilelist, \(df, i) select(df, 1:2, "sample{i}" := 4) ) |>
      reduce(left_join, by = join_by(V1, V2))
    

    If the first two columns are identical across all data frames then you could bind them together:

    library(dplyr)
    library(purrr)
    
    bind_cols(myfilelist[[1]][1:2],
              imap(myfilelist, \(df, i) select(df, "sample{i}" := 4)) |> bind_cols())