
Filling two vectors from the list.files() function


This may be a naive question. I downloaded a bunch of assemblies for different organisms, stored in the following structure:

-Parent_folder
--Genus_species_1
---genome_filename_1
---genome_filename_2
---genome_filename_n
--Genus_species_2
---genome_filename_1
---genome_filename_2
---genome_filename_n
--Genus_species_N
---genome_filename_1
---genome_filename_2
---genome_filename_n

I would like to make a table with one column holding the species name and a second column holding the filename of the assembly, something like this:

    column1     |     column2
Genus_species_1 | genome_filename_1
Genus_species_1 | genome_filename_2
Genus_species_1 | genome_filename_n
Genus_species_2 | genome_filename_1
Genus_species_2 | genome_filename_2
Genus_species_2 | genome_filename_n
Genus_species_N | genome_filename_1
Genus_species_N | genome_filename_2
Genus_species_N | genome_filename_n

I tried many things, and I don't know what is wrong with this code:

#listing the folders containing different number of genomes;
folder_list<- list.dirs(".", full.names = FALSE)

#Remove the parent folder;
folder_list<- folder_list[-1]

#Creating two vectors to populate with the genome filename and another with the species name(same as folder name);
genomes<-NULL
species<- NULL

#Generate a loop to populate;
for (dir in 1:length(folder_list)){
  files<- as.vector(list.files(file.path(WD, dir))) #Vector containing all the genome filenames
  genomes<- c(genomes, files) #add the one before to the genomes vector

  #next, create a vector with the number of the folder(which is the species) and repeat it as much as the number of genomes;
  directories<-rep(dir, length(list.files(file.path(WD, dir))))
  species<- append(species, directories) #add it to species vector

} #end of the loop

Hope someone can help!

Thank you in advance!


Solution

  • I recreated the directory structure and then created a data.frame where column 1 lists all child directories and column 2 lists the files in each child directory.

    First, here's the code written to replicate the directory structure.

    # Create parent folder
    dir.create("Parent_folder")
    
    # Create child directories and files in one go
    sample_dirnames <- seq_len(3)
    sapply(sample_dirnames, function(index){
    
        # build dir path and create
        dirname <- paste0("Genus_species_", sample_dirnames[index])
        dir.create(paste0("Parent_folder/",dirname))
    
        # generate three files in the current directory
        sapply(seq_len(3), function(n) {
            file.create(paste0("Parent_folder/", dirname, "/genome_filename_", n, ".R"))      
        })
    })
    

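    Using the list.files function as a starting point, set recursive = TRUE so that it returns each file's path relative to the parent folder (for example Genus_species_1/genome_filename_1.R). Then pipe the output into a data.frame and split each path at the first "/" into the directory and the filename. This example uses functions from the dplyr package (loaded here via the tidyverse) and base functions (substring, regexpr). regexpr returns the position of the first "/" in each path, so the split also works when the folder names differ in length.

    # pkg
    library(tidyverse)
    
    # build object
    files <- list.files("Parent_folder/", recursive = TRUE) %>%
        as.data.frame(.) %>%
        rename(., "path" = .) %>%
        mutate(
            # everything before the first "/" is the species folder
            column1 = substring(
                text = path,
                first = 1,
                last = as.numeric(
                    regexpr(
                        pattern = "/",
                        text = path
                    )
                ) - 1
            ),
            # everything after the first "/" is the filename
            column2 = substring(
                text = path,
                first = as.numeric(
                    regexpr(
                        pattern = "/",
                        text = path
                    )
                ) + 1
            )
        ) %>%
        select(-path)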
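
The same split can also be sketched in base R alone. This is a minimal sketch, assuming every genome file sits exactly one level below the parent folder: dirname() then returns the species folder and basename() returns the filename. The temporary directory, the folder names, the .fa extension, and the object name files_base are all hypothetical and only serve to make the example self-contained.

```r
# Hypothetical demo tree in a temporary directory, so nothing in the
# working directory is touched. Folder names deliberately differ in length.
parent <- file.path(tempdir(), "Parent_folder_demo")
for (sp in c("Genus_species_1", "Longer_genus_species_2")) {
  dir.create(file.path(parent, sp), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(parent, sp, paste0("genome_filename_", 1:2, ".fa")))
}

# Relative paths such as "Genus_species_1/genome_filename_1.fa"
paths <- list.files(parent, recursive = TRUE)

files_base <- data.frame(
  column1 = dirname(paths),   # folder part of the path = species name
  column2 = basename(paths),  # file part of the path = assembly filename
  stringsAsFactors = FALSE
)
print(files_base)
```

With deeper nesting, dirname() would return the full relative folder path rather than only the species name, so this shortcut fits the one-level layout described in the question.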
    

    Printing files gives the following object:

    files
    #          column1             column2
    # 1 Genus_species_1 genome_filename_1.R
    # 2 Genus_species_1 genome_filename_2.R
    # 3 Genus_species_1 genome_filename_3.R
    # 4 Genus_species_2 genome_filename_1.R
    # 5 Genus_species_2 genome_filename_2.R
    # 6 Genus_species_2 genome_filename_3.R
    # 7 Genus_species_3 genome_filename_1.R
    # 8 Genus_species_3 genome_filename_2.R
    # 9 Genus_species_3 genome_filename_3.R
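
As for the loop in the question, two things seem to keep it from working: `dir` runs over the positions 1, 2, ... rather than over the folder names, so file.path(WD, dir) points at folders called "1", "2", and so on, and WD is never defined in the snippet shown. Below is a hedged sketch of a corrected loop. It builds a hypothetical demo tree in a temporary directory (folder names and the .fa extension are made up), and the variable names parent, folder, and result are not from the original post.

```r
# Hypothetical demo tree with the same layout as in the question
parent <- file.path(tempdir(), "Parent_folder_loop_demo")
for (sp in c("Genus_species_1", "Genus_species_2")) {
  dir.create(file.path(parent, sp), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(parent, sp, paste0("genome_filename_", 1:3, ".fa")))
}

# Immediate subfolders only; with recursive = FALSE the parent itself
# is not returned, so there is nothing to drop afterwards
folder_list <- list.dirs(parent, full.names = FALSE, recursive = FALSE)

genomes <- character(0)
species <- character(0)

# Iterate over the folder names themselves, not over their positions
for (folder in folder_list) {
  files <- list.files(file.path(parent, folder))      # genome filenames in this folder
  genomes <- c(genomes, files)                        # append the filenames
  species <- c(species, rep(folder, length(files)))   # repeat the species name once per file
}

result <- data.frame(column1 = species, column2 = genomes,
                     stringsAsFactors = FALSE)
print(result)
```

Growing vectors inside a loop is fine for a few hundred files; for much larger collections, a vectorised approach such as the list.files(recursive = TRUE) pipeline above avoids the repeated copying.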