Tags: r, fread, read.csv, fixed-format

Combine and transpose many fixed-format dataset files quickly


What I have: ~100 txt files, each with 9 columns and >100,000 rows.
What I want: a combined file with only 2 of the columns but all of the rows, which should then be transposed to give an output of >100,000 columns and 2 rows.

I've created the function below to go systematically through the files in a folder, pull out the data I want, and, after each file, join it to the original template.

Problem: This works fine on my small test files, but when I try it on large files I run into a memory allocation issue. My 8 GB of RAM just isn't enough, and I assume part of the problem is how I wrote my code.

My question: Is there a way to loop through the files and then join them all at once at the end, to save processing time?

Also, if this is the wrong place to put this kind of thing, what is a better forum to get input on work-in-progress code?

## Script to pull in genotype txt files, transpose them, delete commented rows
## and header rows, and then put the files together.

library(plyr)

## Define function
Process_Combine_Genotype_Files <- function(
        inputdirectory = "Rdocs/test", outputdirectory = "Rdocs/test", 
        template = "Rdocs/test/template.txt",
        filetype = ".txt", vars = ""
        ){

## List the files in the directory & put together their path
        filenames <- list.files(path = inputdirectory, pattern = "*.txt")
        path <- paste(inputdirectory,filenames, sep="/")


        combined_data <- read.table(template,header=TRUE, sep="\t")

## for-loop: for every file in directory, do the following
        for (file in path){

## Read genotype txt file as a data.frame
                currentfilename <- basename(file)   # file name without the path

                data <- read.table(file, header = TRUE, sep = "\t", fill = TRUE)

                #subset just the first two columns (Probe ID & Call Codes)
                #will need to modify this for Genotype calls....
                data.calls  <- data[,1:2]

                #Change column names & row names
                colnames(data.calls)  <- c("Probe.ID", currentfilename)
                row.names(data.calls) <- data[,1]


## Join file to previous data.frame
                combined_data <- join(combined_data,data.calls,type="full")


## End for loop
        }
## Transpose the combined data and write it out
        combined_transposed_data <- t(combined_data)
        print(combined_transposed_data[-1, -1])
        outputfile <- paste(outputdirectory, "Genotypes_combined.txt", sep = "/")
        write.table(combined_transposed_data[-1, -1], outputfile, sep = "\t")

## End function
}

Thanks in advance.


Solution

  • Try:

    filenames <- list.files(path = inputdirectory, pattern = "\\.txt$", full.names = TRUE)
    library(data.table)
    data_list <- lapply(filenames, fread, select = 1:2)  # keep only the columns you want (here the first two: Probe ID & call codes)
    

    Now you have a list of all your data. Assuming all the txt files have the same column structure, you can combine them via:

    data <- rbindlist(data_list)
    

    Transposing the data (a complete sketch putting these steps together follows after this answer):

    t(data)
    

    (Thanks to @Jakob H for select in fread)
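
Putting the pieces together, a minimal end-to-end sketch might look like the following. The function name process_genotypes and the example paths are illustrative, and it assumes (as in the question) that the files are tab-delimited and that the two columns to keep are the first two:

    library(data.table)

    ## Sketch: read only the needed columns from every file,
    ## stack them once, transpose, and write the result out.
    process_genotypes <- function(inputdirectory, outputfile) {
        filenames <- list.files(path = inputdirectory, pattern = "\\.txt$",
                                full.names = TRUE)

        ## fread() loads only the selected columns, which keeps memory usage low
        data_list <- lapply(filenames, fread, select = 1:2)

        ## one rbind at the end instead of growing a data frame inside a loop
        combined <- rbindlist(data_list)

        ## t() returns a matrix (character if the columns have mixed types),
        ## which is fine for writing out a 2-row result
        transposed <- t(combined)

        write.table(transposed, outputfile, sep = "\t",
                    col.names = FALSE, quote = FALSE)
        invisible(transposed)
    }

    ## Example call (paths are placeholders):
    ## process_genotypes("Rdocs/test", "Rdocs/test/Genotypes_combined.txt")

The key difference from the original function is that nothing is joined inside the loop: each file contributes one small two-column table to a list, and the only large objects are created once, at the end.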