rdata.tableread.csvread.tablerna-seq

Merging large TCGA .tsv files in a memory-efficient way on Posit Cloud (formerly RStudio)


I have 448 .tsv files that contain gene expression data (RNAseq) that were downloaded from The Cancer Genome Atlas Genomic Data Commons (GDC) Portal.

These files have 60666 rows and 9 columns. Then there is a metadata file that has case_id in the first column and file_name in the second column.

So what my merge_RNA function does is to take as input the metadata file and the directory where all the 448 .tsv files are. Then if the basename of the file matches the file_name, merge the files by the first three columns "gene_id", "gene_name", "gene_type" and the eighth column "fpkm_unstranded".

So the merged matrix should have 60661 rows and 451 columns whereby gene_id are the rows and the case_id are the columns (see attached images).

However, I have run into memory issues because the read.csv function reads the entire .tsv file into memory.

NOTE: I am using Posit Cloud (formerly RStudio) so I have limited computing power (only 16 GB RAM and 4 CPU).

# Define the function to merge all RNAseq quantification files into one dataframe

merge_rna <-function(metadata, fdir){
  filelist <- list.files(fdir, pattern="*.tsv$", 
                     recursive = TRUE, full.names=TRUE)
  for (i in 1:length(filelist)){
    iname <- basename(filelist[i])
    isamplename <- metadata[metadata$file_name==iname, "sample"]
    idf <- read.csv(filelist[i], sep="\t", skip=1, header=TRUE)
    # remove first 4 rows
    remove <- 1:4
    idf_subset <- idf[-remove, c("gene_id", "gene_name", "gene_type", "fpkm_unstranded")]
    rm(idf)
    names(idf_subset)[4] <- isamplename
    #print(dim(idf_subset))
    if (i==1){
      combined_df <- idf_subset
      rm(idf_subset)
    } else {
      combined_df <- merge(combined_df, idf_subset, by.x='gene_id', by.y="gene_id", all=TRUE)
      rm(idf_subset)
    }
  }

# use gene_id as row names and remove gene_id column
rownames(combined_df) <- combined_df$gene_id
combined_df <- combined_df[,-which(names(combined_df) %in% c("gene_id"))]
return(combined_df)
}

rnaCounts <-  merge_rna(GDC_metadata_file, "/home/r1816512/TSV_subset")

The code I have provided seems to work but it uses a lot of memory. Please tell me what can be done to use less memory. Perhaps we don't have to read the entire file into memory?

I tried ChatGPT but despite all my best efforts at fine-tuning it to re-write my code and give me a version that uses less memory, all the code it gave me produced errors. That meme about ChatGPT producing code in less than 5 minutes but debugging that code takes 24 hours is indeed true.

ChatGPT recommended that one way to reduce memory usage is to read in the TSV files in chunks rather than all at once. This can be done using the readr package, specifically the read_tsv_chunked() function.

ChatGPT also recommended that an alternative way to use less memory is to process the data in chunks instead of reading the entire file into memory using the data.table package and processes the data in chunks.

However, none of the code it gave me that use both the readr and the data.table functions worked.

Here are the images of the metadata file and how the merged expression matrix should look like to help with answering the question.

NOTE: There was a tool called The GDC RNASeq Tool which was used to perform this exact task, i.e., download individual RNASeq files from the GDC Data Portal using a manifest file, merge into a single matrix identified by TCGA barcode/Case ID but it was DEPRECATED when the transcriptome profiling/gene expression quantification bioinformatic workflow changed from using HTSeq to STAR. https://github.com/cpreid2/gdc-rnaseq-tool https://gdc.cancer.gov/content/gdc-rnaseq-tool

It would be nice if someone knowledgeable enough could develop a new tool for this task to replace the gdc-rnaseq-tool.

GDC_metadata_file merged_fpkm_unstranded_expression_matrix


Solution

  • Use data.table::fread() to pull in each tsv, selecting only the columns you need, use rbindlist(), merge once back onto the metaMatrix to get the sample id, and then (optionally) dcast

    # get vector of file names (full names of tsv files)
    files <- list.files("../gdc", pattern="*.tsv", full.names = T, recursive = T)
    # create vector of target columns
    targ_cols <- c("gene_id", "gene_name", "gene_type","fpkm_unstranded")
    
    # read each file, selecting the columns of interest
    result <- lapply(files, \(f) fread(f,header=F, skip=6, select = c(1,2,3,8),col.names = targ_cols))
    
    # set the names of this list
    result <- setNames(result, basename(files))
    
    # rowbind these all into one big data table
    result <- rbindlist(result, idcol  = "file_name")
    
    # merge the meta matrix with result, and drop the file_name column
    result <- result[metaMatrix, on="file_name"][,file_name:=NULL]
    
    # swing the result wide, using dcast
    result <- dcast(result, gene_id+gene_name+gene_type~caseid, value.var = "fpkm_unstranded")
    

    Output:

    Key: <gene_id, gene_name, gene_type>
                      gene_id  gene_name      gene_type      S1     S10      S2      S3      S4      S5      S6      S7      S8      S9
                       <char>     <char>         <char>   <num>   <num>   <num>   <num>   <num>   <num>   <num>   <num>   <num>   <num>
        1: ENSG00000000003.15     TSPAN6 protein_coding  2.3871 22.5869  6.4368 15.3576 15.7226  8.5207 12.0493  7.3320  4.6332  4.0896
        2:  ENSG00000000005.6       TNMD protein_coding  0.0148  0.0262  0.0104  0.0000  0.0131  0.0431  0.0000  0.0000  0.0000  0.0274
        3: ENSG00000000419.13       DPM1 protein_coding 47.2737 35.4827 37.7895 18.0376 31.9880 21.8047 33.9062 28.4055 35.2305 38.7990
        4: ENSG00000000457.14      SCYL3 protein_coding  1.0667  1.5429  1.0221  2.5562  1.7876  1.3778  1.8601  1.4731  1.7868  1.0997
        5: ENSG00000000460.17   C1orf112 protein_coding  1.0218  1.4619  1.2092  3.6312  3.0915  1.0475  0.8116  1.1096  3.5854  1.0718
       ---                                                                                                                             
    60656:  ENSG00000288669.1 AC008763.4 protein_coding  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0203  0.0000  0.0000
    60657:  ENSG00000288670.1 AL592295.6         lncRNA  2.9298  3.4835  1.7472  1.7875  1.8250  1.1271  2.2591  2.1847  1.9592  1.3517
    60658:  ENSG00000288671.1 AC006486.3 protein_coding  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
    60659:  ENSG00000288674.1 AL391628.1 protein_coding  0.0023  0.0000  0.0048  0.0016  0.0060  0.0050  0.0107  0.0084  0.0085  0.0042
    60660:  ENSG00000288675.1 AP006621.6 protein_coding  0.1816  0.2758  0.1551  0.3328  0.3099  0.3220  0.2454  0.2414  0.1583  0.3484
    

    Input (metaMatrix)

    structure(list(file_name = c("6d1ffd0c-ba13-463e-b5c4-20d2dab6294d.rna_seq.augmented_star_gene_counts.tsv", 
    "054bcc02-0463-4a2f-94a7-b268c15a47d4.rna_seq.augmented_star_gene_counts.tsv", 
    "5bf0c22a-d47f-4b0d-9d43-760fcaa1b42c.rna_seq.augmented_star_gene_counts.tsv", 
    "d1c3a993-a9b3-4938-949a-a38793413139.rna_seq.augmented_star_gene_counts.tsv", 
    "d1d934ef-2e6d-4d28-8418-9a0b5ef66c0c.rna_seq.augmented_star_gene_counts.tsv", 
    "eb102931-4d5d-469d-82bb-187c6e81d6fb.rna_seq.augmented_star_gene_counts.tsv", 
    "0269349e-aab4-459d-8a8b-99534c01c90e.rna_seq.augmented_star_gene_counts.tsv", 
    "963cb78b-6480-453f-9f62-d92c205f034a.rna_seq.augmented_star_gene_counts.tsv", 
    "de53dd9c-416a-492f-89bf-a79866fd3f2d.rna_seq.augmented_star_gene_counts.tsv", 
    "b92a6b85-da4b-4ce0-bc1e-fd8908f1edba.rna_seq.augmented_star_gene_counts.tsv"
    ), caseid = c("S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", 
    "S9", "S10")), class = "data.frame", row.names = c(NA, -10L))