rperformancedata.tablebioinformaticsgtfs

Parse GTF File from Gencode


I wrote a script using the data.table package to parse out the last column of a GENCODE gtf file. The column, for those unaware, contains a handful of key-value items separated by a semi-colon for each row. The particular file I'm working with contains ~2.5 million rows. I indexed out the first 100, then the first 1000 rows just to test the script and the output is exactly what I need. However, despite using the set function, the run-time isn't as fast as I expected. It's instantaneous with the first 100 rows, but takes about a minute or two for the first 1000. Here is the script.

#LOAD DATA.TABLE LIBRARY
require(data.table)
#READ GTF ANNOTATION FILE
info <- fread("gencodeAnnotation.gtf")
colnames(info)[9] <- "AdditionalInfo"
info <- info[1:1000]
#CREATE LIST OF 'KEYS' TO PARSE OUT
pars <- as.character(list("gene_id", "gene_type", "gene_status", "gene_name", " level ", "transcript_name", "transcript_id", "transcript_type", "transcript_support_level", "havana_gene"))
#NESTED FOR LOOP TO PARSE KEY-VALUE PAIR
for (i in 1:length(pars)) {
    for (j in 1:nrow(info)) {
        infoRow <- info[,tstrsplit(AdditionalInfo, ';', fixed = T)][j]
        headerCheck <- like(infoRow, pars[i])
        if (any(headerCheck) == TRUE) {
          keyVal <- length(tstrsplit(infoRow[[which(headerCheck == T)]], " ", fixed = T))
          set(info, i = j, j = toupper(pars[i]), value = tstrsplit(infoRow[[which(headerCheck == T)]], " ", fixed = T)[[keyVal]])   
        } else {
          set(info, i = j, j = toupper(pars[i]), value = NA)    
        }
    }
}

As I said before, the output is perfect when tested on the first 100, 1000 rows. Based on the code, it has to loop through all the rows multiplied by the number of columns to add, or the items in pars. My question is, what's missing in my script or what edits can I make to reduce run-time? Here is the link for the gtf file being used: http://www.gencodegenes.org/releases/current.html. It is the first link labeled "Comprehensive gene annotation". Thanks in advance.

SAMPLE OF WHAT EACH ROW LOOKS LIKE:

gene_id ENSG00000223972.5; gene_type transcribed_unprocessed_pseudogene; gene_status KNOWN; gene_name DDX11L1; level 2; havana_gene OTTHUMG00000000961.2; remap_status full_contig; remap_num_mappings 1; remap_target_status overlap;


Solution

  • I find the readGFF function from the bioconductor package rtracklayer quite appropriate here.

    require(rtracklayer)
    system.time(gtf <- readGFF("~/Downloads/gencode.v24.annotation.gtf", version=2L))
    #    user  system elapsed 
    #  34.558   1.541  36.737 
    dim(gtf)
    # [1] 2572840      26
    

    You can also select just the tags you like.

    system.time(gtf_tags <- readGFF("~/Downloads/gencode.v24.annotation.gtf", version=2L, 
          tags = c("gene_id", "transcript_id")))
    #    user  system elapsed 
    #  16.830   0.733  17.780 
    dim(gtf_tags)
    # [1] 2572840      10