runixsplitsplitstackshape

split extra-delimited column from prokka gff table with varying number of entries into new columns with NAs (splitstackshape / R)


I have a file including tab separated and semicolon separated data (a prokka annotation file in .gff format). Unfortunately, the semicolon separated part is not consistent in the number of entries.

Fortunately, though, the leading part after the semicolon (e.g. ID= or gene=) is consistent. I would like to prepare this data as input for R (or within R) without different column numbers or empty fields. These are the first lines from the prokka file, with some columns removed:

A1  contig_10   16  192 ID=PROKKA_00004;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00004;product=hypothetical protein
A1  contig_100  147 353 ID=PROKKA_00036;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00036;product=hypothetical protein
A1  contig_1000 60  434 ID=PROKKA_00892;inference=ab initio prediction:Prodigal:2.6,protein motif:Pfam:PF05893.8;locus_tag=PROKKA_00892;product=Acyl-CoA reductase (LuxC)
A1  contig_10000    132 434 ID=PROKKA_11822;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_11822;product=hypothetical protein
A1  contig_100003   368 784 ID=PROKKA_96005;gene=fusA_29;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A5VR09;locus_tag=PROKKA_96005;product=Elongation factor G
A1  contig_100026   38  355 ID=PROKKA_96016;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_96016;product=hypothetical protein
A1  contig_100027   38  493 ID=PROKKA_96018;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_96018;product=hypothetical protein
A1  contig_100028   121 1131    ID=PROKKA_96019;eC_number=3.1.-.-;gene=rnjA_3;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q45493;locus_tag=PROKKA_96019;product=Ribonuclease J 1
A1  contig_10003    1028    3307    ID=PROKKA_11824;eC_number=1.1.1.40;gene=maeB_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P76558;locus_tag=PROKKA_11824;product=NADP-dependent malic enzyme

Desired output would be:

  V1            V2  V3  V4 eC_number    gene           ID                                                                 inference    locus_tag note                     product
1 A1     contig_10  16 192      <NA>    <NA> PROKKA_00004                                         ab initio prediction:Prodigal:2.6 PROKKA_00004 <NA>        hypothetical protein
2 A1    contig_100 147 353      <NA>    <NA> PROKKA_00036                                         ab initio prediction:Prodigal:2.6 PROKKA_00036 <NA>        hypothetical protein
3 A1   contig_1000  60 434      <NA>    <NA> PROKKA_00892            ab initio prediction:Prodigal:2.6,protein motif:Pfam:PF05893.8 PROKKA_00892 <NA>   Acyl-CoA reductase (LuxC)
4 A1  contig_10000 132 434      <NA>    <NA> PROKKA_11822                                         ab initio prediction:Prodigal:2.6 PROKKA_11822 <NA>        hypothetical protein
5 A1 contig_100003 368 784      <NA> fusA_29 PROKKA_96005 ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A5VR09 PROKKA_96005 <NA>         Elongation factor G

Solution

  • An option using tidyverse and splitstackshape can be achieved. First read the file data using say read.table (with argument sep="\t"). Then split the column V5 using splitstackshape::splitstackshape in to different columns. Now data is ready to be changed to long format and processed.

    library(tidyverse)
    library(splitstackshape)
    
    # If first 4 columns of "textdata" is separated by "multiple spaces" than read it as
    df <- read.table(text = gsub("\\s{2,}","\t",textdata), stringsAsFactors = FALSE, sep = "\t")
    
    # If first 4 columns of "textdata" is separated by "tab" than read it as
    df <- read.table(text = textdata, stringsAsFactors = FALSE, sep = "\t")
    
    
    # Now, process data (Based on feedback from `@crazysantaclaus`)
    df %>% cSplit("V5", sep=";") %>%
      gather(Key, value, -c(V1,V2,V3,V4)) %>% 
      separate(value, c("Col","Value"), sep="=") %>% 
      select(-Key) %>% 
      filter(!(is.na(Col) & is.na(Value))) %>% 
      spread(Col, Value)
    

    Result:

    #     V1            V2  V3  V4     col1           col2                             col3   col4                                     col5    col6
    #1    A1 something_101 789 910 STRING_2 string_integer string with whitespace and:colon STRING string with whitespace and special chars    <NA>
    #2    A1 something_100 123 456 STRING_1           <NA> string with whitespace and:colon STRING string with whitespace and special chars  string
    

    Data:

    textdata <- "A1 something_100   123 456 col1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string
    A1  something_101   789 910 col1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars"
    

    Data#2:

    The 2nd set of data. Initial 4 columns are not separated by \t rather those are separated by multiple spaces

    textdata <- "A1      something_100   123     456     col1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string
    A1      something_101   789     910     col1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars"