I have a file including tab separated and semicolon separated data (a prokka annotation file in .gff format). Unfortunately, the semicolon separated part is not consistent in the number of entries.
Fortunately, though, the leading part after the semicolon (e.g. ID=
or gene=
) is consistent. I would like to prepare this data as input for R (or within R) without different column numbers or empty fields. These are the first lines from the prokka file, with some columns removed:
A1 contig_10 16 192 ID=PROKKA_00004;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00004;product=hypothetical protein
A1 contig_100 147 353 ID=PROKKA_00036;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00036;product=hypothetical protein
A1 contig_1000 60 434 ID=PROKKA_00892;inference=ab initio prediction:Prodigal:2.6,protein motif:Pfam:PF05893.8;locus_tag=PROKKA_00892;product=Acyl-CoA reductase (LuxC)
A1 contig_10000 132 434 ID=PROKKA_11822;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_11822;product=hypothetical protein
A1 contig_100003 368 784 ID=PROKKA_96005;gene=fusA_29;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A5VR09;locus_tag=PROKKA_96005;product=Elongation factor G
A1 contig_100026 38 355 ID=PROKKA_96016;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_96016;product=hypothetical protein
A1 contig_100027 38 493 ID=PROKKA_96018;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_96018;product=hypothetical protein
A1 contig_100028 121 1131 ID=PROKKA_96019;eC_number=3.1.-.-;gene=rnjA_3;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q45493;locus_tag=PROKKA_96019;product=Ribonuclease J 1
A1 contig_10003 1028 3307 ID=PROKKA_11824;eC_number=1.1.1.40;gene=maeB_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P76558;locus_tag=PROKKA_11824;product=NADP-dependent malic enzyme
Desired output would be:
V1 V2 V3 V4 eC_number gene ID inference locus_tag note product
1 A1 contig_10 16 192 <NA> <NA> PROKKA_00004 ab initio prediction:Prodigal:2.6 PROKKA_00004 <NA> hypothetical protein
2 A1 contig_100 147 353 <NA> <NA> PROKKA_00036 ab initio prediction:Prodigal:2.6 PROKKA_00036 <NA> hypothetical protein
3 A1 contig_1000 60 434 <NA> <NA> PROKKA_00892 ab initio prediction:Prodigal:2.6,protein motif:Pfam:PF05893.8 PROKKA_00892 <NA> Acyl-CoA reductase (LuxC)
4 A1 contig_10000 132 434 <NA> <NA> PROKKA_11822 ab initio prediction:Prodigal:2.6 PROKKA_11822 <NA> hypothetical protein
5 A1 contig_100003 368 784 <NA> fusA_29 PROKKA_96005 ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:A5VR09 PROKKA_96005 <NA> Elongation factor G
An option using tidyverse
and splitstackshape
can be achieved. First read the file data using say read.table
(with argument sep="\t"
). Then split the column V5
using splitstackshape::splitstackshape
in to different columns. Now data is ready to be changed to long format and processed.
library(tidyverse)
library(splitstackshape)
# If first 4 columns of "textdata" is separated by "multiple spaces" than read it as
df <- read.table(text = gsub("\\s{2,}","\t",textdata), stringsAsFactors = FALSE, sep = "\t")
# If first 4 columns of "textdata" is separated by "tab" than read it as
df <- read.table(text = textdata, stringsAsFactors = FALSE, sep = "\t")
# Now, process data (Based on feedback from `@crazysantaclaus`)
df %>% cSplit("V5", sep=";") %>%
gather(Key, value, -c(V1,V2,V3,V4)) %>%
separate(value, c("Col","Value"), sep="=") %>%
select(-Key) %>%
filter(!(is.na(Col) & is.na(Value))) %>%
spread(Col, Value)
Result:
# V1 V2 V3 V4 col1 col2 col3 col4 col5 col6
#1 A1 something_101 789 910 STRING_2 string_integer string with whitespace and:colon STRING string with whitespace and special chars <NA>
#2 A1 something_100 123 456 STRING_1 <NA> string with whitespace and:colon STRING string with whitespace and special chars string
Data:
textdata <- "A1 something_100 123 456 col1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string
A1 something_101 789 910 col1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars"
Data#2:
The 2nd set of data. Initial 4 columns are not separated by \t
rather those are separated by multiple spaces
textdata <- "A1 something_100 123 456 col1=STRING_1;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars;col6=string
A1 something_101 789 910 col1=STRING_2;col2=string_integer;col3=string with whitespace and:colon;col4=STRING;col5=string with whitespace and special chars"